When Data Doesn’t Tell the Whole Story
This post is part of International Justice Monitor’s technology for truth series, which focuses on the use of technology for evidence and features views from key proponents in the field.
As other posts in this series have highlighted, emerging technology is increasing the amount and type of information available, in some contexts, to criminal and other investigations. Much of what these technologies produce (Facebook posts, tweets, YouTube videos, text messages) falls into a category we refer to as “found” data: data generated not for a specific investigation but for some unrelated purpose.
Such data are not new or specific to emerging technology: examples include bureaucratic and administrative records, such as the Historic Archive of the National Police in Guatemala, documents generated by the Documentation and Security Directorate (DDS) in Chad, border crossing records kept by the Albanian border guards during the 1999 conflict between Yugoslavia and NATO, and cemetery registries in Colombia.
Data initially collected for different purposes, or perhaps generated with no specific purpose in mind, have long provided starting points for investigations by truth commissions, legal teams, inquiries by the United Nations, journalists, and other organizations. Emerging technology is expanding the list of possible sources of this valuable data even further. However, many of the challenges and uses of such data remain the same, whether the data are physical (e.g., archived paper documents) or digital (e.g., YouTube videos).
One of the primary challenges posed by “found” data is simply preservation – photocopying border crossing records in Albania, taking digital photographs of decaying documents in Chad, or archiving YouTube videos and Facebook posts.
Once the data are preserved, criminal investigators need processes both to sift through large amounts of data to identify the information most pertinent to their cases and to verify the authenticity of that information. Several organizations are tackling this problem, as described in a report by the Human Rights Center at the University of California, Berkeley.
In addition to the potentially unique challenges of preservation and of meeting chain-of-custody requirements, data generated by emerging technology pose many of the same challenges as traditional data sources: how the data were collected, and what information the data include and exclude.
The experience of the Human Rights Data Analysis Group (HRDAG) over the past two decades in more than 30 countries has taught us that most data sources are incomplete, regardless of how systematically, carefully, or technologically they may have been collected. This is not to question the validity of any particular set of data, but rather to recognize that raw data are merely a starting point. Data must be contextualized with qualitative information and analyzed using appropriate statistical techniques that take into account the way in which the data were collected and what may be missing from them.
For example, regardless of the data source, researchers should always ask themselves, “Whose stories are not captured by this source? How are those stories, victims, and witnesses likely to differ from the ones documented by this source?”
Field experts can address these questions qualitatively. Recent research in Syria has indicated that ongoing documentation efforts are less able to reach certain geographic areas, sometimes due to lack of infrastructure (no electricity or cell phone coverage) and other times due to which group has control of a region and the effect this has on the relationship between community members and data collectors.
Statistical analyses, under certain conditions, can directly estimate the amount of information missing from raw data. Our own work at HRDAG focuses primarily on a class of methods called multiple systems estimation, which uses the overlaps between several incomplete lists of human rights violations to estimate the total number of violations, including those that appear on no list. There are also a variety of other analytical techniques designed to account for and adjust for missing data.
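To give a rough sense of how overlaps between lists can point to what is missing, here is a minimal sketch of the simplest two-list case, the classic Lincoln-Petersen capture-recapture estimator. It is an illustration only, not a description of HRDAG's actual models: multiple systems estimation in practice typically works with more than two lists and has to handle dependence between them, which this sketch does not. The function name, the record identifiers, and the two lists are all hypothetical, and the sketch assumes the lists are independent, that every violation is equally likely to be recorded on a given list, and that records have already been matched across lists without error.

```python
def lincoln_petersen(list_a, list_b):
    """Two-list estimate of the total number of cases.

    Assumes (hypothetically) independent lists, equal recording
    probability for every case, and error-free record matching.
    """
    a = set(list_a)
    b = set(list_b)
    overlap = len(a & b)  # cases recorded on both lists
    if overlap == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    # If list B captures the same share of all cases as it captures of
    # list A, then total is approximately |A| * |B| / |A and B|.
    return len(a) * len(b) / overlap


# Hypothetical record identifiers, invented purely for illustration.
list_a = {"v01", "v02", "v03", "v04", "v05", "v06"}
list_b = {"v04", "v05", "v06", "v07", "v08"}

documented = len(list_a | list_b)            # cases seen on at least one list
estimated_total = lincoln_petersen(list_a, list_b)

print("documented:", documented)
print("estimated total:", estimated_total)
print("estimated undocumented:", estimated_total - documented)
```

In this toy example the two lists together document eight cases, while the overlap suggests a total of about ten, so roughly two cases would appear on neither list. When the assumptions above fail, for instance when the same victims are more visible to every documentation group, a two-list estimate like this one can be badly biased, which is part of why more lists and more careful modeling are needed.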
This poses its own challenges, as not only do such analyses require experts trained in appropriate methodology, but if such analyses are to be presented in court they also require the experts to be able to adequately explain potentially complex methods to a not-necessarily-technical audience of lawyers and judges. These challenges must be met, as quantitative analyses that fail to account for missing data tell only part of the story, at best, and at worst may draw precisely the wrong conclusions.