HRDAG: Human Rights Data Analysis Group

Core Concepts

HRDAG Notes on Data Analysis Technology and Research

The HRP team is recognized around the world for applying innovative methods and statistical models that bring clarity to complex data. Our results have been used successfully to counter the claims of powerful leaders, and we work hard to ensure that our arguments are as thoroughly tested as existing technology and available resources permit. We have identified seven characteristics of human rights data that researchers need to take into account:

  • Under-registration: No one information system can capture all the information related to every human rights violation in the universe of interest. A data collection project's access may be limited by a number of factors, which include geography, available resources and the populations that are willing to share their knowledge. The information available in any given system is a sample of violations not necessarily representative of the real world.
  • Selection bias: Every data source will have better access to some victims than to others, creating statistical bias. For example, bias can arise from limited access to unreachable or dangerous areas. The most frequent source of bias is that some victims or witnesses will trust the organization collecting data, while others will not.
  • Complexity in a human rights violation or event: Human rights abuses are rarely straightforward. For instance, a database on assassinations may also include information about detentions or torture that victims experienced before they were killed. If all three violation types are of interest to the project, all three violation types must be represented in the database. Collecting information only on the "most important" violation will obscure instances of the other violations, creating significant biases in the analysis.[1]
  • Duplicate reporting: Sources often include several accounts of the same violation because multiple witnesses may have reported on an event. We need to identify which records describe the same events and participants in order not to double- (or triple-) count them. In some projects, we are able to use the duplicated reporting to estimate the total number of violations, including those that were never documented.
  • Source versus judgment: Several sources recounting the same event may provide slightly different versions and details identifying the actors, the crime scene and the nature of the event. Witness interpretations of those acts may differ from those of the organization collecting and subsequently analyzing the information. It is important to keep the original sources distinct from judgments made about the raw information in order to maintain an audit trail. It should be possible to trace each published statistic backward through every transformation and decision that was applied to derive it. Furthermore, contradictions in the source information enable us to assess the underlying uncertainty about the event in social memory. We can repeatedly recalculate an estimate using different versions of the contradictions: if the results remain stable as the underlying data are perturbed, we can conclude that our results are robust to possible variation in the reporting.
  • Data coding and inter-rater reliability (IRR): Data coding is the process of converting unstructured information, such as a narrative testimony, into discrete facts such as the names and roles of actors (victims, witnesses, perpetrators) in crimes, as well as the date and place of each act. Data coding must not discard or distort information. When more than one person is identifying, classifying and counting the elements reported in a qualitative source, their results may differ slightly depending on each individual's interpretation and care in doing the coding. These differences can be quantified by measuring IRR. We give the same source document to several coders and compare their coded outputs: the extent to which the outputs are the same indicates the reliability of the coding process. Assessing and improving data quality helps us to defend the resulting analysis by showing that it is calculated from a consistent application of the classification criteria to the raw data. More on coding here: Controlled Vocabulary.
  • Data security: Despite the importance of information, many human rights organizations lack the resources to preserve their data securely. Much of their information is stored on a single hard disk, often in unencrypted form. Critical documentation is often exposed to viruses, computer theft, accidents, neglect and staff turnover. Furthermore, files may contain sensitive identifying information about victims and perpetrators involved in abuses. If this information were compromised, it could put people at serious risk. An information management system must store data electronically, in multiple copies in multiple locations, to prevent loss due to physical destruction. We encourage partners to encrypt data and to allow only authorized users to access the information.

Our goal is to raise technical standards among those who analyze human rights data so that these seven issues are routinely addressed. If they are not considered from the outset, they can cause serious problems at the analysis and reporting stages of a human rights data analysis project.
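As an illustration of how inter-rater reliability might be quantified, the sketch below computes raw percent agreement and Cohen's kappa for two coders who each assigned one violation type to the same six source documents. The choice of Cohen's kappa, the function names and the example labels are assumptions made for illustration; the text above does not say which IRR statistic is used in practice.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of documents on which the two coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must code the same non-empty set of documents")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(coder_a)
    p_observed = percent_agreement(coder_a, coder_b)
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # Chance agreement: probability both coders pick the same category
    # if each assigned codes at random with their own observed frequencies.
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical codes assigned by two coders to the same six documents.
coder_a = ["killing", "detention", "torture", "killing", "detention", "killing"]
coder_b = ["killing", "detention", "killing", "killing", "torture", "killing"]

agreement = percent_agreement(coder_a, coder_b)  # 4 of 6 documents agree
kappa = cohens_kappa(coder_a, coder_b)
print(f"agreement={agreement:.3f} kappa={kappa:.3f}")
```

Kappa is lower than raw agreement here because both coders use "killing" often, so some agreement would be expected even from careless coding.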

A truthful count of human rights violations demands that every violation be counted once and only once. De-duplication, the process by which we identify multiply-counted cases, ensures that each violation is counted only once. All of our projects, even those small in size and scope, require careful de-duplication.[2] More on de-duplication here: Multiple Systems Estimation.
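One way to turn pairwise match decisions into a count in which each violation appears once is to group all records connected by matches and count the groups. The sketch below does this with a small union-find structure. The record identifiers and the list of matched pairs are hypothetical, and this is an illustration of the idea rather than a description of any production system.

```python
def cluster_records(record_ids, matched_pairs):
    """Group records linked (directly or indirectly) by match decisions."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return list(clusters.values())


# Hypothetical records from three sources (A, B, C) and two match decisions.
records = ["A1", "A2", "B1", "B2", "C1"]
matches = [("A1", "B1"), ("B1", "C1")]

clusters = cluster_records(records, matches)
print(len(records), "records describe", len(clusters), "distinct violations")
```

Because A1 matches B1 and B1 matches C1, all three records fall into one group even though A1 and C1 were never compared directly.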

De-duplication is even more important for analysis of multiple sources because it enables us to use a technique called multiple systems estimation (MSE). MSE uses the identification of multiply-reported violations (overlaps within and among different sources) to estimate the number of violations that were not reported at all. Changes in de-duplication therefore change the observed overlaps and can affect the overall estimate. More on MSE here: Multiple Systems Estimation.
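To make the role of overlaps concrete, the simplest case uses just two de-duplicated lists. If list A documents n_a distinct violations, list B documents n_b, and m violations appear on both, the classic two-list capture-recapture (Lincoln-Petersen) estimate of the total is n_a × n_b / m. The sketch below implements only that two-list special case, under the textbook assumptions that the lists are independent and every violation is equally likely to be reported; it is not the multi-list modeling used in real projects, and the identifiers and counts are invented.

```python
def two_list_estimate(list_a, list_b):
    """Lincoln-Petersen estimate from two de-duplicated lists of violation IDs.

    Returns (estimated_total, estimated_undocumented).
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between lists: the estimate is undefined")
    estimated_total = len(a) * len(b) / overlap
    documented = len(a | b)  # violations seen by at least one list
    return estimated_total, estimated_total - documented


# Hypothetical IDs: list A has 60 violations, list B has 50, 20 are on both.
list_a = range(0, 60)
list_b = range(40, 90)

total, undocumented = two_list_estimate(list_a, list_b)
print(f"estimated total: {total:.0f}, never documented: {undocumented:.0f}")
```

The example also shows why matching quality matters: if five of the 20 true overlaps were missed during de-duplication, the same formula would give 60 × 50 / 15 = 200 instead of 150.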

In the past, the human rights community (including us) has performed de-duplication largely by hand, meaning that identifying overlaps has required individual human expertise, applied to each and every potential match. This process poses numerous problems. First, the number of people and the amount of time and concentration it requires make some large datasets simply impossible to analyze. In addition, criteria for hand-matching are subjective and inconsistently applied. Even with extensive testing, it is impossible to eliminate the different judgments people use in evaluating data, even over the course of a single person's work. Most importantly, hand-matching is not auditable or repeatable. Reviewers cannot do the matching themselves to validate our results, and we cannot do a complete re-match when incorporating new or updated data.

During the past three years, colleagues at Benetech® HRP have made significant technical advances that address these key issues and speed and enhance our analyses. Most importantly, we have developed a usable, reproducible large-scale system for automatically matching duplicated records of the same violation, victim or perpetrator, among different datasets or within the same dataset. We have progressed from rule-based human matching in spreadsheets (1999-2003), to "smart suggestions" in Analyzer (2004-2005), to a prototype matcher (2006-2007), to our current implementation of a full-scale automated matcher suitable for use with large datasets.

Our continued progress is based on our study of scientific advances in academic computer science and statistics, and our technical ability to apply these advances to human rights data. In the last ten years, researchers in academic computer science have developed automatic de-duplication methods of sufficient quality to support our analysis. These new methods apply techniques from machine learning, in which human users "train" computers to derive and apply consistent criteria for determining whether two records match.[3] Over the last year and a half, we have selected and adapted machine-learning techniques for use with data on human rights violations. Developed by Stanford Computer Science Ph.D. student and HRP consultant Jeff Klingner and Benetech Senior Systems Administrator Dr. Scott Weikart, our new system for automated matching derives decision trees from user-labeled example pairs of matching and non-matching records. This system enables us to conduct consistent, high-quality, auditable, scalable and repeatable matching and de-duplication, without hand-inspecting each potential pair of records.
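The idea of deriving a matching rule from labeled example pairs can be sketched in a few lines. The example below learns a one-level decision tree (a "stump") that picks the single comparison feature and threshold that best separate the labeled matches from the non-matches. Everything specific here is an assumption for illustration: the three features, the token-overlap name comparison, the fictitious records and the restriction to one level. A real system would grow deeper trees over many more features and examples.

```python
def pair_features(rec_a, rec_b):
    """Compare two (name, date, place) records and return numeric features."""
    tokens_a = set(rec_a[0].lower().split())
    tokens_b = set(rec_b[0].lower().split())
    union = tokens_a | tokens_b
    name_overlap = len(tokens_a & tokens_b) / len(union) if union else 0.0
    same_date = 1.0 if rec_a[1] == rec_b[1] else 0.0
    same_place = 1.0 if rec_a[2] == rec_b[2] else 0.0
    return [name_overlap, same_date, same_place]


def train_stump(labeled_pairs):
    """Pick the (feature, threshold) split with the fewest training errors.

    A pair is predicted to match when its feature value exceeds the threshold.
    """
    rows = [pair_features(a, b) for a, b, _ in labeled_pairs]
    labels = [is_match for _, _, is_match in labeled_pairs]
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2  # midpoint between observed values
            errors = sum((row[f] > threshold) != label
                         for row, label in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    if best is None:
        raise ValueError("no feature varies across the training pairs")
    return best[1], best[2]


def predict(stump, rec_a, rec_b):
    feature, threshold = stump
    return pair_features(rec_a, rec_b)[feature] > threshold


# Fictitious training pairs labeled by a human reviewer: (record, record, match?)
training = [
    (("Juan Perez Garcia", "1998-03-02", "District 1"), ("Juan Perez", "1998-03-02", "District 1"), True),
    (("Ana Maria Quispe", "1999-07-10", "District 2"), ("Ana Quispe", "1999-07-10", "District 2"), True),
    (("Pedro Huaman", "2000-11-05", "District 3"), ("Pedro Huaman", "2000-11-05", "District 3"), True),
    (("Juan Perez", "1998-03-02", "District 1"), ("Juan Rojas", "1998-03-02", "District 1"), False),
    (("Rosa Flores", "1997-01-15", "District 4"), ("Luis Flores", "2001-05-20", "District 4"), False),
    (("Carlos Mamani", "1996-02-01", "District 5"), ("Elena Torres", "2002-09-09", "District 6"), False),
]

stump = train_stump(training)
new_match = predict(stump, ("Rosa Elena Flores", "1997-01-15", "District 4"),
                    ("Rosa Flores", "1997-01-15", "District 4"))
new_non_match = predict(stump, ("Rosa Flores", "1997-01-15", "District 4"),
                        ("Luis Torres", "1997-01-15", "District 4"))
print(stump, new_match, new_non_match)
```

Because the learned rule is an explicit feature and threshold, a reviewer can read it, rerun it on the same data and obtain the same decisions, which is the auditability and repeatability that hand-matching lacks.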

Both fundamentals and recent technical innovations in MSE originate in mathematical statistics, but HRP draws on a wide variety of disciplines as we seek the most accurate, transparent ways of understanding human rights data. Our MSE advances have included creative applications of Bayesian Model Averaging,[4],[5] and jackknifing,[6],[7] techniques for estimating statistical uncertainties in our data. From demography, HRP has begun researching applied techniques such as the back-projection of mortality.[8] HRP Director Dr. Patrick Ball continues development of techniques to make estimates for ever-smaller disaggregations across time, space, victims' sex and age, alleged perpetrator, and other social dimensions. Perhaps most importantly, HRP applies software engineering best practices in all its ongoing and developing projects, meaning that even the largest projects are manageable by our team, comprehensible to others, and capable of accommodating new data and analyses at a moment's notice.[9]
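For readers unfamiliar with jackknifing, the sketch below shows the generic delete-one version: recompute a statistic with each observation left out in turn, then use the spread of those recomputed values to estimate the statistic's standard error. The function names and the example values are hypothetical, and this generic form is not necessarily the specific procedure used in the reports cited above.

```python
import math


def jackknife_se(observations, statistic):
    """Delete-one jackknife standard error of statistic(observations)."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    # Recompute the statistic n times, omitting one observation each time.
    leave_one_out = [statistic(observations[:i] + observations[i + 1:])
                     for i in range(n)]
    center = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)
    return math.sqrt(variance)


def mean(values):
    return sum(values) / len(values)


# Hypothetical per-stratum values of some estimate.
values = [2, 4, 4, 6, 9]
se = jackknife_se(values, mean)
print(f"mean={mean(values):.2f} jackknife SE={se:.4f}")
```

For the sample mean the jackknife reproduces the familiar standard error (sample standard deviation divided by the square root of n); its practical value is that the same recipe works for statistics with no simple variance formula.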

________________________________________

[1] This complexity is described in Patrick Ball's 1996 handbook Who Did What to Whom? Planning and Implementing a Large-Scale Human Rights Data Project. Available online at http://shr.aaas.org/www/cover.htm. At the request of the UN Office of the High Commissioner for Human Rights in Colombia, the handbook was translated into Spanish in February 2008. Also see http://hrdag.org/resources/human_rights_data.shtml.

[2] Matching and de-duplication are synonyms.

[3] See Herzog, T.N., F.J. Scheuren, and W.E. Winkler (2007) Data Quality and Record Linkage Techniques. Springer; this book cites our Kosovo work as an example of best practice in multiple systems estimation. Also Bilenko, Mikhail, and R.J. Mooney (2003) "On Evaluation and Training-Set Construction for Duplicate Detection." In Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 7-12, Washington, DC. Bilenko's research has greatly influenced us.

[4] For example, see Raftery, Adrian E. (1995) "Bayesian Model Selection in Social Research." Sociological Methodology 25: 111-163.

[5] Our report using BMA is available at http://www.hrdag.org/resources/publications/casanare-missing-report.pdf.

[6] For more on jackknifing, see Wolter, Kirk M. (2003) Introduction to Variance Estimation. Springer.

[7] Our Perú report using jackknifing can be found here: http://shr.aaas.org/peru/aaas_peru_5.pdf

[8] A review of mortality projection techniques can be found here: Lee, Ron (1985) "Inverse Projection and Back Projection: A Critical Appraisal, and Comparative Results for England, 1539 to 1871." Population Studies, 39(2): 233-248.

[9] For example, see Subramaniam, Venkat, and Andy Hunt (2005) Practices of an Agile Developer: Working in the Real World. Pragmatic Programmers; and Spolsky, Joel (2004) Joel on Software. Apress.
