HRDAG Notes on Data Analysis Technology and Research
The HRP team is recognized around the world for its
application of innovative methods and statistical models that bring
clarity to complex data. Our outcomes have been used successfully
to counter the claims of powerful leaders, and we work hard to ensure
that our arguments are as thoroughly tested as existing technology
and available resources permit. We have identified seven issues
that human rights researchers need to take into account:
- Under-registration: No one information system can capture all
the information related to every human rights violation in the
universe of interest. A data collection project's access may be
limited by a number of factors, which include geography, available
resources and the populations that are willing to share their
knowledge. The information available in any given system is a
sample of violations not necessarily representative of the real
world.
- Selection bias: Every data source will have better access to
some victims than to others, creating statistical bias. For example,
bias can arise from limited access to unreachable or dangerous
areas. The most frequent source of bias is that some victims or
witnesses will trust the organization collecting data, while others
will not.
- Complexity in a human rights violation or event: Human rights
abuses are rarely straightforward. For instance, a database on
assassinations may also include information about detentions or
torture that victims experienced before they were killed. If all
three violation types are of interest to the project, all three
violation types must be represented in the database. Collecting
information only on the "most important" violation will obscure
instances of the other violations, creating significant biases
in the analysis.[1]
- Duplicate reporting: Sources often include several accounts
of the same violations because multiple witnesses may have reported
on an event. We need to identify which records describe the same
events and participants in order not to double- (or triple-) count
them. In some projects, we are able to use the duplicated reporting
to estimate the total violations, including those that were never
documented.
- Source versus judgment: Several sources recounting the same
event may provide slightly different versions and details identifying
the actors, the crime scene and the nature of the event. Witness
interpretations of those acts may differ from those of the organization
collecting and subsequently analyzing the information. It is important
to keep the original sources distinct from judgments made
about the raw information in order to maintain an audit trail.
It should be possible to trace each published statistic
backward through every transformation and decision that was applied
to derive it. Furthermore, contradictions in the source information
enable us to assess the underlying uncertainty about the event
in social memory. We can repeatedly recalculate an estimate using
different versions of the contradictions: if the results remain
stable as the underlying data are perturbed, we can conclude that
our results are robust to possible variation in the reporting.
- Data coding and inter-rater reliability (IRR): Data coding is
the process of converting unstructured information, such as a
narrative testimony, into discrete facts such as names and roles
of actors (victims, witnesses, perpetrators) in crimes, as well
as the date and place of the act. Data coding must not discard or
distort information. When more than one person is identifying,
classifying and counting the elements reported in a qualitative
source, the results of what they find may differ slightly based
on each individual's interpretation and care in doing the coding.
These differences can be quantified by measuring IRR. We give the
same source document to several coders and compare their coded
outputs: the extent to which the outputs are the same indicates
the reliability of the coding process. Assessing and improving
data quality helps us to defend the resulting analysis by showing
that the analysis is calculated from a consistent application
of the classification criteria to the raw data. More on coding
here: Controlled Vocabulary.
- Data security: Despite the importance of information, many human
rights organizations lack the resources to preserve their data
securely. Much of their information is stored on a single hard
disk, often in unencrypted form. Critical documentation is
often subject to viruses, computer theft, accidents, neglect and
staff turnover. Furthermore, files may contain sensitive identifying
information about victims and perpetrators involved in abuses.
If this information were to be compromised, it could put people
at serious risk. An information management system must store data
electronically, in multiple copies in multiple locations to prevent
loss due to physical destruction. We encourage partners to encrypt
data and allow only authorized users to access the information.
Our goal is to raise the technical standards among
those working to analyze human rights to include addressing these
seven issues. If these issues are not considered from the outset,
they can cause serious problems at the data analysis and reporting
stages of a human rights data analysis project.
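The IRR comparison described in the list above can be illustrated with a small sketch. It assumes the simplest case of two coders who each assign one violation type to the same set of source documents, and it uses percent agreement and Cohen's kappa as the agreement measures. Both the measures and the example labels are assumptions made for illustration; the text above does not say which statistic is used in practice, and designs with more than two coders call for other measures.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of documents to which both coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of documents")
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement between two coders, corrected for agreement expected by chance."""
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: probability that both coders pick the same code
    # if each coded at random according to their own code frequencies.
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to the same six testimonies.
coder_1 = ["killing", "detention", "torture", "killing", "detention", "killing"]
coder_2 = ["killing", "detention", "killing", "killing", "torture", "killing"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

In this invented example the coders agree on four of six documents, but kappa is noticeably lower than the raw agreement because both coders use "killing" often and would agree on it fairly frequently by chance.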
A truthful count of human rights violations demands
that every violation be counted once and only once. De-duplication,
the process by which we identify records that describe the same case, ensures
that no violation is counted more than once. All of our projects, even
those small in size and scope, require careful de-duplication.[2]
More on de-duplication here: Multiple
Systems Estimation.
De-duplication is even more important for analysis
of multiple sources because it enables us to use a technique called
multiple systems estimation (MSE). MSE uses identification of multiply-reported
violations - overlaps within and among different sources - to estimate
the number of violations that were not reported at all. Thus, changes
in de-duplication can affect the overall estimate. More on MSE here: Multiple Systems Estimation.
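To make the link between overlaps and the estimate concrete, here is a minimal sketch of the simplest possible case: two lists and the classical Lincoln-Petersen capture-recapture estimator. This is an illustration, not the models used in our projects. It assumes the two lists are independent and that every victim is equally likely to be reported, assumptions that estimation with three or more lists is designed to relax. The function name and the data are hypothetical.

```python
def two_list_estimate(list_a, list_b):
    """Lincoln-Petersen estimate from two de-duplicated lists of victim IDs.

    Returns (estimated_total, estimated_undocumented).
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)  # victims reported in both lists
    if overlap == 0:
        raise ValueError("the lists share no records; the estimate is undefined")
    estimated_total = len(a) * len(b) / overlap
    documented = len(a | b)  # victims reported in at least one list
    return estimated_total, estimated_total - documented


# Hypothetical data: each ID stands for one victim after matching.
list_a = list(range(1, 61))   # 60 victims
list_b = list(range(41, 91))  # 50 victims, 20 of them also in list_a
total, undocumented = two_list_estimate(list_a, list_b)

# The same lists, but five true matches were missed during de-duplication,
# so those five victims appear in list_b under IDs that do not match list_a.
list_b_missed = [f"unmatched-{i}" if i <= 45 else i for i in range(41, 91)]
total_missed, _ = two_list_estimate(list_a, list_b_missed)

print(total, undocumented, total_missed)
```

With the full overlap of 20 the sketch estimates 150 victims, 60 of them undocumented. Missing only five matches shrinks the overlap to 15 and raises the estimate to 200, which is why matching decisions need to be consistent and auditable.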
In the past, the human rights community (including
us) has performed de-duplication largely by hand, meaning that
identifying overlaps has required individual human expertise,
applied to each and every potential match. This process poses
numerous problems. Among them are the number of people and the amount of time and
concentration it requires, which make some large datasets simply impossible
to analyze. In addition, criteria for hand-matching are subjective
and inconsistently applied. Even with extensive testing, it is
impossible to eliminate the different judgments people use in
evaluating data, even over the course of a single person's work.
Most importantly, hand-matching is not auditable or repeatable.
Reviewers cannot do the matching themselves to validate our results,
and we cannot do a complete re-match when incorporating new or
updated data. During the past three years, colleagues at Benetech® HRP
have made significant technical advances that address these key
issues and speed and enhance our analyses. Most importantly, we
have developed a usable, reproducible large-scale system for automatically
matching duplicated records of the same violation, victim or perpetrator,
among different datasets or within the same dataset. We have progressed
from rule-based human matching in spreadsheets (1999-2003), to "smart
suggestions" in Analyzer (2004-2005), to
a prototype matcher (2006-2007), to our current implementation
of a full-scale automated matcher suitable for use with large
datasets.
Our continued progress is based on our study of scientific
advances in academic computer science and statistics, and our technical
ability to apply these advances to human rights data. In the last
ten years, researchers in academic computer science have developed
automatic de-duplication methods of sufficient quality to support
our analysis. These new methods apply techniques from machine learning,
in which human users "train" computers to derive and apply consistent
criteria for determining whether two records match.[3] Over the
last year and a half, we have selected and adapted machine-learning
techniques for use with data on human rights violations. Developed
by Stanford Computer Science Ph.D. student and HRP consultant Jeff
Klingner and Benetech Senior Systems Administrator Dr. Scott Weikart,
our new system for automated matching derives decision trees from
user-labeled example pairs of matching and non-matching records.
This system enables us to conduct consistent, high-quality, auditable,
scalable and repeatable matching and de-duplication, without hand-inspecting
each potential pair of records.
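The idea of learning matching criteria from labeled pairs can be sketched in a few lines. The example below is not the system described above: it learns a decision stump, the one-level special case of a decision tree, from a handful of invented record pairs. The record fields, the three comparison features and the use of Python's difflib for name similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def rec(name, place, year):
    """Build a hypothetical record."""
    return {"name": name, "place": place, "year": year}


def pair_features(r1, r2):
    """Turn a pair of records into numeric comparison features."""
    return {
        "name_sim": SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio(),
        "same_place": float(r1["place"] == r2["place"]),
        "year_gap": float(abs(r1["year"] - r2["year"])),
    }


def train_stump(examples):
    """Pick the single feature/threshold rule with the fewest training errors.

    examples: list of (features, is_match). Returns (feature, cut, match_if_above).
    """
    best = None
    for feat in examples[0][0]:
        values = sorted({f[feat] for f, _ in examples})
        for cut in [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]:
            for match_if_above in (True, False):
                errors = sum(((f[feat] > cut) == match_if_above) != label
                             for f, label in examples)
                if best is None or errors < best[0]:
                    best = (errors, feat, cut, match_if_above)
    if best is None:
        raise ValueError("no feature varies across the examples")
    return best[1:]


def predict(stump, features):
    """Apply the learned rule to one pair's features."""
    feat, cut, match_if_above = stump
    return (features[feat] > cut) == match_if_above


# Hand-labeled example pairs (invented): True means "same person".
labeled_pairs = [
    (rec("Maria Lopez", "District A", 1984), rec("Maria Lopes", "District A", 1984), True),
    (rec("Juan Perez", "District B", 1990), rec("Juan Peres", "District B", 1991), True),
    (rec("Jose Garcia", "District C", 1987), rec("Jose Garcia", "District D", 1987), True),
    (rec("Maria Lopez", "District A", 1984), rec("Pedro Ruiz", "District A", 1984), False),
    (rec("Juan Perez", "District B", 1990), rec("Carlos Mendoza", "District B", 1985), False),
    (rec("Jose Garcia", "District C", 1987), rec("Rosa Jimenez", "District C", 1987), False),
]
examples = [(pair_features(a, b), label) for a, b, label in labeled_pairs]
stump = train_stump(examples)

# Apply the learned rule to pairs that were never hand-inspected.
likely_match = predict(stump, pair_features(
    rec("Ana Torres", "District E", 1992), rec("Ana Torrez", "District E", 1992)))
likely_non_match = predict(stump, pair_features(
    rec("Ana Torres", "District E", 1992), rec("Luis Gomez", "District E", 1992)))
print(stump[0], likely_match, likely_non_match)
```

Because the learned rule is explicit (a feature, a threshold and a direction), it can be reported, reviewed and re-run on new data, which is what makes this style of matching auditable and repeatable. A full decision tree applies the same search recursively to the subsets on each side of a split.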
Both fundamentals and recent technical innovations
in MSE originate in mathematical statistics, but HRP draws on a
wide variety of disciplines as we seek the most accurate, transparent
ways of understanding human rights data. Our MSE advances have included
creative applications of Bayesian Model Averaging,[4],[5]
and jackknifing,[6],[7] techniques
for estimating statistical uncertainties in our data. From demography,
HRP has begun researching applied techniques such as the back-projection
of mortality.[8] HRP Director Dr. Patrick Ball continues development
of techniques to make estimates for ever-smaller disaggregations
across time, space, victims' sex and age, alleged perpetrator, and
other social dimensions. Perhaps most importantly, HRP applies software
engineering best practices in all its ongoing and developing projects,
meaning that even the largest projects are manageable by our team,
comprehensible to others, and capable of accommodating new data
and analyses at a moment's notice.[9]
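As a minimal sketch of the jackknife idea mentioned above, the following code computes a delete-one jackknife standard error for an arbitrary estimator. It is a generic textbook form, not the procedure used in our reports, which the text above does not specify. The data values and the choice of the mean as the estimator are assumptions made for illustration.

```python
import math


def jackknife_standard_error(values, estimator):
    """Delete-one jackknife: re-estimate with each observation left out in turn.

    Returns (estimate_on_all_data, jackknife_standard_error).
    """
    n = len(values)
    if n < 2:
        raise ValueError("the jackknife needs at least two observations")
    leave_one_out = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)
    return estimator(values), math.sqrt(variance)


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical observations, for example counts from eight strata.
counts = [2, 4, 4, 4, 5, 5, 7, 9]
estimate, std_error = jackknife_standard_error(counts, mean)
print(f"estimate={estimate:.2f} se={std_error:.3f}")
```

For the mean, the jackknife standard error coincides with the familiar sample standard deviation divided by the square root of n, which makes this a convenient check. The same function accepts any estimator that maps a list of observations to a number.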
________________________________________
[1] This complexity is described
in Patrick Ball's 1996 handbook Who Did What to Whom? Planning and Implementing
a Large-Scale Human Rights Data Project. Available online at http://shr.aaas.org/www/cover.htm.
At the request of the UN Office of the High Commissioner for Human
Rights in Colombia, the handbook was translated into Spanish in
February 2008. Also see http://hrdag.org/resources/human_rights_data.shtml.
[2] Matching and de-duplication are synonyms.
[3] See Herzog, T.N., F.J.
Scheuren, and W.E. Winkler (2007) Data Quality and Record Linkage
Techniques. Springer; this book cites our Kosovo work as an example
of best practice in multiple systems estimation. Also Bilenko, Mikhail,
and R.J. Mooney (2003) "On Evaluation and Training-Set Construction
for Duplicate Detection." In Proceedings of the KDD-2003 Workshop
on Data Cleaning, Record Linkage, and Object Consolidation, pp. 7-12,
Washington, DC. Bilenko's research has greatly influenced us.
[4] For example, see Raftery, Adrian E. (1995) "Bayesian
Model Selection in Social Research." Sociological Methodology 25:
111-163.
[5] Our report using BMA is available at http://www.hrdag.org/resources/publications/casanare-missing-report.pdf.
[6] For more on jackknifing, see Wolter, Kirk M. (2003)
Introduction to Variance Estimation. Springer.
[7] Our Perú report using jackknifing can be found
here: http://shr.aaas.org/peru/aaas_peru_5.pdf
[8] A review of mortality projection techniques can
be found here: Lee, Ron (1985) "Inverse Projection and Back Projection:
A Critical Appraisal, and Comparative Results for England, 1539
to 1871." Population Studies, 39(2):233-248.
[9] For example, see Subramaniam, Venkat and Andy
Hunt (2005) Practices of an Agile Developer: Working in the Real
World. Pragmatic Programmers. Also Spolsky, Joel (2004) Joel on Software.
Apress.