All is Not Lost with Incomplete Datasets
Structural Zero 10: What is a “structural zero”?
I’m Maria Gargiulo. I started with HRDAG as an intern in 2018 and I’ve worked with the team as a statistician since 2020. I’m currently earning my doctoral degree from the London School of Hygiene and Tropical Medicine where my thesis explores the very statistical methods HRDAG uses to study and understand complex human rights violations around the world.
Recently, we wrote about the challenge of statistical bias: the data that is collected is seldom complete and may not be representative of the population we are trying to study. There are many types of statistical biases that can occur. We unpacked sampling bias, which occurs when a sample is not representative of the population due to how it was collected.
As a result of these statistical biases, some members of the population of interest (in our work, victims of human rights violations) are never documented. We know neither their names nor their stories and, by looking at the documented data alone, we can’t even know how many victims might be excluded.
These victims whose stories have not been recorded are structural zeros, the namesake of this newsletter. These missing records are considered “zeros” because they represent absences in the observed data and “structural” because they cannot be known from the documented data alone. As statisticians studying human rights violations, one of the central tasks of our work is accounting for these structural zeros. Failing to do so would risk misstating the true impacts of violence.
One of the main tools we use to address structural zeros is an approach called multiple systems estimation (MSE), also known as capture-recapture. First used by John Graunt in the 1600s to estimate mortality due to bubonic plague in England, MSE uses multiple incomplete data sources to make inferences about a population of interest. To do this, MSE methods use information about overlaps between data sources, something we can observe after carrying out deduplication or record linkage, a topic we’ll cover in a future newsletter.
What does this look like in practice? Imagine we have four different data sources, each describing individual, named victims of human rights abuses covering the same time period and geography. When we compare the data sources we find that they have some overlap: some victims are documented in all four sources, some in just three, some in two, and others in just a single source. However, not all instances of violence are recorded. The human rights violations which occurred but that were not documented by any of our sources are the structural zeros. The information about which source or combination of sources documented the observed victims is used in MSE models to estimate the size of the population of victims who weren’t documented by any of our available sources.
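To make the idea of overlap concrete, here is a minimal sketch in Python of how the "which sources documented this victim" information might be tabulated. The four source names and the victim IDs are invented for illustration, and the sketch assumes the records have already been deduplicated and linked so that each victim has a single ID. It is not HRDAG's actual pipeline.

```python
from collections import Counter

# Hypothetical, already-linked victim IDs per source (illustrative data only).
sources = {
    "A": {1, 2, 3, 4, 5},
    "B": {2, 3, 6},
    "C": {3, 4, 6, 7},
    "D": {3, 8},
}

def capture_patterns(sources):
    """Count how many victims fall into each combination of sources.

    Each pattern is a tuple of 0/1 flags, one per source in sorted name order:
    (1, 0, 1, 0) means "documented by A and C, but not by B or D".
    """
    names = sorted(sources)
    everyone = set().union(*sources.values())
    patterns = Counter()
    for victim in everyone:
        pattern = tuple(int(victim in sources[name]) for name in names)
        patterns[pattern] += 1
    return names, patterns

names, patterns = capture_patterns(sources)
for pattern, count in sorted(patterns.items(), reverse=True):
    print(dict(zip(names, pattern)), count)
```

Every pattern in the table has at least one 1 in it. The all-zeros pattern, victims documented by no source, can never appear in the observed data, and that empty cell is the structural zero that MSE models try to estimate.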
When we add this information to the information about the total number of unique victims we have documented, we can calculate an estimate of the total victim population. This estimate includes those victims whose stories have not yet been recorded and who would have otherwise been excluded from an analysis of the documented data alone.
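The arithmetic can be illustrated with the simplest possible case: two sources and the classic Lincoln-Petersen estimator. This is a sketch under strong assumptions (the two sources document victims independently of one another, every victim is equally likely to be documented, and record linkage is perfect), and the counts are invented. Analyses with more than two sources, like the ones described in this newsletter, rely on more flexible models, so this should be read as an illustration of the logic rather than the method used in practice.

```python
def lincoln_petersen(n_a: int, n_b: int, n_both: int) -> float:
    """Two-source capture-recapture estimate of the total population size.

    n_a    -- victims documented by source A
    n_b    -- victims documented by source B
    n_both -- victims documented by both sources (the overlap)
    """
    if n_both == 0:
        raise ValueError("no overlap between sources: the estimator is undefined")
    return n_a * n_b / n_both

# Invented counts, for illustration only.
n_a, n_b, n_both = 60, 40, 20

documented = n_a + n_b - n_both                       # unique victims observed
estimated_total = lincoln_petersen(n_a, n_b, n_both)  # observed plus unobserved
estimated_undocumented = estimated_total - documented

print(documented, estimated_total, estimated_undocumented)  # 80 120.0 40.0
```

The intuition is that the share of source B's victims who also appear in source A (20 of 40, or half) is taken as an estimate of the share of all victims that source A managed to document. If source A's 60 victims are half of the total, the total is about 120, of whom 40 were documented by neither source.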
This might seem abstract, so let me offer a concrete example from HRDAG’s work. Between 2020 and 2022, we collaborated with the Jurisdicción Especial para la Paz (JEP) and the Comisión para el Esclarecimiento de la Verdad, la Convivencia y la No Repetición (CEV; the Colombian Truth Commission) to study human rights violations during the armed conflict in Colombia. A goal of truth commissions is, as the name suggests, to provide an account of the truth of what happened. However, it’s challenging to do that when not all victims’ stories have been told and the existing data is statistically biased.
During this project, we used statistical methods, including MSE, to address missing data challenges. We sought to develop the most rigorous statistics that are technically possible about victimization due to disappearances, forced recruitment, homicides, and kidnappings. To do this, we processed over 12.5 million records of victims, contained in 112 databases, shared with the project by 44 organizations.
That is a vast amount of data. Indeed, the truth commission in Colombia is the largest human rights data project to date, highlighting the amount of effort put into data collection by organizations and data analysis by our 20-person team. At the same time, the size of the project underscores the extensive damage the conflict has done, and continues to do, to the country.
After correcting for missing data using MSE and other methods, we estimated that the four different types of violence we studied were likely underreported by between 40% and 55% on average. Without our statistical analysis using MSE, these victims’ stories would have been excluded entirely from our retelling of the violence that occurred during the armed conflict.
Incomplete and unrepresentative data can feel like a dead end, but it doesn’t have to be. In research on armed conflict, imperfect data is frequently the only data that exists. Learning to acknowledge the limitations of this type of data and to work with it responsibly is our imperative as statisticians and data scientists analyzing human rights violations. Conversely, it would be irresponsible to make sweeping claims about the world based on statistically biased data without first using tools like MSE to help us understand and correct for what’s missing.
This matters in human rights work because so often our data sources, however incomplete, were painstakingly collected and maintained over years. Those stewarding the data often faced great difficulty and personal risk to obtain the stories and reports used in our analyses. One way we can honor the labor that went into the production of human rights data is to responsibly use it in the pursuit of truth. Multiple systems estimation is a powerful tool towards that end.
Learn more about MSE and our work on Colombia:
Multiple Systems Estimation: The Basics
Multiple Systems Estimation: Collection, Cleaning and Canonicalization of Data
Multiple Systems Estimation: The Matching Process
Multiple Systems Estimation: Stratification and Estimation
Multiple Systems Estimation: Does it Really Work?
JEP-CEV-HRDAG Joint Project on human rights violations in Colombia
And if you liked this newsletter, please share it with a friend. Thanks for supporting our work,
MG
Maria Gargiulo
This article was written by Maria Gargiulo, statistician for the Human Rights Data Analysis Group (HRDAG), a nonprofit organization using scientific data analysis to shed light on human rights violations.
Structural Zero is a free monthly newsletter that helps explore what scientific and mathematical concepts teach us about the past and the present. Appropriate for scientists as well as anyone who is curious about how statistics can help us understand the world, Structural Zero is edited by Rainey Reitman and written by 4 data scientists who use their skills in support of human rights. Subscribe today to get our next installment. You can also follow us on Bluesky, Mastodon, LinkedIn, and Threads.
Image: Amritha R Warrier & AI4Media / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/ Image modified by HRDAG

