Who Is Missing From the Data?

Structural Zero 09: Sampling Bias

I’m Patrick Ball. I’m a statistician who has spent my life using scientific techniques to seek the truth about state violence, both in the US and in countries around the world.

Last month, we introduced readers to the concept of statistical bias and acknowledged that many of the datasets we work with are affected by some form of bias. Today, I want to drill down on a particular type of selection bias.

Bias is a systematic difference between data (which we collect and can examine directly) and reality (which we can observe only through data). When this difference is a consequence of the data collection process itself, it’s called sampling bias. We bump into sampling bias a lot in our work at HRDAG, for example when people who could be in our sample don’t submit reports: because they fear retaliation by the perpetrators, because they distrust the data project, because they live in an area that’s difficult to reach, or for any number of other reasons. A big part of my job as a statistician is to think about who might be left out of the data.

I confronted the problems of sampling bias in South Africa in 1996 when I worked with the Truth and Reconciliation Commission. The Commission was established in the aftermath of violent clashes that had brought the country to the brink of civil war, as people rose up against apartheid, the political regime that had dominated South Africa since the late 1940s and that excluded nonwhite people from most social opportunities and rights and from owning land and capital. As the violence subsided, scientists and analysts like me were tasked with examining records and testimony to help survivors and the nation understand what had truly happened.

One of the central conflicts of apartheid’s final years occurred in South Africa’s eastern region of KwaZulu-Natal between the African National Congress (ANC) and the Inkatha Freedom Party (IFP). While the ANC defined itself by absolute opposition to apartheid, the IFP’s position was more complicated. The IFP opposed apartheid as a policy, of course, but it explicitly represented people of the Zulu ethnicity. Tracing its origins to the pre-colonial Zulu kingdom, the IFP claimed a distinct political legitimacy that was not solely rooted in anti-apartheid struggle. During the 1980s, the IFP aligned itself with right-wing parties and governments internationally, including the Reagan administration, and at times made tactical agreements with South Africa’s ruling National Party to counter the ANC.

The conflict between the ANC and the IFP was deadly. In addition to frequent street clashes between demonstrating militants, both parties attacked the civilian bases of the other with beatings, looting, and arson. More than 18,000 people died in the inter-party conflict between 1986 and 1994. During this time, the ANC frequently accused the IFP of working with apartheid government police and military intelligence against the ANC, a charge the IFP denied but that was later confirmed in the Commission’s investigations.

Throughout the Commission’s early work in 1996 and 1997, the IFP accused the Commission of being prejudiced against them. IFP leaders refused the Commission’s requests for information and at first they even refused to participate in the Commission’s hearings. Not only that, but the leadership of the IFP discouraged their Zulu constituents from giving statements to us.

This worried me and others in information management roles at the Commission. We knew that the history we were building would be based on the information collected in statements and at hearings. Without the IFP’s stories, the history risked being even more partial than histories inevitably must be.

As the Commission’s statement-taking wound down, senior commissioners warned the IFP that without their statements, their constituents would not be eligible for the compensation payments that the Commission was authorized to recommend. In the last weeks of the Commission’s work, IFP members began to participate in large numbers. They gave thousands of statements. By the end, IFP members had given over 8,000 statements to the Commission, more than one-third of the eventual total.

These statements were invaluable and helped us better understand the nature of violence in South Africa at the time. But I was nonetheless left wondering about what was missing from the data. How would our story have been different without the IFP’s statements? Other, smaller African parties such as the Pan Africanist Congress never encouraged their members to participate in the Commission, which means that relatively few of their stories contributed to the Commission’s conclusions. Would our conclusions have changed if we had had that data? Who else might have distrusted our process or even feared repercussions for participating?

Systems of data collection shape knowledge. Every dataset is produced by a set of choices about technology, language, geography, incentives, and risk. When certain voices don’t appear in the data, their absence can be misinterpreted. We might conclude that these people weren’t affected when the true story could be that they were afraid to speak out. Gaps in the data don’t happen randomly; they reflect fear, power, access, and trust. When people are missing from a dataset, that absence is itself information.

If we ignore those absences, we risk telling incomplete or even misleading stories. We may conclude that violence is rare where reporting is dangerous, or that certain communities were unaffected when, in reality, they were unheard. In this way, sampling bias can quietly reinforce the very inequalities we aim to expose.
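To make that risk concrete, here is a minimal sketch in Python. It is a toy simulation, not an analysis of the Commission’s data: the two regions, their populations, the 10% victimization rate, and the reporting probabilities are all invented for illustration. Both hypothetical regions experience the same underlying rate of violence, but in one of them reporting is dangerous, so far fewer victims come forward.

```python
import random


def simulate_reports(population, true_rate, report_prob, rng):
    """Return (true_victims, reported_victims) for one hypothetical region."""
    true_victims = 0
    reported = 0
    for _ in range(population):
        # Each person is a victim with probability true_rate.
        if rng.random() < true_rate:
            true_victims += 1
            # A victim appears in the data only if they report.
            if rng.random() < report_prob:
                reported += 1
    return true_victims, reported


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# All numbers below are assumptions chosen for illustration only.
regions = {
    "safe_to_report": {"population": 10_000, "true_rate": 0.10, "report_prob": 0.80},
    "dangerous_to_report": {"population": 10_000, "true_rate": 0.10, "report_prob": 0.20},
}

results = {}
for name, params in regions.items():
    true_victims, reported = simulate_reports(rng=rng, **params)
    results[name] = {"true": true_victims, "reported": reported}
    print(f"{name}: true victims = {true_victims}, reported = {reported}")
```

Someone who looked only at the reported counts would conclude that the second region was far less violent, even though the simulated reality is nearly identical in both. The data alone cannot distinguish “less violence” from “less reporting.”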

So the task is not only to analyze the data we have, but to interrogate it. Who is missing? Who was unable—or unwilling—to participate? What barriers shaped the dataset before the analysis even began?

In our next edition of Structural Zero, we’ll explore the approaches statisticians use to account for these gaps. But the first step is conceptual: recognizing that every dataset carries the imprint of how it was collected.

This article was written by Patrick Ball, Director of Research for the Human Rights Data Analysis Group (HRDAG), a nonprofit organization using scientific data analysis to shed light on human rights violations. You can also follow us on Bluesky, Mastodon, and LinkedIn.

Structural Zero is a free monthly newsletter that explores what scientific and mathematical concepts teach us about the past and the present. Appropriate for scientists as well as anyone who is curious about how statistics can help us understand the world, Structural Zero is written by five data scientists and edited by Rainey Reitman.

If you get value out of these articles, please support us by subscribing, telling your friends about the newsletter, and recommending Structural Zero.

Image: Original art by Hanna Barakat & Archival Images of AI + AIxDESIGN / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/ Design edited by David Peters of HRDAG.

Our work has been used by truth commissions, international criminal tribunals, and non-governmental human rights organizations. We have worked with partners on projects on five continents.
