Yes, Your Dataset is Probably Biased
Structural Zero 08: What statisticians mean by bias—and why it matters
My colleagues at HRDAG and I are a little obsessed with the concept of truth.
And today, finding the truth can feel like a Sisyphean task: online social bubbles contribute to intellectual isolation; the disappearance of local and nonprofit news outlets is creating news deserts; and AI can now generate images and content (including scientific-sounding papers) that people struggle to distinguish from human-created content.
In the face of all of this, it’s harder than ever to know what is true.
That’s what’s so refreshing and powerful about the field of statistics. As statisticians, we bring skepticism to every dataset and every conclusion. When we draw conclusions about a research topic, we offer up our evidence, reasoning, and calculations for review. Most importantly, we talk about the degree of confidence we have in our findings and how we can use scientific models to improve that confidence.
We accept that there is often uncertainty in our findings, and then seek to measure that uncertainty and be ruthlessly transparent about it.
Rather than make truth unachievable, this approach helps statisticians have a clearer understanding of the truth we are seeking with our research. It’s a nuanced, grounded, science-based relationship to truth in a world in which facts and reality are too often contested.
We’re hoping this approach can spread and that many people, including those with no background in statistics, will start to use the reasoning and skepticism of statisticians when considering facts and numbers they encounter in their everyday lives. Our goal is not to create suspicion but to arm people with tools to better discern what is real whenever they encounter conclusions about data. That’s why we started our newsletter Structural Zero, to act as a bridge between the statistical ideas that inform our work and everyday people.
Today I want to talk about one of the most foundational concepts in statistics: bias, which has a very specific scientific meaning when we use it.
But before I do that, I want to zoom out and explain what I mean when I talk about statistics.
Statistics refers to the study of data. Data can show up in countless forms: traffic records collected by a toll booth, voting records, medical records, interviews conducted in refugee camps—any type of data that can be collected directly or indirectly about our question of interest is fair game for statisticians.
Whenever we are using statistics to study the world or examine a research topic, the thing we are studying is called the parameter. Sometimes we call this the population parameter, to be explicit that it refers to or is based on the entire population we are studying. For example, if we are researching people in San Francisco, we might want to know things like the average age, the gender or racial distribution, or the percent of dog owners. If these are calculated by collecting information about everyone in San Francisco, then these are population parameters.
But typically we don’t collect information about everyone, because it is too expensive, too time-consuming, or too logistically challenging. What we collect instead is a sample: a subset of the population we are interested in. We may, for example, collect records for our research project from 10% of San Francisco residents. That’s a sample of our population. Calculations based on this sample are referred to as “sample statistics.”
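To make the parameter/statistic distinction concrete, here is a minimal sketch in Python. The population of ages and the 10% sampling rate are invented for illustration; they are not real San Francisco data.

```python
import random

random.seed(42)

# Hypothetical population: ages of 100,000 residents (illustrative numbers only).
population = [random.gauss(38, 12) for _ in range(100_000)]

# The population parameter: the true mean age, computed from everyone.
parameter = sum(population) / len(population)

# A 10% simple random sample, and the sample statistic computed from it.
sample = random.sample(population, k=10_000)
statistic = sum(sample) / len(sample)

print(f"population parameter (mean age): {parameter:.2f}")
print(f"sample statistic     (mean age): {statistic:.2f}")
```

With a well-drawn random sample, the sample statistic lands close to the population parameter; the gap between the two is exactly the uncertainty statisticians try to measure.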
The crucial bit of magic that makes the entire field of statistics work is understanding what the sample statistic is telling you about the population parameter.
Sometimes, this is very straightforward. For example, if our sample data is identical to our population, then our sample statistic is equivalent to a population parameter because we know we are dealing with all the possible data. When the sample dataset contains the total population, we refer to it as a census. HRDAG has worked on a few projects where we had a census—such as in our research in Kosovo, where today we believe that every person who was killed in the conflict in the late 1990s was accounted for in the dataset, a rarity in our research.
Other times, our data is representative of the population. In statistics, “representative” means that every attribute that might be related to what we are trying to study in our population is accurately reflected in our sample. For example, if our population is 50% male and 50% female, then our sample would also be 50% male and 50% female.
There are a few different approaches statisticians use to create a representative dataset. The gold standard for creating a representative sample is selecting a simple random sample, where each member of our population has an equal, calculable probability of being selected for our dataset.
An example of random sampling from HRDAG’s work comes from the National Police Archives in Guatemala. This massive warehouse of mouldering archives contained millions of pages about Guatemalan police actions over a 100-year period. It would have been impossible to review and analyze every page. Instead, my colleague Patrick Ball led a team in creating a physical map of the different rooms in the warehouse and the stacks of documents contained in each room. He then designed a multi-stage sampling project (not a simple random sample) to select thousands of documents from different locations in the warehouse to use in our analysis. Combining mapping and statistical sampling techniques, Patrick and our partners on the ground selected and studied a representative sample over many months, which we then used to draw defensible conclusions about what was in the full archive.
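A toy sketch of the multi-stage idea: first randomly sample rooms, then randomly sample stacks within each chosen room. The archive layout, room names, and sample sizes below are entirely hypothetical, not the actual Guatemala design.

```python
import random

random.seed(7)

# Hypothetical archive layout: each room holds some number of document stacks.
archive = {
    f"room_{r}": [f"room_{r}/stack_{s}" for s in range(random.randint(20, 80))]
    for r in range(12)
}

# Stage 1: randomly select a subset of rooms.
rooms = random.sample(sorted(archive), k=4)

# Stage 2: within each chosen room, randomly select stacks to review.
selected = [
    stack
    for room in rooms
    for stack in random.sample(archive[room], k=5)
]

print(f"reviewing {len(selected)} stacks from rooms: {rooms}")
```

Because selection probabilities at each stage are known and calculable, results from the sampled stacks can be weighted back up to make defensible statements about the whole archive.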
But a representative sample isn’t always possible. In many cases, we are dealing with biased data.
In everyday usage, the term bias carries a negative or even malicious connotation. In statistics, it’s just a way to describe a type of data. Bias refers to systematic differences between the population and the sample we observe.
Some of the types of bias we find in datasets include:
- Selection bias—when the likelihood of being “selected” into a sample varies across a population and is related to an attribute of interest, the resulting dataset fails to accurately reflect the attributes of the population in meaningful ways. This happens frequently with “convenience samples,” samples collected without a random sampling process.
- Event size bias—this is a type of selection bias in which events of a certain type (such as events involving multiple people) are more likely to show up in the dataset than other events. HRDAG researched the problems caused by event size bias in our work on the Iraq Body Count project, where incidents involving multiple deaths in Iraq were more likely to be recorded than other incidents.
- Disclosure bias—this occurs when people are more likely to report events that are already known by others or perceived as safer to report. We see this in tracking instances of sexual violence: incidents witnessed by a third party are far more likely to be added to the dataset than incidents witnessed only by the victim and perpetrator, as both the victim and perpetrator could suffer social consequences for making a report. For more on disclosure bias, see this study about disclosure bias in conflict-affected adolescent girls in DRC and Ethiopia.
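To see how event size bias can distort a naive calculation, here is a small simulation. The event sizes and reporting probabilities are assumed purely for illustration: when larger events are more likely to be documented, the average event size in the dataset overstates the average in the real world.

```python
import random

random.seed(1)

# Hypothetical conflict events: each value is the number of people involved.
# The size distribution below is invented for illustration only.
events = [random.choice([1, 1, 1, 1, 2, 3, 5, 10]) for _ in range(50_000)]

# Assumed reporting model: bigger events have a higher chance of being recorded.
def reported(size):
    prob = min(1.0, 0.2 + 0.08 * size)
    return random.random() < prob

# The dataset we actually observe contains only the documented events.
observed = [size for size in events if reported(size)]

print(f"true mean event size:       {sum(events) / len(events):.2f}")
print(f"documented mean event size: {sum(observed) / len(observed):.2f}")
```

The documented mean comes out systematically higher than the true mean, even though no one acted maliciously; the bias is built into which events make it into the dataset at all.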
Bias in data is not insurmountable. Statisticians have developed tools to address the problems created by biased data, so that even imperfect data can help us understand the world. We’ll talk about how those tools work in future editions of this newsletter.
For now, the most important thing for readers to know is this: unless you happen to have data that is identical to your population (a census) or have a carefully designed process to pull a representative random sample, your dataset is likely to be biased. Which means that most of the datasets that policymakers, academics, researchers, and everyday people use to draw conclusions about the world around them are biased datasets.
Statisticians acknowledge the bias, work to shed light on it, use scientific methods to correct for it, and then offer conclusions that embrace the uncertainty caused by bias. This approach helps us understand that just because our dataset says one thing does not mean that is reflected in the real world.
This is a lesson for all people encountering datasets in the world, and a good reminder to always ask: what’s my data’s relationship to the world around it?
-mep
This article was written by Megan Price, executive director for the Human Rights Data Analysis Group (HRDAG), a nonprofit organization using scientific data analysis to shed light on human rights violations.
Image credit: Kathryn Conrad / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/
P.S. In case you missed it, HRDAG co-hosted a powerful event in Oakland, CA about how data and transparency efforts support movements around police accountability. Speakers included activists, academics, and people who had lost their loved ones to police violence. You can watch it here.
Structural Zero is a free monthly newsletter that explores what scientific and mathematical concepts teach us about the past and the present. Appropriate for scientists as well as anyone who is curious about how statistics can help us understand the world, Structural Zero is edited by Rainey Reitman and written by four data scientists who use their skills in support of human rights. Subscribe today to get our next installment. You can also follow us on Bluesky, Mastodon, LinkedIn, and Threads.
If you get value out of these articles, please support us by subscribing, telling your friends about the newsletter, and recommending Structural Zero to others.

