Why raw data doesn’t support analysis of violence

by Patrick Ball
June 14, 2013

This morning I got a query from a journalist asking for our data from the report we published yesterday. The journalist was hoping to create an interactive infographic to track the number of deaths in the Syrian conflict over time. Our data would not support an analysis like the one proposed, so I wrote this reply.

We can’t send you these data because they would be misleading—seriously misleading—for the purpose you describe. Here’s why:

What we have is a list of documented deaths: in essence, a highly non-random sample, though a very big one. We like bigger samples because we think that they must be closer to the truth. The mathematical justification for this idea (“bigger samples have smaller errors!”) is called the central limit theorem. However, the theorem only applies if the observations are “independent and identically distributed.” Put simply, the samples have to be drawn randomly. Otherwise the central limit theorem doesn’t hold, and therefore patterns in a non-random sample (however big or small) have no necessary mathematical relationship to reality.
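A small simulation can make this concrete. The sketch below is only an illustration under invented assumptions: a hypothetical population in which 30% of deaths occur in an “urban” area, and a documentation process that records urban deaths with probability 0.8 and rural deaths with probability 0.2. None of these numbers come from our data. It compares a random sample with a non-random one as both grow.

```python
import random

# All values below are assumptions made up for this illustration.
TRUE_URBAN_SHARE = 0.3
DOCUMENTATION_PROB = {"urban": 0.8, "rural": 0.2}


def urban_share(records):
    """Share of records located in the urban area."""
    return sum(place == "urban" for place in records) / len(records)


def run_simulation(sizes=(100, 1_000, 10_000, 50_000),
                   population_size=200_000, seed=1):
    rng = random.Random(seed)
    # Hypothetical population of deaths, each tagged with a location.
    population = ["urban" if rng.random() < TRUE_URBAN_SHARE else "rural"
                  for _ in range(population_size)]
    # Non-random documentation: the chance that a death is recorded
    # depends on where it happened.
    documented = [p for p in population
                  if rng.random() < DOCUMENTATION_PROB[p]]
    results = []
    for n in sizes:
        random_sample = rng.sample(population, n)  # every death equally likely
        non_random_sample = documented[:n]         # first n documented deaths
        results.append((n, urban_share(random_sample),
                        urban_share(non_random_sample)))
    return urban_share(population), results


true_share, results = run_simulation()
print(f"true urban share: {true_share:.3f}")
for n, random_est, non_random_est in results:
    print(f"n={n:>6}: random sample {random_est:.3f}, "
          f"non-random sample {non_random_est:.3f}")
```

Under these assumptions the random sample should settle near the true share as n grows, while the non-random sample settles near a different value (about 0.63 here, because 0.24 / 0.38 of documented deaths are urban). Making the non-random sample bigger makes it more stable, not more accurate.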

To understand a pattern (“deaths over time”) you are making a statistical inference, a projection from the sample to the underlying population of deaths. If you were to use the raw data for the analysis, you would be assuming that the count for each day represents the same constant proportion of the total deaths on that day.

For example, imagine that you observe 100 killings on Thursday, 120 on Friday, and 80 on Saturday. Is there a peak on Friday? Maybe, maybe not. The real question is: how many killings really happened on Thursday? Let’s say there were 150, so you observed 100/150 ≈ 0.67 on Thursday. Did you also observe 0.67 of the total killings on Friday? On Saturday? Again, maybe.

Or maybe not. Maybe on Friday your team worked really hard and observed 0.8 of the total killings: you observed 120 and there were really 150 (the same as Thursday). On Saturday, however, some of your team stayed home with their families, so you really only observed 0.5 of the total killings: you observed 80, but there were really 160. The true pattern of killings is therefore that the numbers were equal on Thursday and Friday—and Saturday was worse. The true pattern could be very different from the observed pattern.
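The arithmetic of this example can be laid out in a few lines. This is a minimal sketch: the observed counts and the “true” totals are the hypothetical numbers from the example above, and the helper names `coverage_rates` and `peak_day` are invented for illustration.

```python
# Hypothetical numbers from the example in the text, not real data.
observed = {"Thursday": 100, "Friday": 120, "Saturday": 80}
true_total = {"Thursday": 150, "Friday": 150, "Saturday": 160}


def coverage_rates(observed, true_total):
    """Fraction of the true killings that was documented on each day."""
    return {day: observed[day] / true_total[day] for day in observed}


def peak_day(counts):
    """Day with the largest count."""
    return max(counts, key=counts.get)


rates = coverage_rates(observed, true_total)
for day in observed:
    print(f"{day}: observed {observed[day]}, true {true_total[day]}, "
          f"coverage {rates[day]:.2f}")

print("Peak in the observed data:", peak_day(observed))
print("Peak in the true totals:", peak_day(true_total))
```

The observed counts peak on Friday, but the true totals peak on Saturday. The difference comes entirely from coverage rates that change from day to day, and coverage is exactly what raw data cannot tell you.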

When we estimate total killings, we often find variation in coverage rates all over the place: the example here is completely within a reasonable range (see our publications page for examples from previous projects). The point of quantitative reasoning is to give us a reliable sense of pattern and scale, and raw data simply cannot give us what we need. The non-random nature of the sample we presented yesterday means that making an inference from the data about patterns over time would be inappropriate.

Let me be clearer: using raw data on violence to analyze patterns is not just imprecise, it can be completely wrong, or even worse, misleading in complicated ways that can be staggeringly confusing (check out “event size bias” in raw mortality data in Iraq, which is subtle but devastating to analysis). We have done research on other conflicts showing that the raw, documented data can show trends and patterns that are the reverse of the true patterns. See Megan’s discussion of why data like this are problematic and our website’s core concepts page for details on how we do analysis, and see our chapters in the recently released Oxford UP volume on counting casualties, referenced on the website.

The total count is a minimum, sure, and that’s what our report shows, but that’s all we can affirm with any scientific certainty.

We’ll be publishing a set of statistical estimates on the Syrian conflict in a scientific journal some time in the next few months. At that time, we’ll publish the estimates as a data file, and those will be an appropriate basis for graphics like the one you propose. The point of the estimates is precisely to adjust for the non-randomness of the samples. In essence, we will be modeling the sampling process in order to make estimates for each combination of time, place, and victim demographics.

[Creative Commons BY-NC-SA]
