Why raw data doesn’t support analysis of violence
by Patrick Ball
June 14, 2013
This morning I got a query from a journalist asking for our data from the report we published yesterday. The journalist was hoping to create an interactive infographic to track the number of deaths in the Syrian conflict over time. Our data would not support an analysis like the one proposed, so I wrote this reply.
We can’t send you these data because they would be misleading—seriously misleading—for the purpose you describe. Here’s why:
What we have is a list of documented deaths: in essence, a highly non-random sample, though a very big one. We like bigger samples because we think that they must be closer to the truth. The mathematical justification for this idea (“bigger samples have smaller errors!”) is called the central limit theorem. However, the theorem only applies if the samples are “independent and identically distributed.” Put simply, the samples have to be drawn randomly. Otherwise the central limit theorem doesn’t hold, and therefore patterns in a non-random sample (however big or small) have no necessary mathematical relationship to reality.
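A small simulation can make this concrete. The sketch below is purely hypothetical: the two regions, their shares of the deaths, and the documentation probabilities are invented for illustration and are not taken from our data. It only shows that when documentation is non-random, making the sample bigger does not move the observed share toward the true share.

```python
import random

def documented_share_region_a(n_deaths, seed=0):
    """Share of documented deaths that fall in region A, for a biased sample."""
    rng = random.Random(seed)
    true_share_a = 0.3              # hypothetical: 30% of all deaths occur in region A
    p_doc = {"A": 0.8, "B": 0.2}    # hypothetical chance that a death gets documented
    documented = {"A": 0, "B": 0}
    for _ in range(n_deaths):
        region = "A" if rng.random() < true_share_a else "B"
        if rng.random() < p_doc[region]:
            documented[region] += 1
    return documented["A"] / (documented["A"] + documented["B"])

# The true share for region A is 0.3, but the documented share settles
# near 0.24 / 0.38 (about 0.63) no matter how large the sample gets.
for n in (1_000, 10_000, 100_000):
    print(n, round(documented_share_region_a(n), 3))
```

Under these assumed numbers, the documented data would suggest that most deaths happen in region A, when in fact most happen in region B. A larger list only makes the wrong answer look more stable.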
To understand a pattern (“deaths over time”), you have to make a statistical inference: a projection from the sample to the underlying population of deaths. If you were to use the raw data for the analysis, you would be assuming that the count for each day represents the same, constant proportion of the total deaths on that day.
For example, imagine that you observe 100 killings on Thursday, 120 on Friday, and 80 on Saturday. Is there a peak on Friday? Maybe, maybe not. The real question is: how many killings really happened on Thursday? Let’s say there were 150, so you observed 100/150 ≈ 0.67 of the total on Thursday. Did you also observe 0.67 of the total killings on Friday? On Saturday? Again, maybe.
Or maybe not. Maybe on Friday your team worked really hard and observed 0.8 of the total killings: you observed 120 and there were really 150 (the same as Thursday). On Saturday, however, some of your team stayed home with their families, so you really only observed 0.5 of the total killings: you observed 80, but there were really 160. The true pattern of killings is therefore that the numbers were equal on Thursday and Friday—and Saturday was worse. The true pattern could be very different from the observed pattern.
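The arithmetic in this example can be written out as a short sketch. The observed counts and the true totals are the made-up numbers from the paragraphs above; in a real conflict the true totals are exactly what we do not know, which is the whole problem. The variable and function names are mine, chosen for illustration.

```python
# Observed counts and hypothetical true totals from the example above.
observed = {"Thursday": 100, "Friday": 120, "Saturday": 80}
true_total = {"Thursday": 150, "Friday": 150, "Saturday": 160}

def coverage_rates(observed, true_total):
    """Fraction of the true killings that was documented on each day."""
    return {day: observed[day] / true_total[day] for day in observed}

rates = coverage_rates(observed, true_total)
peak_observed = max(observed, key=observed.get)   # worst day in the raw data
peak_true = max(true_total, key=true_total.get)   # worst day in reality

for day in observed:
    print(f"{day}: observed {observed[day]}, true {true_total[day]}, "
          f"coverage {rates[day]:.2f}")
print("Peak in observed data:", peak_observed)
print("Peak in true totals:", peak_true)
```

Because the coverage rate changes from day to day, the raw counts put the peak on Friday while the true totals put it on Saturday.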
When we estimate total killings, we often find variation in coverage rates all over the place: the example here is completely within a reasonable range (see our publications page for examples from previous projects). The point of quantitative reasoning is to give us a reliable sense of pattern and scale, and raw data simply cannot give us what we need. The non-random nature of the sample we presented yesterday means that making an inference from the data about patterns over time would be inappropriate.
Let me be clearer: using raw data on violence to analyze patterns is not just imprecise, it can be completely wrong, or even worse, misleading in complicated ways that can be staggeringly confusing (check out “event size bias” in raw mortality data in Iraq: subtle but devastating to analysis). We have done research on other conflicts showing that the raw, documented data can show trends and patterns that are the reverse of the true patterns. See Megan’s discussion of why data like this are problematic, our website’s core concepts page for details on how we do analysis, and our chapters in the recently released Oxford UP volume on counting casualties, referenced on the website.
The total count is a minimum, sure, and that’s what our report shows, but that’s all we can affirm with any scientific certainty.
We’ll be publishing a set of statistical estimates on the Syrian conflict in a scientific journal sometime in the next few months. At that time, we’ll publish the estimates as a data file, and those will be an appropriate basis for graphics like the one you propose. The point of the estimates is precisely to adjust for the non-randomness of the samples. In essence, we will be modeling the sampling process in order to make estimates for each combination of time, place, and victim demographics.