Epidemiology has theories. We should study them.
- How many people are infected now?
- How many people will be infected soon?
- How many people are likely to die?
- When will it be over?
A lot of people are working hard to answer these questions. But just because someone has “used data,” that does not mean it’s good science, or even science at all. In the presence of so many essays linking to dashboards and shiny visualizations, how can an interested non-technical reader find good science among the noise?
We think science starts with theories, that is, stories about how the world works. A theory can be tested using information from the world through data and experiments. But the theory, the data, and the experiments have to be done just right—or the results are nonsense. That means if someone is using data to make an argument about what’s coming, one of the first questions you should ask is: What’s the theory? Why should we believe that these data can tell us anything about the future?
Science often relies on models. A model is a simplified version of the theory. It tells us what we need to measure, and it gives us a way to calculate the things we want to know. The model gives us a way to test its results against measurements from the world. And finally, the model tells us how to figure out if our answer supports or contradicts the theory.
While models can seem simple when expressed in words, they can become complicated very quickly. Several good overviews have been published in the last few days. Physicist Bruno Gonçalves wrote an excellent overview of how the Susceptible-Infected-Removed model works. FiveThirtyEight has a solid explanation titled “Why It’s So Freaking Hard to Make a Good COVID-19 Model.” (Stay tuned for our own explainers coming out in the next few days.)
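To make the idea of a model concrete, here is a minimal sketch of the Susceptible-Infected-Removed model mentioned above, written as a daily-step approximation. It is our own illustration, not code from any of the overviews cited here. The function name and every parameter value (population size, transmission rate `beta`, removal rate `gamma`) are hypothetical and are not estimates for COVID-19.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Daily-step approximation of a Susceptible-Infected-Removed model.

    beta  : hypothetical transmission rate per day
    gamma : hypothetical removal rate per day (1 / average infectious period)
    Returns a list of (susceptible, infected, removed) tuples, one per day.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        # New infections depend on contact between susceptible and infected people.
        new_infections = beta * s * i / population
        # A fixed share of infected people recover or die each day.
        new_removals = gamma * i
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((s, i, r))
    return history


# Illustrative values only; these are assumptions, not fitted to any data.
history = simulate_sir(population=1_000_000, initial_infected=10,
                       beta=0.3, gamma=0.1, days=200)
infected_by_day = [i for (_, i, _) in history]
peak_day = infected_by_day.index(max(infected_by_day))
print(f"Infections peak on day {peak_day} "
      f"with about {max(infected_by_day):,.0f} people infected at once")
```

Even this toy version shows what the theory commits to: people move in one direction from susceptible to infected to removed, and the speed of the epidemic depends on how many people are in each group. Real models add many refinements on top of this skeleton.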
In science (and in life) we often want to know lots of things that we do not currently know. So what do we know now?
- How many people have tested positive
- How many people have died, both in total and each day
- What fraction of all infected people die (in some very specific and special places)
- How long it takes from infection → symptoms → recovery/death (in some places for some people)
It’s worth noting that the first item, which we can think of as confirmed case counts, is not very useful for answering the questions we started with. For instance, if testing is limited to those with the most severe symptoms, that could cause us to overestimate how deadly COVID-19 is. Raw tallies of diagnosed infections are not very useful measurements of the number of infected people. Instead, the case totals reflect some combination of the number of infections in the population, the rate of testing in the population, and the proportion of infections that cause symptoms. If we do more tests, we’ll find more infected people, and vice versa. Worse, if we focus testing in some specific part of the population (e.g., NBA players), we may get an exaggerated sense of how high-risk that population is. At the same time, if we fail to test in some other part of the population (e.g., people in jail and prison), we may fail to understand how serious their risk is.
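A small piece of arithmetic shows how much the testing policy alone can move these numbers. The sketch below holds the true number of infections and deaths fixed and changes only who gets tested. Every figure in it is hypothetical, and it makes the simplifying assumption that all deaths are counted, which is not guaranteed in practice.

```python
def confirmed_cases(true_infections, symptomatic_fraction,
                    test_rate_symptomatic, test_rate_asymptomatic):
    """Expected confirmed case count under a simple testing policy."""
    symptomatic = true_infections * symptomatic_fraction
    asymptomatic = true_infections - symptomatic
    return (symptomatic * test_rate_symptomatic
            + asymptomatic * test_rate_asymptomatic)


# All numbers below are hypothetical, chosen only to illustrate the argument.
TRUE_INFECTIONS = 10_000
DEATHS = 50                  # assumption: every death is counted
SYMPTOMATIC_FRACTION = 0.6

# (share of symptomatic people tested, share of asymptomatic people tested)
scenarios = {
    "limited testing": (0.2, 0.0),
    "broad testing": (0.8, 0.3),
}

true_fatality = DEATHS / TRUE_INFECTIONS
results = {}
for name, (rate_sym, rate_asym) in scenarios.items():
    cases = confirmed_cases(TRUE_INFECTIONS, SYMPTOMATIC_FRACTION,
                            rate_sym, rate_asym)
    naive_fatality = DEATHS / cases  # deaths divided by confirmed cases only
    results[name] = (cases, naive_fatality)
    print(f"{name}: {cases:,.0f} confirmed cases, "
          f"naive fatality {naive_fatality:.2%} "
          f"(true fatality {true_fatality:.2%})")
```

The underlying epidemic is identical in both scenarios. Only the testing changes, yet the confirmed case count and the apparent deadliness of the disease both shift substantially.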
So, if you encounter a report or analysis that uses data in some way, how can you know if that analysis is reasonable science, or if it’s merely amateur guesswork? Good science will have these features:
- It will worry a lot about what it doesn’t know
- It is evaluated based on how well it can explain the data from the past, rather than the specific predictions it makes about the future
- It will be very explicit about the theory it’s using, the model it created, and assumptions needed to combine the theory with the available data to get to the model
- Projections without an explicit theory are just attempts to fit arbitrary curves to the numbers in front of us. While they might fit the data we can see, there’s no reason such projections should correctly forecast future trends
- Good science will talk about how the findings from the model support or refine or reject the theory on which the model was built
- What is being measured, and what is being assumed (including about the pieces being measured)?
- How are uncertainty and imprecision included and discussed? All data have errors, and sometimes the models introduce additional uncertainty. How are these challenges addressed?
- How is the model’s fit assessed? Do the authors report how well the thing being estimated is consistent with the observed part of the pandemic?
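The point in the list above about projections without an explicit theory can be sketched in a few lines. Here we generate synthetic data from a simple Susceptible-Infected-Removed simulation, fit an exponential curve to the first 30 days, and extend that curve to day 120. The simulation, its parameter values, and the choice of an exponential curve are all assumptions made for illustration; none of this is fitted to real COVID-19 data.

```python
import math


def sir_cumulative_infections(population, beta, gamma, days, initial_infected=10.0):
    """Cumulative infections per day from a simple daily-step SIR simulation."""
    s, i, r = population - initial_infected, initial_infected, 0.0
    totals = [i + r]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_removals = gamma * i
        s, i, r = s - new_infections, i + new_infections - new_removals, r + new_removals
        totals.append(i + r)
    return totals


def fit_exponential(ys):
    """Least-squares line through log(y) by day; returns (a, b) for y = a * exp(b * t)."""
    n = len(ys)
    ts = range(n)
    logs = [math.log(y) for y in ys]
    t_mean = sum(ts) / n
    log_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (l - log_mean) for t, l in zip(ts, logs))
             / sum((t - t_mean) ** 2 for t in ts))
    intercept = log_mean - slope * t_mean
    return math.exp(intercept), slope


# Hypothetical values, used only to produce synthetic "observed" data.
POPULATION = 1_000_000
truth = sir_cumulative_infections(POPULATION, beta=0.3, gamma=0.1, days=120)

a, b = fit_exponential(truth[:31])            # the curve sees only days 0 to 30
day30_fit = a * math.exp(b * 30)
day120_projection = a * math.exp(b * 120)

print(f"Day 30:  simulated {truth[30]:,.0f}, curve {day30_fit:,.0f}")
print(f"Day 120: simulated {truth[120]:,.0f}, curve {day120_projection:,.0f}")
print(f"Curve exceeds the whole population: {day120_projection > POPULATION}")
```

The curve tracks the early data reasonably well, but it has no notion that the supply of susceptible people runs out, so its long-range projection ends up larger than the entire population. A model built on an explicit theory of transmission cannot make that particular mistake.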
This can be extremely difficult to do. It’s sometimes hard to tell whether even a study by reputable people with solid technical foundations has been done correctly. This is why science tends to be somewhat slow, and peer review takes a while. During the intense urgency of the pandemic, a lot of science is released quickly. There may be errors, the assumptions may be badly founded, and the models may be inappropriate. This means that as a society and as scientifically engaged citizens, we need to be especially careful readers.
In an article a couple of weeks ago, we highlighted two studies that struck us as particularly good. In upcoming publications, we’ll do more of that, explaining new studies that seem to be exemplary. In another piece, we’ll explain how one epidemiological model works, linking the equations to word problems that non-statisticians can read.
Most of what we think we can do during the pandemic is explain. These models are complicated, but the stakes are high. Understanding these models can help us see what is likely to come, and why we all need to stay home and wash our hands.