25 October 2016

This is a semi-technical introduction to estimating undocumented events by using multiple, intersecting datasets. This approach is called “capture-recapture” or “multiple systems estimation,” I’ll refer to this approach as MSE. There’s an extended, non-technical introduction to MSE here. A somewhat more technical introduction to the frequentist approaches to MSE is given in a 2013 article by Banks, Lum, and Price.

The point of MSE is to use multiple, partial lists of a population to estimate the total population. The method depends on the integration of the lists, in which records that refer to the same elements of the population (called “co-referent” records) are linked in clusters. There is more explanation of the data in the Data Setup section.

In the example presented in this post, I’m using a version of the data I presented in the trial for genocide of Gen. Efraín Ríos Montt of Guatemala. These data are for April 1982 through July 1983; for all perpetrators, including the Army, police, civilians, the URNG, and unknown perpetrators; and for victims of all ethnicities. Note that this is a little bit different from the data I presented in the trial: those data were stratified by the victims’ ethnicity, and included only killings by the Army.

The data consist of four columns of [0,1] which indicate, for that row, the presences or absence of a given dataset in that row’s observed frequency. Each row of the data is counting all the records that have this particular “capture history,” that is, the records that have been observed and not observed by this combination of datasets.

This is much harder to describe than it is to see, so let’s jump in.

```
A <- c(rep(0,8), rep(1,8))
B <- rep(c(rep(0,4), rep(1,4)), 2)
C <- c(rep(c(0,0,1,1), 4))
D <- c(rep(c(0,1), 8))
# killings in Guatemala's Ixil counties, April 1982 - July 1983
# NB: these are *all killings*, NOT filtered by perpetrator
Freq <- c(NA, 1021, 104, 94, 55, 54, 2, 4, 53, 98, 19, 11, 4, 8, 1, 2)
deaths <- as.data.frame(cbind(A, B, C, D, Freq))
print(deaths)
```

```
## A B C D Freq
## 1 0 0 0 0 NA
## 2 0 0 0 1 1021
## 3 0 0 1 0 104
## 4 0 0 1 1 94
## 5 0 1 0 0 55
## 6 0 1 0 1 54
## 7 0 1 1 0 2
## 8 0 1 1 1 4
## 9 1 0 0 0 53
## 10 1 0 0 1 98
## 11 1 0 1 0 19
## 12 1 0 1 1 11
## 13 1 1 0 0 4
## 14 1 1 0 1 8
## 15 1 1 1 0 1
## 16 1 1 1 1 2
```

The first row [0,0,0,0] describes the number of killings that are *not observed* by datasets A, B, C, or D. The number of unobserved killings is unknown, so the value is set as NA, that is, as a missing value.^{1} The second row [0,0,0,1] gives the count of killings that are observed *only* by dataset D, and not by any of the others: 1021. The final row [1,1,1,1] gives the count of killings observed by *all four* datasets: only 2.

Generating data like this for more than a few records is pretty hard work. It’s the outcome of “database deduplication” or “record linkage” in which we combine multiple datasets, linking all the records that refer to the same person. That work is beyond the scope of this post, but a few notes can be found here.

The order of the rows is important for most of the libraries below.

The point of MSE is to use the information about the counts in the intersections of datasets to predict the unobserved deaths, i.e., those in the [0,0,0,0] row. This notebook will show three libraries’ approach to this estimation problem.

The `Rcapture`

package by Louis-Paul Rivest and Sophie Baillargeon provides tools to estimate the standard log-linear models used for capture-recapture. The theoretical approach starts with the estimation of the size of a closed population, described by Bishop, Fienberg, and Holland (1975), chapter 6.^{2} Among the aspects of this package I really like is the richness of the references in the documentation. Reading the `Rcapture`

docs is like a literature review of the frequentist approaches to MSE.

The code below estimates all the possible log-linear models, and then shows the ten best models.

```
library(Rcapture)
result <- closedpMS.t(deaths, dfreq=TRUE, h='Poisson')
print(result)
```

```
##
## Number of captured units: 1530
##
## Abundance estimations and model fits for the models with the smallest BIC:
## abundance stderr bias deviance df AIC BIC infoFit
## [34,1,2] 2330.8 137.2 -2.0 19.738 8 106.025 143.356 OK
## [13,2,4] 2436.6 148.6 -2.6 21.233 8 107.519 144.850 OK
## [13,14,2] 2558.7 179.5 -0.7 13.997 7 102.284 144.948 OK
## [13,34,2] 2288.4 133.2 -0.6 14.021 7 102.308 144.972 OK
## [1,2,3,4] 2541.5 153.8 -4.6 29.216 9 113.503 145.501 OK
## [23,34,1] 2390.5 151.4 2.4 14.758 7 103.044 145.708 OK
## [24,34,1] 2228.1 130.6 0.2 16.142 7 104.429 147.093 OK
## [23,24,34,1] 2233.0 138.6 2.8 8.970 6 99.256 147.254 OK
## [12,13,14] 2469.1 169.3 2.3 9.557 6 99.844 147.841 OK
## [14,2,3] 2687.9 189.4 -2.5 24.256 8 110.542 147.873 OK
```

The simple `print`

output shows the best few models. The “number of captured units” is the number of observed elements, in this example, the number of people documented as killed; we usually call this ${N}_{k}$. The “abundance” column shows the estimate of $\hat{N}$, the total population including the observed and the estimated unobserved deaths. The AIC and BIC columns show the “information coefficients” which balance the goodness of fit (shown in the “deviance” column) with the information used to estimate the model (shown in the “df” column; my favorite explanatin of the BIC is here).

With a tiny bit of additional hacking on the result object, we can extract pieces of the result to embed them into the markdown text directly.

As a side note, embedding results directly in the text is **really important**. It’s sometimes called “reproducible research”, and in our team, it’s part of “principled data processing.”

```
models <- as.data.frame(result$results)
models <- models[order(models$BIC), ]
```

Each model controles for a subset of all the possible interactions among the models. In the context of MSE, the two- and three-way interactions estimate (and to some extent, control for) associations in the probabilities of capture between (and among) the lists. For example, is a certain person more likely to be seen on one list, and also more likely to be seen on a second list? If associations like this are present (and they usually are), they can bias the estimate.

There are a lot of possible models we might consider, with different combinations of associations included. In a frequentist approach, we usually choose the model which optimizes the combination of the maximum goodness of fit with the minimimum amount of information used in the model; this measure is found with BIC. In this case, the model with the minimum BIC is [34,1,2]. The notation indicates that the only interaction being estimated (and consequently controlled) in this model is the the one between dataset 3 and dataset 4.

For this model, the BIC = 143.4, and the estimated total killings given that model is $\hat{N}$ = 2331. (It might be interesting to look at the r-markdown code to see how I embedded values from the R `models`

dataframe into the text here)

Once you have chosen a model, `Rcapture`

has a function called `closedpCI.t`

which can be used to estimate a multinomial profile likelihood confidence interval. These confidence intervals would of course be too narrow, since they ignore the uncertainty in selection of the model: why choose [34,1,2] instead of [13,2,4]? This choice itself introduces additional uncertainty, but we cannot quantify it directly. I’m presenting the frequentist log-linear approach here mostly to justify the Bayesian approaches coming later, so I’ll leave the confidence interval as an exercise for the reader.

Note that these ten best models have slightly different estimates, ranging from 2228 to 2688, all within very similar BIC values. Which one should we use?

This is the fundamental problem of the frequentist, log-linear approach. We could just pick the “best” model, i.e., the one with the lowest BIC. I took that approach in a number of my MSE projects (e.g., Kosovo, Perú, Guatemala for the Ríos Montt trial).

Unfortunately, just picking one model ignores the error that we introduce by the selection itself. And it ignores the real variability in the estimates that result from different decisions about which inter-system dependencies to control for, and which to exclude. By picking only one model, we control for the dependencies included in the model, but we ignore dependencies that are not in the model.

Bayesian approaches offer a solution to these limitations. My colleagues and I tried an intermediate version of this approach in our work estimating lethal violence in Casanare, Colombia. However, there’s a better, more principled approach.

In a series of papers in the 1990s, statisticians found that decomposable graphical models allow us to estimate the the marginal likelihood of each of the graphical models (graphical models are a subset of hierarchical models). The advantage of this approach is that the marginal likelihood can be understood as proportional to the probability that for certain data, a given model is true, relative to all the other models. That means that the marginal likelihood is the ideal weight for averaging across all the models. This avoids the problem of having to choose a single model.

The first of the papers describing this approach was by Dawid and Lauritzen (1993).^{3} In a 1997 *Biometrika* article, Madigan and York extended the approach to the estimation of a closed population.

In 2014, statisticians James Johndrow, Kristian Lum and I implemented this approach in an R package called dga. Here’s how to use it.

```
library(dga)
data(graphs4)
Y <- deaths$Freq
Y[1] <- 0 # remember that in the Rcapture approach, this is NA
Y <- array(Y, dim=c(2,2,2,2)) # the dataframe must be sorted by the A,B,C,D columns so that this array is shaped properly
Nmissing <- 1:(2*sum(Y)) # Nmissing is a sequence from 1 to twice the number of observed values.
num.lists <- 4 #
delta <- 1/(2^num.lists) # this is the shape of the prior distribution
```

In the setup, the Nmissing distribution needs to run from 1 to some large number. It is, as the name of the variable implies, the number of missing elements of the population that will be estimated. The maximum needs to be substantially larger than the likely number of unobserved elements of the population.

```
weights <- bma.cr(Y, Nmissing, delta, graphs4)
plotPosteriorN(weights, (Nmissing + sum(Y))) # Nhat = missing + observed
```

Bayesian methods often provide us with an estimate of the full posterior distribution of parameters, rather than a single number. The distribution has a shape which tells us which values are most likely. In the graph above, the dark black line indicates the probability of each of the values shown on the horizonatal axis.

Note that there are two peaks, quite close together. This means that there are two groups of models which have similar estimates. The distributions for the most likely models are shown in dashed lines – and notice that they are not exactly the same. This is the same issue we saw in the frequentist estimation: multiple models with similar plausibility but with different estimates.

In the Bayesian context, we can average across the models to get a single estimate. The effect of each of the individual models is weighted by its marginal likelihood, and the combined, weighted estimate is shown in the dark line. It’s a Bayesian estimate, which means it is a distribution, not a single number. The graph below shows a sample of 1000 values from the distribution of the estimated total number of deaths.

```
post_probs <- apply(weights, 2, sum)
m0s_sample_from_post <- sample(Nmissing, size=1000, replace=TRUE, prob=post_probs)
nhat_sample_from_post <- m0s_sample_from_post + sum(Y)
png(file="gt-dga-nhat_sample_from_post.png", bg="white")
hist(nhat_sample_from_post, breaks = 50,
main="Posterior Distribution of Estimated Killings",
xlim=c(1500,4500))
dev.off()
```

```
## quartz_off_screen
## 2
```

This graph is very similar to the previous graph. The point here is to compute and visualize the distribution itself. As before, there are several slightly different peaks.

```
qs <- quantile(nhat_sample_from_post, c(0.025, 0.5, 0.975))
CI <- paste('[', round(qs['2.5%'], 0), ', ', round(qs['97.5%'], 0), ']', sep='')
```

(note that the code for `CI`

above creates text that will be referenced to print the credible interval below)

Finally, if we want to use the distribution directly, we can compute the median and the credible interval. The interpretation is therefore that given the prior distributions, this data, and the decomposable graphs approach, the median estimate = 2356, and there is a 95% probability that the true value falls in the range [2072, 2756].

The `dga`

approach solves most of the problems with the frequentist approach. It provides a way to integrate the results of all the models. Even more usefully, it creates the posterior distribution of the estimate, which gives a much richer understanding of the shape of the uncertainty than a frequentist error estimate allows.

As with the frequentist approach, we’re estimating a series of models, each of which controls for the interactions in one way. We combine the results of all the models with the marginal likelihoods, which makes this approach more robust than any of the frequentist approaches.

This approach is currently the one we prefer at HRDAG. It’s a big advance relative to the frequentist approaches we used in 1999–2013, and we’ve been working with multiple, competing models for a long time. However, there are even better approaches coming.

Statistician Daniel Manrique-Vallier has come up with a quite different approach. Intuitively, this method seeks to organize records into “latent classes.” Each record is assigned to a class, and for each class, the independence model is the best fit. If there is a large enough number of classes, each record would be its own class, so this approach has to be correct in the limit. The method finds the classes for which independence within each class is optimal, then it averages across the classes.

Said differently, that for the records within each class, there are zero interactions among the capture probabilities for the different datasets. (Recall that there is a different interaction model for each class) More formally, the binary variables indicating each record’s presence/absence in each dataset are independent conditional on latent class.

Furthermore, this approach explicitly models capture probabilities as heterogeneous in the population, which is an assumption managed less directly in the log-linear and decomposable graphical approaches.

The theory for this appraoch is explained in this article, and the `LCMCR`

software is here. Here’s how to use the package.

```
require(LCMCR)
# LCMCR requires that the list-membership columns be factors
for (col in c('A', 'B', 'C', 'D')) {
deaths[[col]] <- as.factor(deaths[[col]])
}
lcmcr.deaths <- deaths[!is.na(deaths$Freq), ] # drop the 0,0,0,0,NA row
sampler <- lcmCR(captures=lcmcr.deaths, tabular=TRUE)
```

`## WARMING UP...`

```
N <- lcmCR_PostSampl(sampler, burnin = 10000, samples = 1000, thinning = 100)
hist(N, breaks=50,
main="Posterior Distribution of Estimated Killings",
xlim=c(1500,4500))
```

As before, we can compute the quantiles for a point estimate and credible interval.

```
qs <- quantile(N, c(0.025, 0.5, 0.975))
CI <- paste('[', round(qs['2.5%'], 0), ', ', round(qs['97.5%'], 0), ']', sep='')
```

Note first that as with the `dga`

approach, we can compute the median and the credible interval. The interpretation is therefore that given the priors, this data, and the LCMCR approach, the median estimate = 2360, and there is a 95% probability that the true value falls in the range [1998, 2828]. The estimate and interval is very similar to the `dga`

estimate and interval.

The `LCMCR`

approach is probably the best option available for MSE right now, specifically because the point of MSE in human rights work is to estimate the unobserved events (other uses of MSE might be more interested in how the datasets interact, and `LCMCR`

would be less useful for that purpose). We’re still testing it and learning about it’s strengths and weaknesses. In particular, `LCMCR`

uses a Markov chain Monte Carlo appoach to estimate the parameters, so we need to apply convergence diagnostics to check for failure to converge.

There are a lot of really useful extensions to MSE that `LCMCR`

enables. For example, Johndrow, Lum, and Manrique-Vallier have proposed a method that estimates only the population which has some reasonable probabilty of being observed. We have always known that if there is no chance that a given death can be observed, then there is no chance we can estimate it. Instead, the authors “propose estimating the population of individuals with capture probability exceeding some threshold” (the threshold is called “$\alpha $-observability”). This approach helps us to be clear (in a principled and explicit way) about the deaths we can possibly estimate, and as a happy side effect, setting $\alpha $-observability reduces the variance of the estimate.

First off, wow! These tools are **so much better** than what we had even a few years ago. I spent about 15 years writing model selection routines by hand, implementing essentially *ad-hoc* model averaging approaches, and trying to run models over hundreds of strata. This is a lot better, for both the frequentist and Bayesian methods, there’s both easier programming and more principled approaches.

Raw data is limited: we don’t know what we don’t know. MSE is one way to estimate what is not directly observed (there are others, this is just the one we use). There are a number of ways to do MSE, and it is a rich and rapidly developing research area. Stay tuned!

I am grateful for the very useful comments and corrections from my colleagues Dr Megan Price, Dr Anita Gohdes, Dr James Johndrow, and Dr Daniel Manrique-Vallier. All mistakes remain my responsibility, of course.

We thank our project partner, the Human Rights, Big Data and Technology Project, funded by the Economic and Social Research Council.

“Computing Multiple Systems Estimation in R” by HRDAG is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license.

You are free to:

- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material

The licensor cannot revoke these freedoms as long as you follow the license terms. See the link for the full license.

For the raw R Markdown code for this post, click here. With the R Markdown file, you can run the code shown here in RStudio, and see how I used variables embedded in the text to create a “reproducible research” document.

In their classic work on the analysis of contingency tables, this value is called a “structural zero” by Bishop, Fienberg, and Holland (1975).↩

Bishop, Y., S.E. Fienberg, and P. Holland. 1975.

*Discrete Multivariate Analysis*. MIT Press.↩Dawid, A.P., and S. L. Lauritzen. “Hyper Markov laws in the statistical analysis of decomposable graphical models.” The Annals of Statistics (1993): 1272-1317↩