Implementing LCMCR in Stan
The statistical core of HRDAG's methodology is estimating undocumented events from several intersecting datasets, a technique called multiple systems estimation (MSE). Since 2016, HRDAG has used LCMCR, a fully Bayesian technique based on latent class models, via its corresponding R package.
The R package implements a highly optimized Gibbs sampler, which has both pros and cons. On the one hand, it is fast and reliable, and it has produced useful estimates for HRDAG over the last six years. On the other hand, the heavy optimization, combined with the nature of Gibbs sampling, makes it difficult to extend the model to accommodate features we want in the future.
To increase the future extensibility of our models, we have implemented LCMCR in Stan, an easy-to-use platform for statistical modeling, where we express the model in the declarative format of the Stan probabilistic programming language. All of the Stan code, and the code for running the diagnostic experiments below, are available on GitHub.
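To make the discussion below concrete, here is a minimal sketch of the marginalized LCMCR model in Stan. This is our illustrative reconstruction, not a verbatim copy of the code in the repository: the data layout (distinct observed capture histories with counts) and all variable names are ours.

```stan
data {
  int<lower=1> J;                          // number of lists
  int<lower=1> C;                          // number of distinct observed capture histories
  int<lower=1> K;                          // maximum number of latent classes
  array[C, J] int<lower=0, upper=1> hist;  // the capture histories
  array[C] int<lower=0> count;             // individuals with each history
}
transformed data {
  int n_obs = sum(count);                  // total number of observed individuals
}
parameters {
  real<lower=0> alpha;                     // stick-breaking concentration
  vector<lower=0, upper=1>[K - 1] v;       // stick-breaking proportions
  array[K] vector<lower=0, upper=1>[J] lambda;  // per-class list probabilities
}
transformed parameters {
  simplex[K] theta;                        // latent class weights
  real log_p0;                             // log P(appearing on no list)
  {
    real rest = 1;                         // stick length remaining
    for (k in 1:(K - 1)) {
      theta[k] = v[k] * rest;
      rest *= 1 - v[k];
    }
    theta[K] = rest;
  }
  {
    vector[K] lp0;
    for (k in 1:K)
      lp0[k] = log(theta[k]) + sum(log1m(lambda[k]));
    log_p0 = log_sum_exp(lp0);
  }
}
model {
  alpha ~ gamma(0.25, 0.25);               // matches the R package's default prior
  v ~ beta(1, alpha);                      // stick-breaking prior
  // lambda has an implicit uniform(0, 1) prior
  for (c in 1:C) {
    vector[K] lp;
    for (k in 1:K) {
      lp[k] = log(theta[k]);
      for (j in 1:J)
        lp[k] += hist[c, j] == 1 ? log(lambda[k, j]) : log1m(lambda[k, j]);
    }
    // marginalize over classes, conditioning on being observed at least once
    target += count[c] * (log_sum_exp(lp) - log1m_exp(log_p0));
  }
}
generated quantities {
  // under a p(N) proportional to 1/N prior, N - n_obs is negative binomial;
  // Stan's (shape, inverse scale) parameterization takes beta = (1 - p0) / p0
  int N = n_obs + neg_binomial_rng(n_obs, exp(log1m_exp(log_p0) - log_p0));
}
```

The likelihood conditions each observed history on having been captured at least once, since individuals who appear on no list never enter the data; the unobserved remainder of the population is then recovered as N in generated quantities.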
As a first order of business, we generate five simulated datasets, each with a true underlying population size of N=2000, and compare the outputs of the R LCMCR package and our Stan implementation. (A sketch of how such data can be simulated follows the first table below.)
The first simulation is identical to the simulation used in Manrique-Vallier (2016). There are two strata, with 90% of the population falling into the first stratum and 10% into the second, and five lists. The probability of being observed on each list, by stratum, is as follows:
```r
library(pacman)
pacman::p_load(here, yaml, dplyr, magrittr, LCMCR)

simulations <- read_yaml(here("write", "hand", "simulations.yaml"))
datasets_raw <- simulations$data

# convert one simulation spec (a list of strata, each with a size and five
# list-inclusion probabilities) into a data frame for display
extract_table_df <- function(dataset_rows) {
  rows <- lapply(seq_along(dataset_rows), function(i) {
    raw_row <- dataset_rows[[i]]
    data.frame(
      Stratum = i,
      N = raw_row$size,
      `List 1` = raw_row$probs[[1]],
      `List 2` = raw_row$probs[[2]],
      `List 3` = raw_row$probs[[3]],
      `List 4` = raw_row$probs[[4]],
      `List 5` = raw_row$probs[[5]],
      check.names = FALSE
    )
  })
  do.call(rbind, rows)
}

df_table <- extract_table_df(datasets_raw[[1]]$rows)
knitr::kable(df_table)
```
Stratum | N | List 1 | List 2 | List 3 | List 4 | List 5 |
---|---|---|---|---|---|---|
1 | 1800 | 0.033 | 0.033 | 0.099 | 0.132 | 0.033 |
2 | 200 | 0.660 | 0.825 | 0.759 | 0.990 | 0.693 |
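The data-generating process itself can also be expressed as a Stan program run with the fixed_param algorithm. The following is a minimal sketch under our own names and layout; the repository's simulation code may differ:

```stan
data {
  int<lower=1> S;                        // number of strata
  int<lower=1> J;                        // number of lists
  array[S] int<lower=0> stratum_n;       // stratum sizes, e.g. {1800, 200}
  matrix<lower=0, upper=1>[S, J] p;      // list-inclusion probabilities by stratum
}
generated quantities {
  // one simulated capture history per individual; all-zero rows correspond
  // to individuals no list observed, and are dropped before model fitting
  array[sum(stratum_n), J] int<lower=0, upper=1> x;
  {
    int i = 1;
    for (s in 1:S) {
      for (m in 1:stratum_n[s]) {
        for (j in 1:J)
          x[i, j] = bernoulli_rng(p[s, j]);
        i += 1;
      }
    }
  }
}
```

Since the program has no parameters block, it must be run with algorithm=fixed_param (for example, via cmdstanr's `sample(fixed_param = TRUE)`).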
For the second simulated dataset, we lower the probabilities of observation, and increase the proportion of people in the second stratum:
```r
df_table <- extract_table_df(datasets_raw[[2]]$rows)
knitr::kable(df_table)
```
Stratum | N | List 1 | List 2 | List 3 | List 4 | List 5 |
---|---|---|---|---|---|---|
1 | 1600 | 0.013 | 0.013 | 0.099 | 0.132 | 0.033 |
2 | 400 | 0.560 | 0.825 | 0.759 | 0.990 | 0.593 |
For the third simulated dataset, we create an example with higher probabilities of observation:
```r
df_table <- extract_table_df(datasets_raw[[3]]$rows)
knitr::kable(df_table)
```
Stratum | N | List 1 | List 2 | List 3 | List 4 | List 5 |
---|---|---|---|---|---|---|
1 | 1600 | 0.113 | 0.113 | 0.199 | 0.232 | 0.133 |
2 | 400 | 0.860 | 0.925 | 0.859 | 0.990 | 0.793 |
For the fourth simulated dataset, we introduce three strata:
```r
df_table <- extract_table_df(datasets_raw[[4]]$rows)
knitr::kable(df_table)
```
Stratum | N | List 1 | List 2 | List 3 | List 4 | List 5 |
---|---|---|---|---|---|---|
1 | 1600 | 0.1 | 0.05 | 0.15 | 0.2 | 0.1 |
2 | 200 | 0.5 | 0.40 | 0.60 | 0.4 | 0.5 |
3 | 200 | 0.8 | 0.90 | 0.60 | 0.7 | 0.7 |
For the fifth simulated dataset, we introduce negative list dependencies across the strata:
```r
df_table <- extract_table_df(datasets_raw[[5]]$rows)
knitr::kable(df_table)
```
Stratum | N | List 1 | List 2 | List 3 | List 4 | List 5 |
---|---|---|---|---|---|---|
1 | 1600 | 0.113 | 0.113 | 0.199 | 0.832 | 0.733 |
2 | 400 | 0.860 | 0.925 | 0.859 | 0.190 | 0.193 |
We fit models to these examples using 1000 iterations and K=10 maximum latent classes. For Stan, we used four chains and 2000 warmup iterations. For the R LCMCR implementation, we used a buffer size of 10000, thinning=100, a_alpha=b_alpha=0.25, and a burn-in of 10000 iterations.
However, we immediately notice two problems: the class-specific parameters mix poorly across chains, and Stan reports divergent transitions.
The poor mixing stems from a problem in latent class models known as label switching. Fundamentally, unless we specify some ordering of the latent classes, there is no reason that "Latent Class 1" and "Latent Class 3" couldn't swap places. With K latent classes, the posterior is highly multimodal: each of the K! permutations of the class labels is an equally valid labeling. In practice, this leads to poor mixing of the chains and unacceptably low effective sample sizes.
We resolve this issue by imposing an ordering on the latent class labels. In particular, we require that the classes appear in descending order of size, which we implement by sorting the parameters indexed by latent class in each iteration.
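In terms of the sketch above, one way to implement this relabeling (illustrative; the released code may differ) is to sort by class weight in the generated quantities block:

```stan
generated quantities {
  // report class-indexed parameters under a descending-size labeling, so
  // that "class 1" refers to the same thing in every posterior draw
  array[K] int ord = sort_indices_desc(theta);
  vector[K] theta_ord = theta[ord];
  array[K] vector[J] lambda_ord = lambda[ord];
}
```

Because the likelihood is invariant to permutations of the class labels, relabeling the draws this way changes only how they are reported, not the posterior itself.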
The second problem concerns divergences. Generally, MCMC relies on an assumption called geometric ergodicity for a central limit theorem to hold and for parameters to be estimated in an unbiased fashion within computationally reasonable time limits. Stan and its Hamiltonian Monte Carlo approach have an advantage over previous methods: deviations from geometric ergodicity can be easily detected through "divergent transitions". Divergent transitions suggest that estimates may be biased due to poor exploration of the posterior, and they should be addressed before trusting the results of the model.
The stick-breaking prior is controlled by a beta(1, alpha) distribution; when alpha is small, the breaks tend to concentrate near one, producing highly skewed distributions of latent class sizes, which can cause numerical problems. We can confirm this behavior by fixing alpha to predefined values and examining the behavior of the divergences.
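To run this check against the sketch above, one hypothetical variant promotes alpha from a parameter to data so it can be fixed from outside the model:

```stan
data {
  // declarations as in the sketch above, plus:
  real<lower=0> alpha;  // concentration fixed to a predefined value
}
// the parameters block then drops alpha, and the model block drops the
// `alpha ~ gamma(0.25, 0.25);` prior statement
```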
We could eliminate the divergent transitions by hard-coding alpha (indeed, setting alpha = 1 is a common prior, equivalent to two randomly selected individuals having a 50-50 chance of being in the same latent class, which produces a small number of latent classes relative to the sample size; cf. BDA3, p. 553), but we prefer to keep our model more flexible.
We address this through several changes: first, we reparameterize the stick-breaking code on a logarithmic scale, making the model more numerically stable. Second, since the issue arises when alpha is very small, we replace the gamma(0.25, 0.25) prior on alpha with a boundary-avoiding lognormal(0, 0.25) prior. The lognormal(0, 0.25) prior has a similar mean (1.03, versus 1 for gamma(0.25, 0.25)), and it conveniently keeps alpha away from values that are too small. Moreover, it is more symmetric on the log scale, and therefore more computationally friendly to Stan's adaptive warmup.
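Sketched against the program above (again with our illustrative names, not the released code), the two changes look like this:

```stan
transformed parameters {
  vector[K] log_theta;        // log class weights via stick-breaking on the log scale
  real log_p0;
  {
    real log_rest = 0;        // log of the stick length remaining
    for (k in 1:(K - 1)) {
      log_theta[k] = log(v[k]) + log_rest;
      log_rest += log1m(v[k]);
    }
    log_theta[K] = log_rest;
  }
  {
    vector[K] lp0;
    for (k in 1:K)
      lp0[k] = log_theta[k] + sum(log1m(lambda[k]));
    log_p0 = log_sum_exp(lp0);
  }
}
model {
  alpha ~ lognormal(0, 0.25);  // boundary-avoiding prior replacing gamma(0.25, 0.25)
  v ~ beta(1, alpha);
  // likelihood as in the sketch above, with log(theta[k]) replaced by log_theta[k]
}
```

Accumulating log_rest on the log scale avoids underflow when the remaining stick mass becomes vanishingly small, which is exactly the regime that small values of alpha produce.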
Overall, these changes reduce the divergences significantly relative to the starting model.
A quick check of a prior-posterior plot of alpha shows that (a) we have successfully bounded alpha away from pathological values close to zero, and (b) the data are sufficiently informative to update the prior distribution.
Since the experiments above showed that the posterior outputs are not especially sensitive to the choice of alpha, even when alpha is fixed outright, we are reassured that this boundary-avoiding prior is an acceptable compromise between computational tractability and model flexibility.
Using our five simulated datasets, we compare the posterior distributions of the estimated population size N from the R LCMCR package with those from the first and final versions of our Stan implementation.
In all five cases, the posterior distributions generated by Stan were very similar to those generated by R; in two of the five simulations, the distributions were statistically indistinguishable by a two-sided two-sample Kolmogorov-Smirnov test (p > 0.05).
The Stan code recovered the true values about as well as the R code on each of the simulated datasets. Interestingly, the first version of the Stan code already performed comparably, even before we fixed the label switching and the divergences. However, resolving those warnings was necessary for us to remain confident in our code as we build further extensions to the model.
Next, we use a real-life dataset consisting of four overlapping lists documenting killings in the Kosovo war. This dataset is interesting because a subsequent project, the Kosovo Memory Book, has performed a near-complete census of the killings, providing N=10401 as a useful ground truth benchmark.
Our posterior distributions are qualitatively similar to those of the R LCMCR model, and our model estimates the true number of killings with a slightly narrower posterior 95% CI, though with a slightly larger posterior SD.
```r
# compare posterior summaries for the Kosovo data: R LCMCR vs. our Stan model
summaries <- read.csv(here("write", "input", "summaries", "summaries.csv"))

df_summaries_pretty <- summaries %>%
  filter(Dataset == "kosovo" & model %in% c("R", "LCMCR_4")) %>%
  mutate(model = recode(model, LCMCR_4 = "Stan", R = "R")) %>%
  arrange(model) %>%
  select(model, q025, q500, q975, ci_95_length, mean, sd) %>%
  rename(
    "2.5th percentile" = q025,
    "median" = q500,
    "97.5th percentile" = q975,
    "95% CI length" = ci_95_length,
    "SD" = sd
  )

knitr::kable(df_summaries_pretty)
```
model | 2.5th percentile | median | 97.5th percentile | 95% CI length | mean | SD |
---|---|---|---|---|---|---|
R | 8882.675 | 10463.50 | 13652.38 | 4769.700 | 10670.68 | 1207.445 |
Stan | 9005.275 | 10457.75 | 13679.02 | 4673.748 | 10673.43 | 1231.764 |
HRDAG's use of multiple systems estimation for studying human rights issues has evolved many times over the last twenty years, and our Stan implementation of LCMCR is the latest iteration. Stan expands the space of possible models that we can implement, allowing us to reach desired future extensions much faster.
Lastly, although we have confirmed that our Stan model behaves comparably to the R implementation on these simulated and empirical examples, we still need to examine their comparative behavior on more complicated, real-world data. Our Stan implementations and these experiments are available on GitHub under the GPL license at github.com/HRDAG/StanLCMCR.