1 Motivation

The statistical core of HRDAG’s methodology is to estimate undocumented events using several intersecting datasets, a technique called multiple systems estimation (MSE). Since 2016, HRDAG has used a fully Bayesian technique using latent class models, called LCMCR, with a corresponding R package.

The R package is a highly optimized Gibbs sampler, which has both pros and cons. On the one hand, the technique is fast and reliable, and has produced useful estimates for HRDAG over the last six years. On the other hand, the heavy optimization combined with the nature of Gibbs sampling makes it difficult to extend the model to accommodate desired future features.

To increase the future extensibility of our models, we have implemented LCMCR in Stan, an easy-to-use platform for statistical modeling, where we input the model in the declarative format of the Stan probabilistic programming language. All of the Stan code, and the code for running the diagnostic experiments below, are available on Github.

2 Diagnosing the Stan implementation

As a first order of business, we generate five simulated datasets (all of which have a true underlying population size of N=2000), and compare the outputs between the R LCMCR package and our Stan implementation.

The first simulation is identical to the simulation used in Manrique-Vallier (2016). There are two strata, with 90% of the population falling into the first strata, and 10% falling into the remaining strata, and five lists. The probability of being observed in a list by strata is as follows:

pacman::p_load(here, yaml, dplyr, magrittr, LCMCR)

simulations <- read_yaml(here("write", "hand", "simulations.yaml"))
datasets_raw <- simulations$data

extract_table_df <- function(dataset_rows) {
    strata <- c()
    ns <- c()
    l1 <- c()
    l2 <- c()
    l3 <- c()
    l4 <- c()
    l5 <- c()
    for (i in 1:length(dataset_rows)) {
        raw_row <- dataset_rows[[i]]
        strata <- c(strata, i)
        ns <- c(ns, raw_row$size)
        l1 <- c(l1, raw_row$probs[[1]])
        l2 <- c(l2, raw_row$probs[[2]])
        l3 <- c(l3, raw_row$probs[[3]])
        l4 <- c(l4, raw_row$probs[[4]])
        l5 <- c(l5, raw_row$probs[[5]])

    data.frame(Stratum=strata, N=ns, `List 1`=l1, `List 2`=l2, `List 3`=l3, `List 4`=l4, `List 5`=l5)

df_table <- extract_table_df(datasets_raw[[1]]$rows)
Stratum N List.1 List.2 List.3 List.4 List.5
1 1800 0.033 0.033 0.099 0.132 0.033
2 200 0.660 0.825 0.759 0.990 0.693

For the second simulated dataset, we lower the probabilities of observation, and increase the proportion of people in the second stratum:

df_table <- extract_table_df(datasets_raw[[2]]$rows)
Stratum N List.1 List.2 List.3 List.4 List.5
1 1600 0.013 0.013 0.099 0.132 0.033
2 400 0.560 0.825 0.759 0.990 0.593

For the third simulated dataset, we create an example with higher probabilities of observation:

df_table <- extract_table_df(datasets_raw[[3]]$rows)
Stratum N List.1 List.2 List.3 List.4 List.5
1 1600 0.113 0.113 0.199 0.232 0.133
2 400 0.860 0.925 0.859 0.990 0.793

For the fourth simulated dataset, we introduce three strata:

df_table <- extract_table_df(datasets_raw[[4]]$rows)
Stratum N List.1 List.2 List.3 List.4 List.5
1 1600 0.1 0.05 0.15 0.2 0.1
2 200 0.5 0.40 0.60 0.4 0.5
3 200 0.8 0.90 0.60 0.7 0.7

For the fifth simulated dataset, we introduce negative list dependencies across the strata:

df_table <- extract_table_df(datasets_raw[[5]]$rows)
Stratum N List.1 List.2 List.3 List.4 List.5
1 1600 0.113 0.113 0.199 0.832 0.733
2 400 0.860 0.925 0.859 0.190 0.193

We fit models to these examples using 1000 iterations and K=10 maximum latent classes. For Stan, we used four chains and 2000 warmup iterations. For the R LCMCR implementation, we used a buffer size of 10000, thinning=100, a_alpha=b_alpha=0.25, with a burn-in of 10000 iterations.

However, we notice two problems immediately:

  1. The effective sample size and split R-hat diagnostics for several of the variables are unacceptably high.
  2. The number of divergences reported by Hamiltonian Monte Carlo are unacceptably high (often around 30% of iterations are divergent).

2.1 Label switching

One of the reasons that the variables are poorly estimated by Stan is because of a problem in latent class models known as label switching. Fundamentally, unless we specify some ordering to the latent classes, there is no reason that “Latent Class 1” and “Latent Class 3” couldn’t swap places. For K latent classes, we have a highly multimodal posterior where each of the K! permutation of the class labels could serve as a valid labeling. In practice, this leads to poor mixing of the chains and unacceptably low effective sample sizes.