We’re helping partners to make sense of about 360,000 Spanish-language chat messages–WeChat, Facebook, SMS–extracted from cell phones that belonged to the perpetrators of a serious human rights crime. Our goal is to identify the messages that are most interesting to analysts and to provide analysts with an extension of basic “search and sort” to more easily work through the full set. Our work is similar to a recent project that ProPublica did with reports of hate crimes.

The analysts have defined about 40 selectors. These are themes that might be in the messages, like “refers to violence,” or “coordinating with informers.” Our partners reviewed about 33,000 messages. For each message, the reviewers indicated if any of the selectors was relevant to the message. Note that for each message, any number of selectors might be relevant. Most of the reviewed messages (91%) were not relevant to any selector, 4% of the messages were relevant to only one selector, and 0.06% of the messages were relevant to 10 or more selectors. Our work was focused on narrowing the dataset so that the investigators could make sure to review the most relevant messages.

We did this work with the International Human Rights Law Clinic (IHRLC) at the UC Berkeley School of Law. The IHRLC pursues a dual mission: promoting justice at home and abroad and training attorneys for a changing profession. IHRLC is one of the leading human rights clinics in the country and marshals the resources of the faculty and students of UC Berkeley to advance the struggle for human rights on behalf of individuals and marginalized communities.

Introduction to methodology

Let’s set up a little bit of notation. We have a population of messages mi,i=1..360,000. We have selectors sk,k=1..40. We have a labeled sample mjk,k=1..33,000. Our goal is to estimate the probability for each message that each selector is relevant, pik.

The simplest way to understand the problem is that it is a form of binary classification. We can develop some kind of features or independent variables for each message, then predict each selector with a logistic regression or random forest or similar algorithm. This is in essence what we did, but there are a few interesting quirks.

We were acutely aware that most of the messages are irrelevant to our partners. Only a few messages–about 3000–had one or more selectors identified, and many selectors had only a few relevant messages. The sparseness of the selector relevance means that we needed to be careful not to include too many predictor variables. Therefore a big part of the feature generation is concentrating the useful information into a small number of variables; this is a kind of dimensionality reduction.

Each step of this project, including data prep, feature generation, modeling, and evaluation, was performed using a combination of R and Python, and we used feather for seamless data sharing across tasks and environments.


There’s not quite enough data to do feature discovery, so we had to generate features manually. There are some obvious features: size of the message, time and date information, the presence and type of attachment. Individual messages are bundled into conversations, and we can generate a few more features from the conversation-level: number of messages in the conversation, time period during which the conversation took place, the app used for the conversation (e.g. WhatsApp or Facebook Messenger). However, most of the useful information is in two less-obviously-quantifiable elements: the content of the message, and the people involved in the conversation.

We decided early on that the best way to use message content for selector prediction was to perform topic modeling on the text of the chat. This involves three steps: preparing data; creating document-word and term frequency-inverse document frequency vectors; and modeling the topics.

Data preparation is the easiest. We started with a csv file with two columns, a message id and the content of the chat message. After reading it into a pandas dataframe, we clean it up in python like this:

    texts = list()
    for chat in docs.content:
        chat = unidecode(chat.lower())
        chat = punct_re.sub("", chat)
        text = [word.strip() for word in chat.split()
                if word not in stoplist and _good_w(word)]
        text = [swap.get(word, word) for word in text]
        text = [lemmadict.get(word, word) for word in text]
        texts.append(' '.join(text))

The unidecode function strips accents; the punct_re regular expression strips punctuation; the _good_w() function just tells us if the token is not a number and more than 2 characters; swap and lemmadict standardize word forms (Spanish lemmatization lists can be found here). Stoplist contains uninformative words (Spanish stopwords can be found here). The result is a “bag of words”, a list of unique words in each chat.

Now it’s easy to use python’s scikit-learn package to create the term frequency/inverse document frequency (tfidf) matrix:

    doc_vzr_tfidf = TfidfVectorizer().fit(texts)
    doc_tfidf = doc_vzr_tfidf.transform(texts)

This matrix represents the presence of each word (“term”) in each document with a weight that indicates how rare the word is (rare words are more interesting than common words).

Now we can model “topics,” that is, groups of words that generally occur together. We looked at latent dirichlet allocation (LDA) but we chose non-negative matrix factorization (NMF) because our tests found that the NMF word clusters seemed more coherent to us. Our seat-of-the-pants finding seems to be consistent with others’ comparisons of LDA with NMF for very short documents like chats and tweets (see e.g., Chen et al. 2019).

We spent some time staring at the topics, but we don’t think they have a lot of substantive interpretation. One of the topics seemed to include mostly profane words and words that refer to genitals; another seemed to refer to money and payments; while a third seemed to collect words about movement, time, and location. Our goal is to organize the words into a few topics for predictions, not to interpret the topics, so we tried not to get too caught up in this step.

The second important aspect of messages is who’s talking: who are the particpants in a chat? The participants are in some ways like words in the chats: there are many more participants than we can include as independent variables, so we’d like to group the participants into cliques. Like topics, the cliques don’t really have any substantive interpretation. We’re just looking for a way to reduce the information from the chat participants into a limited number of independent variables. We used the same procedure to group participant cliques that we used to group words: we organized the names and phone numbers into references to unique people, created a message-participant matrix, transformed it to tf-idf, fit an NMF, and predicted the results.

These three sets of variables were our predictors: simple metadata; content in topics; and participants in cliques. Now we’re ready to predict selectors.


As we suggested above, we’ll start by fitting simple binary classifiers to predict each selector. Some of the selectors had very few relevant messages, so we decided to predict only selectors that had at least 100 relevant messages. We split the labeled data randomly into training and test sets, fit the classifiers, and assessed the fit using the held-out test data. We tuned hyperparameters a little bit, using cross-validation for L1 and L2 regularization for logistic regression, and the number of trees for the random forest. The random forest produced consistently slightly better fits (measured by AUC against the test data), so we chose it.

But we can do a little better. Remember that each message can be relevant to zero, one, or many selectors. This can be understood as a multi-label problem, and the advantage is that we can predict the selectors in a sequence. The first selector is predicted by itself, but the second predictor uses the same set of independent variables plus the prediction from the first selector. This chained prediction gains power if the selectors tend to occur (or not occur) together. It’s sensitive to the order of the selectors introduced into the model, so we run it a number of times and take the mean predictions over all the iterations. scikit-learn has a great way to do chained estimates. Here’s how we did it.

def chained(deps, ch=None, Y_train=None, X_train=None, Y_test=None,
            X_test=None, X_all=None, yamlfile=None):

    clf = RandomForestClassifier(n_estimators=150, n_jobs=-1)

    chains = [ClassifierChain(clf, order='random', random_state=i)
              for i in range(10)]

    for chain in chains:
        chain.fit(X_train, Y_train)
    logging.info("chains fit.")

    y_preds = np.array([chain.predict_proba(X_test)
                        for chain in chains]).mean(axis=0)

    aucs = [float(roc_auc_score(Y_test.iloc[:, i], y_preds[:, i]))
            for i in range(len(deps))]
    briers = [float(brier_score_loss(Y_test.iloc[:, i], y_preds[:, i]))
              for i in range(len(deps))]
    logging.info(f"chained tests done.")

    preds = pd.DataFrame(np.array(
        [chain.predict_proba(X_all) for chain in chains]).mean(axis=0))

    preds['msg_id'] = ch.msg_id
    cols = deps + ['msg_id']
    preds.columns = cols
    logging.info("chained predictions made")
    return aucs, briers, preds


We used area under the ROC curve on the held-out test data to compare different models. Imagine we want to review messages that are relevant to a particular selector. We’ll start by sorting all of the messages according to the predicted probability for that selector. The messages now at the top of the list should be likely to be relevant to the selector, or in other words, true positives. As we work our way down the list, though, we’ll start to see more false positives, so that if we want to continue collecting and reviewing truly relevant messages we’ll have to pay the cost of reviewing additional false positives. The ROC curve, by plotting false positive rate against true positive rate, lets us visualize the tradeoff between the two.

Helpfully, the area under this curve can be interpreted as the probability that a random positive instance will have a higher score than a random negative instance, making it a good metric for the use-case we’re describing here. Our chained random forest model performed well on held out test data by this metric, achieving an AUC of over .9 for most selectors (a perfect classifier would have an AUC of 1).

We were still concerned about generalizability, though. The training data we had used to build and evaluate the models consisted of the messages that analysts had already labeled, but they had not reviewed messages randomly. For example, some of the conversations in the corpus contributed hundreds or thousands of labeled messages to the training data, while other conversations were entirely unrepresented. In order to verify that our model would still be useful when applied to the rest of the unlabeled data, we set up a randomized test. We selected just under 1100 messages, including some at different score ranges for each selector, and asked the analysts to tag these messages with selectors. The resulting AUCs (for each selector) were a little bit lower than what we’d calculated on the original test data, but still acceptable – around .77-.87 depending on selector.

Since the scores are not necessarily perfectly calibrated, we couldn’t simply compare scores for one selector to scores for another in order to summarize performance over all selectors. Instead, we converted scores to their associated rank within each selector category, and created an overall ROC curve for all selectors using the ranks. Here of course we’re switching to R.

comparisons <- model_scores %>%
    gather(selector, score, -msg_id) %>%
    inner_join(hand_labels %>%
        gather(selector, label, -msg_id), by = c("msg_id", "selector"))

comparisons %>%
    group_by(selector) %>%
    mutate(score_rank = rank(score)) %>%
    ungroup %>%
    arrange(desc(score_rank)) %>%
    mutate(true_pos_rate = cumsum(label) / sum(label),
           false_pos_rate = cumsum(1 - label) / sum(1 - label)) %>%
    ggplot(aes(x = false_pos_rate, y = true_pos_rate)) +
    geom_point(size = 1, stroke = 0) +
    geom_abline(slope = 1, intercept = 0, linetype = 3) +
    scale_x_continuous(labels = percent) +
    scale_y_continuous(labels = percent) +
    ggtitle("Combined ROC curve for all selectors")
single ROC curve for all selectors, for the final chained random forest model

single ROC curve for all selectors, for the final chained random forest model


We provided our predictions to our partners in the form of a spreadsheet with the original chat message data with an additional column for each selector containing our model’s predicted probability that the selector is relevant to the message. This spreadsheet can be sorted to quickly identify messages of interest.