In [1]:
from IPython.display import display, Math, Latex
from IPython.display import Image
Image(filename='../input/hrdag.png')
Out[1]:

Modeling the location of hidden graves in Mexico

A first look

23 March 2017
Patrick Ball, Human Rights Data Analysis Group

In this project, we use a RandomForest to model for each county (municipio) in Mexico, the probability of identifying a hidden grave (fosa clandestina) using a set of independent variables. In short, the model works -- more on this below.

This work is a collaboration between HRDAG and two partners. The Programa de Derechos Humanos of the Universidad Iberoamericana created the database of clandestine graves, and they wrote about this analysis in a blogpost here. DataCívica developed the database of predictor variables, and they have written a complementary analysis in a post here. Their reports provide substantially more context -- this post is primarily technical.

Scope and interpretation of this analysis

In the model below, we're predicting whether one or more clandestine graves will be discovered. Known graves are necessarily a subset of all graves because there are certainly graves that are not discovered. Therefore it is possible that there are graves with different perpetrators or different causes that would not be predicted by this model. This means, for example, that if only predictions like these were used to drive future searches for graves, then it’s possible (even likely) to miss graves that are unlike those that have been found before but are probably still out there.

This is an epistemological challenge. The interesting question is "where are the hidden graves?" But what the model answers is "in which counties will we discover graves that are like the graves we have found in the past?"

In this context, similarity --- graves like the ones we have found in the past --- means that the social production of knowledge about the graves found in the past will be similar to graves found in the future. The question of "findability" should therefore be central to thinking about these results. How were these graves found? Is there something about these counties, or these graves, which makes them findable, while other graves remain unfindable? Do these graves hold the victims of perpetrators who share some methods, while other perpetrators are more successful at hiding graves? We do not address those questions here.

The very high precision in the prediction may mean that counties in which graves become known are wildly different values on many variables relative to the counties which are thought to have very low probabilities of graves. This provides a direction for future research; more on this below.

Findings from this analysis

We can predict the counties (municipios) that have known clandestine graves (fosas clandestinas) in 2013 and 2014. Here are the steps we took:

  • The dependent variable, organized by the team from the Universidad Iberoamericana, contains data from 2013 and 2014 for municipalities in which graves were found and for municipalities the team defined as very unlikely to have graves (2013: 46 yes, 94 no; 2014: 80 yes, 94 no; it is a coincidence that the two "no" counts are equal). I call the counties for which the Ibero team defined whether a grave was found or not the "counties with known grave status."

  • A number of predictor variables were organized by the team from Data Cívica. We will not interpret these variables here. In prediction, the predictor variables are not considered important, which is quite different from inferential analysis. Instead, the quality of the prediction is measured by how close the predicted values are to the observed values. The ten most relevant variables are listed for reference.

  • For both years, the model was built by dividing the counties with known grave status randomly into two groups. Two-thirds of the counties were randomly selected to be used to train a Random Forest model, and the remaining one-third of the counties were held out and used for testing. I repeated the random division, training, and testing 500 times.

  • For both 2013 and 2014, in each of the 500 iterations, the testing data was predicted perfectly. That is, looking only at the counties that were hidden from the model training, the model was able to predict which counties would have known graves, and which counties the Ibero team defined as having very low probability of a grave.

  • By perfect prediction, I mean no false positives, no false negatives, AUC=1.0, brier loss < 0.01, in every iteration. Keep in mind that this result is on held-out testing data.

  • In the final step, I used the model trained with data from 2013, and with the predictor data from 2014 to predict the grave status for all counties in 2014. The prediction was almost perfect: all counties with graves were correctly predicted, but a few counties without graves had predictions slightly greater than zero. This finding implies that models from a previous year can successfully predict where graves will be discovered in the current year. This is the most exciting result, and is worth additional research.

Next steps

The most urgent application is to map the clandestine graves from 2016, and retest the model for that year. If it is as accurate as it is for 2013 and 2014, we could use its predictions for 2017 to guide the search for additional graves.

It would be interesting to continue identifying counties in which there is a very low probability of finding a grave. This would be particularly useful for 2015 and 2016. These counties enable the model to distinguish between patterns of predictor variables associated with finding graves, and patterns associated with a low probability of graves. The more counties we can define as non-grave finding, the more generalized the model can become, and the better its predictions will be about which counties are the best targets for additional searches.

The code

I've included the code below for a more detailed description of what we did. For more discussion of the result, I recommend the posts by DataCívica and the team from the Iberoamericana.

In [2]:
import pandas as pd
import numpy as np
from pprint import pprint

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.metrics
from sklearn.preprocessing import Imputer

from IPython.display import clear_output
In [3]:
def make_dcfo(targ_year):
    ''' read and merge datasets, impute missing values, and define fields_to_use '''
    # prep DataCivica independent variables 
    dc = pd.read_csv("../input/final_ibero.csv.gz", encoding='latin1')
    dc = dc[dc.year == targ_year]
    # drop variables that are all NULL 
    dc.dropna(axis=1, how='all', inplace=True)
    dc.set_index(['inegi'], inplace=True, drop=False)

    # convert to integer 
    dc.fedlib = dc.fedlib.map(
        lambda f: {'SI': 1, 'NO': 0}[f]).astype(int)

    # convert categorical to dummies
    carr = pd.get_dummies(dc.carretera, prefix='carr') 
    dc = pd.merge(dc, carr, left_index=True, right_index=True)
    dc.drop(['carretera'], axis=1, inplace=True)

    print("\nfound {} recs with {} non-null independent vars for year == {}".format(
            len(dc), len(dc.columns), targ_year))

    # prep Ibero's list of municipios with known fosas 
    fo = pd.read_excel("../input/fosas_patrick.xlsx", 
                       sheetname=str(targ_year))
    fo.rename(columns={'Clave de municipio (INEGI)': 'inegi',
                       'Presencia de fosas': 'fosas'}, 
               inplace=True)
    if targ_year == 2013:
        assert len(fo[fo.fosas == 1]) == 46
    elif targ_year == 2014:
        assert len(fo[fo.fosas == 1]) == 80

    # keep only the muni code and the fosa count
    fo = fo[['inegi', 'fosas']]
    fo.set_index(['inegi'], inplace=True, drop=False)
    print(fo.info())

    # set these for the rest of the run
    fields_to_exclude = set(['year', 'inegi', 'munname', 'entidad', 'fosas'])
    fields_to_use = [f for f in dc.columns if f not in fields_to_exclude]
    
    # merge Ibero's fosas data with DC's independent variables 
    dcfo = pd.merge(dc, fo, how='left', on='inegi')
    dcfo[dcfo.fosas >= 1] = 1
    print(dcfo.fosas.value_counts(dropna=False))
    
    # fill in all the missing values. RandomForest cannot use municipios 
    # with missing values, so we replace missing values with the column's
    # median value. This is a standard approach for missing data with RandomForests
    # and I don't think it makes much difference here. However, if we can 
    # improve the predictor variables to reduce the missing data, that would be
    # slightly better. 
    imputer = Imputer(missing_values="NaN", strategy="median", axis=0)
    dcfo_X = imputer.fit_transform(dcfo[fields_to_use])
    dcfo_X = pd.DataFrame(dcfo_X, columns=fields_to_use)

    dcfo.drop(fields_to_use, axis=1, inplace=True)
    dcfo = pd.concat([dcfo, dcfo_X], axis=1)
    return dcfo, fields_to_use, fields_to_exclude
In [4]:
def dcfo_split(dcfo, train_pct=0.5, verbose=False):
    ''' divide dcfo with known fosa state randomly by train_pct '''
    dcfo['rand'] = np.random.random(len(dcfo))
    labeled = dcfo[pd.notnull(dcfo.fosas)]
    train = labeled[(labeled.rand <= train_pct) & (labeled.fosas >= 0)]
    test = labeled[(labeled.rand > train_pct) & (labeled.fosas >= 0)]
    dcfo.drop(['rand'], inplace=True, axis=1)
    del labeled
    if verbose:
        print("train:")
        print(train.fosas.value_counts())
        print("test:")
        print(test.fosas.value_counts())
        print("test and training ready.")
    return train, test 
In [5]:
def train_and_chk(train, test, fields_to_use, tol=1, verbose=False, auc=False):
    ''' create and test a RF model '''
    def predict_and_chk(model, df, tol=1, verbose=verbose, auc=auc):
        ''' convenience function to chk training and testing classification '''
        Y = model.predict(df[fields_to_use])
        cf = sklearn.metrics.confusion_matrix(df.fosas, Y)
        assert (cf[0,1] + cf[1,0]) <= tol
        if verbose:
            print(cf)
        if verbose and auc:
            Y = model.predict_proba(df[fields_to_use])[:, 1]
            fpr, tpr, thresholds = sklearn.metrics.roc_curve(df.fosas, Y, pos_label=1)
            auc = sklearn.metrics.auc(fpr, tpr)
            print('AUC:', auc)
            brier = sklearn.metrics.brier_score_loss(df.fosas, Y, pos_label=1)
            print('brier:', brier)
    # This is the algorithm
    cls = RandomForestClassifier(n_estimators=200)
    # This trains the model
    model = cls.fit(train[fields_to_use], train.fosas)
    # this checks the training: it should be close to perfect
    predict_and_chk(model, train, tol=tol)
    predict_and_chk(model, test, tol=tol, auc=auc)
    return model 

# example use
print("2013:")
dcfo, fields_to_use, fields_to_exclude = make_dcfo(2013)
train, test = dcfo_split(dcfo, verbose=True, train_pct=0.66)
model = train_and_chk(train, test, fields_to_use, tol=1, verbose=True, auc=True)
print("\n2014:")
dcfo, fields_to_use, fields_to_exclude = make_dcfo(2013)
train, test = dcfo_split(dcfo, verbose=True, train_pct=0.66)
model = train_and_chk(train, test, fields_to_use, tol=1, verbose=True, auc=True)
del dcfo, fields_to_use, fields_to_exclude, train, test, model 
2013:

found 2316 recs with 49 non-null independent vars for year == 2013
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2457 entries, 1001 to 32058
Data columns (total 2 columns):
inegi    2457 non-null int64
fosas    2457 non-null int64
dtypes: int64(2)
memory usage: 57.6 KB
None
-1    2176
 0      94
 1      46
Name: fosas, dtype: int64
train:
0    60
1    30
Name: fosas, dtype: int64
test:
0    34
1    16
Name: fosas, dtype: int64
test and training ready.
[[60  0]
 [ 0 30]]
AUC: 1.0
brier: 7.36111111111e-05
[[34  0]
 [ 0 16]]
AUC: 1.0
brier: 0.0020355

2014:

found 2316 recs with 49 non-null independent vars for year == 2013
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2457 entries, 1001 to 32058
Data columns (total 2 columns):
inegi    2457 non-null int64
fosas    2457 non-null int64
dtypes: int64(2)
memory usage: 57.6 KB
None
-1    2176
 0      94
 1      46
Name: fosas, dtype: int64
train:
0    63
1    29
Name: fosas, dtype: int64
test:
0    31
1    17
Name: fosas, dtype: int64
test and training ready.
[[63  0]
 [ 0 29]]
AUC: 1.0
brier: 7.25543478261e-05
[[31  0]
 [ 0 17]]
AUC: 1.0
brier: 0.0013171875

code discussion

The code in the train_and_chk function above estimates a RandomForest with the training data, checks that the training data has been properly classified, and then checks that the testing data has also been properly classified.

In each case, "properly classified" is measured by the confusion matrix (also see the simpler explanation here but note that sometimes the Yes/No rows and columns are interchanged -- in both of these explanations they have the TP in the upper left and the TN in the lower right, which is the opposite of how it is shown here, sorry). A confusion matrix's upper left cell shows the number of municipios that the Ibero's team classified as having no fosas and the model classifies as having no fosas; these are called "true negatives" (TN); the upper right cell shows the number of municipios that are classified by the Ibero team as having fosas but by the model as not having them, these are "false negatives" (FN); in the lower right corner are the municipios that Ibero's team has identified as having fosas and in which the model predicts them, "true positives" (TP); the final cell in the lower left are "false positives" (FP).

   TN FN 
   FP TP

The confusion matrices shown here have all the cases in the upper left and lower right, which indicate perfect prediction. This is tested in each iteration.

In [6]:
# this cell runs the split and train functions some number of times
# (500 is set here). 
def iterate_RF(targ_year):
    dcfo, fields_to_use, fields_to_exclude = make_dcfo(targ_year)
    
    print('{} fields to use:\n'.format(len(fields_to_use)), fields_to_use)
    importances = pd.DataFrame(fields_to_use, columns=['field'])
    
    print(importances.info())
    
    preds = dcfo[['inegi', 'munname', 'fosas']].copy()
    pfields = list()
    loops = list()

    for i in range(500):
        if i % 10 == 0:
            loops.append(str(i))
            clear_output()
            print(','.join(loops))
        pfield = 'p{:02}'.format(i)
        pfields.append(pfield)

        train, test = dcfo_split(dcfo, train_pct=0.66)
        model = train_and_chk(train, test, fields_to_use, 
                              tol=0, verbose=False, auc=False)
        
        importances[pfield] = model.feature_importances_
        preds[pfield] = model.predict_proba(dcfo[fields_to_use])[:, 1]
        
    preds['prob'] = preds[pfields].mean(axis=1)
    preds['std'] = preds[pfields].std(axis=1)
    preds.sort_values(['prob'], ascending=False, inplace=True)
    print(preds[preds.fosas == -1][['inegi', 'munname', 'prob', 'std']].head(10))
    
    importances['prob'] = importances[pfields].mean(axis=1)
    importances['std'] = importances[pfields].std(axis=1)
    importances.sort_values(['prob'], ascending=False, inplace=True)
    print(importances[['field', 'prob', 'std']].head(10))

print('done.')
done.

results of interate_RF()

This function divides the data into training and testing sets and estimates a Random Forest model 500 times. The primary goal of the function is to makes sure that in all the random divisions of the data into training and testing sets, the testing set is always predicted perfectly.

As a secondary benefit, the ten counties classified as most likely to have graves are listed. The probabilities estimated are not very high -- they are all below 0.5 -- so this is not a strong indication that there will be graves found in these counties. Nonetheless, these are the counties most similar to the counties in which graves were found.

The next table describes the ten variables which have the greatest impact on the model. This paper explains how to interpret the variable importances; for the purpose of this analysis, we only want to know approximately which kinds of measures are most interesting for the prediction.

In [7]:
iterate_RF(2013)
0,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490
      inegi                         munname     prob       std
1795  26009                        Bacanora  0.33254  0.043286
1827  26044                          Onavas  0.31114  0.048255
1302  20365         Santa Catarina Lachatao  0.30481  0.052016
1291  20352             San Simon Zahuatlan  0.28520  0.050069
1084  20116           San Bartolome Ayautla  0.24341  0.048987
1255  20314            San Pedro Juchatengo  0.24149  0.049517
1400  20480               Santiago Nundiche  0.24054  0.047584
1359  20431          Santa Maria Tecomavaca  0.23957  0.049587
1388  20463             Santiago Huauclilla  0.23681  0.036848
1003  20027  Chiquihuitlan de Benito Juarez  0.22439  0.047628
                 field      prob       std
12        gradopromesc  0.067015  0.018563
6   espinsufSecundaria  0.065856  0.017422
8   matinsufSecundaria  0.065680  0.017477
31                labs  0.065366  0.017697
0                  pob  0.064919  0.018032
2     espinsufPrimaria  0.064824  0.018517
13          ingmunreal  0.064615  0.018160
4     matinsufPrimaria  0.064585  0.019291
34                opio  0.064312  0.017737
22            frontsur  0.043114  0.021562
In [8]:
iterate_RF(2014)
0,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490
      inegi                 munname     prob       std
1834  26055   San Luis Rio Colorado  0.37364  0.075789
1826  26043                 Nogales  0.32128  0.078914
12     2002                Mexicali  0.31531  0.081387
1780  25012                Mazatlan  0.28733  0.042276
1889  28022               Matamoros  0.28047  0.068896
43     5013                 Hidalgo  0.24563  0.070171
1224  20276   San Miguel Santa Flor  0.24176  0.045271
1470  20563                  Yogana  0.24158  0.045164
1828  26045                 Opodepe  0.24098  0.035394
1263  20322  San Pedro Ocopetatillo  0.24071  0.042680
           field      prob       std
4     ingmunreal  0.094744  0.023172
25          opio  0.094308  0.022609
0            pob  0.094100  0.021876
3   gradopromesc  0.093441  0.021415
24       heroina  0.093044  0.022422
13      frontsur  0.063256  0.027836
22          labs  0.062966  0.027926
6      distfront  0.060752  0.023381
12    frontnorte  0.060175  0.022323
5       munfront  0.059688  0.023152
In [9]:
dcfo2013, fields_to_use2013, fields_to_exclude2013 = make_dcfo(2013)
dcfo2014, fields_to_use2014, fields_to_exclude2014 = make_dcfo(2014)

fields_to_use = list(set(fields_to_use2013) & set(fields_to_use2014))

# generate a model with 2013 known-fosas
train, test = dcfo_split(dcfo2013, train_pct=0.66)
model = train_and_chk(train, test, fields_to_use, tol=0, verbose=True, auc=True)

# predict 2014 munis with 2014 predictor data 
dcfo2014['prob'] = model.predict_proba(dcfo2014[fields_to_use])[:, 1]
dcfo2014.groupby(['fosas']).prob.mean()
found 2316 recs with 49 non-null independent vars for year == 2013
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2457 entries, 1001 to 32058
Data columns (total 2 columns):
inegi    2457 non-null int64
fosas    2457 non-null int64
dtypes: int64(2)
memory usage: 57.6 KB
None
-1    2176
 0      94
 1      46
Name: fosas, dtype: int64

found 2316 recs with 40 non-null independent vars for year == 2014
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2457 entries, 1001 to 32058
Data columns (total 2 columns):
inegi    2457 non-null int64
fosas    2457 non-null int64
dtypes: int64(2)
memory usage: 57.6 KB
None
-1    2142
 0      94
 1      80
Name: fosas, dtype: int64
[[58  0]
 [ 0 27]]
AUC: 1.0
brier: 0.000211470588235
[[36  0]
 [ 0 19]]
AUC: 1.0
brier: 0.000569545454545
Out[9]:
fosas
-1    0.034374
 0    0.009787
 1    1.000000
Name: prob, dtype: float64

Acknowledgements

I am very grateful to my HRDAG colleague Dr Kristian Lum for helpful suggestions about how to explain this work. I want to thank the teams from the Universidad Iberoamericana and from DataCívica for their work organizing the data and thinking about what it means.

In [ ]: