New results for the identification of municipalities with clandestine graves in Mexico
This post is a translation from the Spanish. It is co-authored by Mónica Meltis, Jorge Ruiz and Patrick Ball.
In March of this year we presented the first results from a joint research project in which we used a statistical model to predict the existence of clandestine graves (fosas clandestinas) in Mexican municipalities. These results are available on HRDAG, Data Cívica, and PDH Ibero. Since the results were published, this group of organizations (HRDAG, PDH Ibero and Data Cívica) has refined the statistical model, improving its accuracy and expanding its scope. The updated results of this process are presented below.
The goal of this project is to identify Mexican municipalities with a high probability of having clandestine graves. Knowing where to search will help to create better public programs regarding missing persons in Mexico.
We start from the premise that there is a universe of clandestine graves that are still undocumented either by the national and local press or by the local and federal governments. With this assumption as our starting point, our statistical model predicts which municipalities are likely to have still unobserved clandestine graves with characteristics similar to the graves that have already been found and identified.
In other words, starting from a dataset of clandestine graves identified in newspaper entries built by the PDH IBERO, and with additional information given by official government sources, we have identified municipalities in which none of these sources have thus far registered the existence of clandestine graves, but in which our model shows there is a high likelihood that they do in fact exist.
It is important to stress that, for the first stage of this project, we only used the dataset of clandestine graves identified in newspaper clippings built by the PDH IBERO. The predictions for the years 2015 and 2016 derived from this exercise were reported in June 2017. For the second stage of the project, the results of which we are reporting now, we also included in the model information given by government sources spanning the same period. This allows us to compare and evaluate the accuracy of our model.
How does the model work? What are the new results?
To start, we assigned to each Mexican municipality one of three possible values: 1, 0 or -1. We gave a 1 when the existence of clandestine graves (fosas) in that specific municipality was reported in the press. The value of 0 was given by the team and, for statistical purposes, means there are no graves in the municipality. It was given to 100 municipalities for which, considering contextual information, we believe the probability of there being clandestine graves is close to zero. The rest of the municipalities, all without press reports of clandestine graves, were assigned a value of -1. It is to these municipalities that the model assigns probabilities, using the characteristics of the municipalities classified with either a 0 or a 1. After this process, other geographic, sociodemographic and violence variables were included for each municipality. Using this information, the model, called a ‘random forest’, learns to identify municipalities that are similar in terms of these variables and can then assign to each one a specific probability of the existence of clandestine graves.
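The three-value coding described above can be sketched in a few lines. This is only an illustration: the function names, the set-based inputs and the municipality identifiers "A" to "D" are hypothetical and do not come from the project's code.

```python
def assign_labels(all_ids, press_reported, assumed_no_graves):
    """Code each municipality as 1, 0 or -1, following the scheme in the text."""
    labels = {}
    for municipality in all_ids:
        if municipality in press_reported:
            labels[municipality] = 1    # graves reported in the press
        elif municipality in assumed_no_graves:
            labels[municipality] = 0    # judged by the team to have no graves
        else:
            labels[municipality] = -1   # unknown: the model will score these
    return labels


def split_by_label(labels):
    """Separate municipalities used for training (0 or 1) from those to be scored (-1)."""
    training = sorted(m for m, value in labels.items() if value in (0, 1))
    to_score = sorted(m for m, value in labels.items() if value == -1)
    return training, to_score


# Hypothetical identifiers, for illustration only.
labels = assign_labels(["A", "B", "C", "D"], press_reported={"A"}, assumed_no_graves={"B"})
training, to_score = split_by_label(labels)
```

In this sketch a press report takes precedence if a municipality were to appear in both input sets; that ordering is an assumption, not something the text specifies.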
The resulting classification is the product of training the model to identify the probability of a grave existing in a given municipality. The training consists of showing the model half of the observations at a time, randomly selected. For example, if we had a sample of 100 observations, we would show the model only 50 and “hide” the remaining 50. This leaves a held-out sample with which to test how well the model learned from the first 50 observations. Trained in this manner, the model generates classification trees that classify each municipality, using a unique combination of variables for each tree, and synthesizes a final prediction from the combination of trees.
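To make the idea concrete, here is a heavily simplified stand-in for a random forest, written with only the Python standard library. It is not the project's model: each "tree" is a one-split stump, each stump is fitted on a bootstrap resample (an assumption, following common random forest practice), and the rows are made-up toy values for two unnamed numeric variables. What it does share with the description above is the half shown, half hidden split, the random choice of variables per tree, and a final probability built from the votes of all trees.

```python
import random


def fit_stump(rows, labels, features):
    """Find the one-variable split with the fewest errors on the given sample."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            for high_side in (1, 0):  # label predicted when the value exceeds the threshold
                errors = sum(
                    (high_side if row[f] > threshold else 1 - high_side) != y
                    for row, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, high_side)
    return best[1:]


def stump_vote(stump, row):
    f, threshold, high_side = stump
    return high_side if row[f] > threshold else 1 - high_side


def fit_forest(rows, labels, n_trees=200, seed=0):
    """Fit many stumps, each on a bootstrap resample and a random subset of variables."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))  # variables tried per tree
    forest = []
    for _ in range(n_trees):
        sample = rng.choices(range(len(rows)), k=len(rows))
        features = rng.sample(range(n_features), k)
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], features)
        )
    return forest


def predict_probability(forest, row):
    """Share of trees voting 1, read as the probability of graves."""
    return sum(stump_vote(stump, row) for stump in forest) / len(forest)


def half_split(n, rng):
    """Randomly split indices 0..n-1 into a shown half and a hidden half."""
    indices = list(range(n))
    rng.shuffle(indices)
    return indices[: n // 2], indices[n // 2:]


# Made-up toy data: two numeric variables per municipality, labels 0 or 1.
rows = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.5, 2.5),
        (8.0, 9.0), (9.0, 8.0), (8.5, 7.5), (7.0, 9.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

shown, hidden = half_split(len(rows), random.Random(1))
forest = fit_forest([rows[i] for i in shown], [labels[i] for i in shown])
for i in hidden:
    # Compare the known label of each hidden row with the forest's probability.
    print(i, labels[i], predict_probability(forest, rows[i]))
```

A real analysis would use full decision trees from an established library rather than stumps; the sketch only shows how hiding half of the labeled observations gives a way to check the predictions.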
Once the model has been trained with these observations, it is shown the observations which were previously withheld to test how well it is able to classify municipalities with graves. This process results in a list of probabilities for each of the municipalities where press sources have not already reported the existence of graves. For 2015, these probabilities are in the following table:
The above table lists the municipalities with the highest probabilities of having clandestine graves. They are similar to those municipalities in which the press had already reported the actual presence of such graves. Notably, the top five municipalities were associated with probabilities higher than 85%. Apatzingán in Michoacán has the highest probability (93%) of having clandestine graves.
Having done this exercise, we can confirm that the model is capable of finding municipalities similar to those in which the press has confirmed the existence of graves. As mentioned before, we can be certain of this because the model was trained over multiple iterations, each using only half of the observations, to identify the municipalities in which the press reported the presence of clandestine graves in 2015 and 2016. Using only half of the information, the model has to classify municipalities accurately according to the presence or absence of clandestine graves.
For the data comprising 2015 and 2016, a thousand iterations for each year were made. The results for the first year can be seen in the following chart:
Positive predictive value for 2015:
The previous figure shows the positive predictive value of the model which uses only information derived from news outlets for 2015. The positive predictive value shows the frequency with which the model correctly identifies municipalities with previously observed clandestine graves. The prediction uses half of the information to predict the other half. This means that from the thousand iterations that were made, the model correctly predicted municipalities that had a value of 1 assigned in our lists 355 times. There were no instances of “false positives” (municipalities that the model predicted had graves when, in fact, there were none). The graph also shows how municipalities with clandestine graves were correctly classified more than half the time.
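For readers unfamiliar with the term, the standard definition of positive predictive value is the share of positive predictions that turn out to be correct: true positives divided by the sum of true and false positives. The sketch below computes it for one iteration under that definition. The function name and the input lists are illustrative assumptions, and the chart above may have been produced with a different procedure.

```python
def positive_predictive_value(predicted, observed):
    """True positives / (true positives + false positives); None if nothing was flagged."""
    true_pos = sum(1 for p, o in zip(predicted, observed) if p == 1 and o == 1)
    false_pos = sum(1 for p, o in zip(predicted, observed) if p == 1 and o == 0)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else None


# Made-up example: three municipalities flagged, two of them with observed graves.
example = positive_predictive_value([1, 1, 0, 1], [1, 0, 0, 1])
```

With no false positives, as reported above, this ratio equals 1 whenever the model flags at least one municipality.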
In this section we present the new results for the years 2015 and 2016, after having added to the model the recent information provided by government agencies. The results can be summarized with the following couple of points:
- With new information we received on the identification of clandestine graves by local and state attorney’s offices, we confirmed that several municipalities to which our model had previously assigned a high probability of having graves did, in fact, later prove to have such graves. This means our previous predictions of which municipalities were likely to have clandestine graves in 2015 and 2016 were confirmed with official information spanning the same time period. In other words, we proved that our model works. Some of the municipalities in which we predicted graves were likely, and in which the new information corroborates that they do in fact have graves, are: Lázaro Cárdenas, Michoacán; San Fernando, Tamaulipas; and Nogales, Sonora. (The first two had a probability higher than 80% in the previous table.)
- We generated new models for 2015 and 2016, now including the new information provided by official sources. Afterwards, we compared our previous results with our new results and found that both models assign several municipalities a high probability of having clandestine graves, although these graves remain unidentified. This will be expanded on below.
The following graphs show the probabilities that both models assign to each municipality, represented by colored dots. On the horizontal axis, the plotted probability is the one that resulted from the first model, which uses only information gathered from press outlets. The vertical axis plots the probability resulting from the second model, which includes the information provided by local and state attorney’s offices. The diagonal line crossing the graphs marks the places where both models assign the same probability.
In each graph, every individual municipality is represented by a single dot. The red dots are municipalities in which the press reported the existence of a clandestine grave. The blue dots represent municipalities in which local authorities have reported the existence of at least one clandestine grave. White dots represent municipalities in which we know there are no graves (those municipalities which were given a value of 0 after the context analysis). Lastly, green dots represent municipalities in which neither press nor official sources have reported the presence of clandestine graves.
The municipalities clustered in the top right corner of each graph are those to which both models assign a high probability of having clandestine graves. Clustered in the top left corner are municipalities to which the model that includes official information from state attorney’s offices assigns a high probability, but to which the original model, which uses only information from the press, assigns a low one. Conversely, the municipalities in the lower right corner are those to which the original model assigned a high probability of having clandestine graves but to which the more recent model assigns a low one.
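The reading of the graph's corners can be expressed as a small rule. In this sketch the 0.75 cutoff separating "high" from "low" is an assumption chosen for illustration, as are the function name and the returned labels; the graphs themselves do not define a fixed cutoff.

```python
def corner(p_press, p_official, cutoff=0.75):
    """Place a municipality in a corner of the graph.

    p_press is the horizontal axis (press-only model) and p_official is the
    vertical axis (model that adds official information).
    """
    high_press = p_press >= cutoff
    high_official = p_official >= cutoff
    if high_press and high_official:
        return "top_right"    # both models assign a high probability
    if high_official:
        return "top_left"     # only the model with official information is high
    if high_press:
        return "lower_right"  # only the press-only model is high
    return "lower_left"       # both models assign a low probability


# Made-up probabilities, for illustration only.
examples = [corner(0.9, 0.8), corner(0.2, 0.9), corner(0.9, 0.1), corner(0.1, 0.2)]
```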
The intersection of both models, which use two different sources of information on the existence of clandestine graves in the same time period, highlights the municipalities to which we should be paying special attention: the ones clustered in the upper right corner. These municipalities have been assigned high probabilities of having clandestine graves by both models.
The existence of these municipalities necessarily means one of two things. The first is that clandestine graves have already been discovered there but have either gone unreported or been reported without being registered in our database. The second is that there are in fact graves in these municipalities, but they have not been discovered yet. If the second possibility proves to be true, the high probabilities that our model assigns to these municipalities give us statistical reasons to believe that searches for graves should be conducted there.
Some of the municipalities where clandestine graves have remained unreported by both press and official sources, but to which our model assigns a high probability of the existence of these graves in 2015 are: Cuauhtémoc, Chihuahua; Ahomé, Sinaloa and Apatzingán, Michoacán. This list of municipalities is unsurprising, given the high levels of violence that these places have. The main thing is that thus far graves have not been reported there by either source.
Some of the municipalities where clandestine graves have remained unreported by both press and official sources, but to which our model assigns a probability upwards of 75% of having these graves in 2016, are: González and Altamira, Tamaulipas; Asunción Cacalotepec, Oaxaca; and Pueblo Nuevo, Durango.
When comparing the results with the previous year, our model with both sources of information estimates that in 2016 Cuauhtémoc, Chihuahua, for example, still has a probability higher than 82% of having clandestine graves. The same occurs with Apatzingán in Michoacán, which still has a probability of more than 87%.
In addition, for 2016, other municipalities proved to have high probabilities of having clandestine graves: Españita in Tlaxcala (with a 92% probability in the model with government information and a 75% probability in the model with only information from news sources), Atolinga, Zacatecas (with a 91% and 85% probability in each model, respectively) and Mazatlán, Sinaloa (with a 78% and 93% probability in each respective model).
Predicting the probabilities of the existence of clandestine graves at the municipal level limits the possibility of undertaking concrete search actions. This is a result of the wide variation in size among municipalities in Mexico, as well as of the lack of precise information regarding the exact location of the burial sites.
We therefore recognize the need to collect additional qualitative information related to the geographic characteristics of the precise localities where these graves have been found. Information like the type of terrain, the nearness or remoteness of roads and highways, the presence of rivers, cliffs or mountains, just to mention a few, would allow us to pinpoint the localities within a given municipality where we are more likely to find these types of graves.
In recognition of this, we find that it is fundamental to include in our analysis information produced by other actors, such as associations of families in search of missing relatives, journalists, forensic teams and government institutions, among others, which have gathered valuable data during their years spent searching.
In the following months, our research will be directed towards looking for these new sources of qualitative information and towards systematizing it in a way that allows us to make full use of it. By incorporating specific geographic information on the location of the gravesites, we will be able to improve our model’s ability to predict the specific locations where new searches should be conducted.
To achieve our goal, we issue a call to associations of families in search of missing relatives, civil society organizations, academic institutions, government agencies and other actors interested in helping us in this task to contact us at firstname.lastname@example.org or email@example.com. To the extent that we include new data in our model, we will be able to obtain better predictions and provide better, more accurate information to shape the design of public policy related to the search for missing people in Mexico.
We thank the Comisión Mexicana de Defensa y Promoción de Derechos Humanos and the journalist Darwin Franco for the updated information they shared with us pertaining to the identification of clandestine graves in 2015 and 2016 by local and state district attorneys. The updated results are possible thanks to this new information.