FAQ about the JEP-CEV-HRDAG data integration and statistical estimation project
1. Is there a single source of information about the victims of the armed conflict in Colombia?
No. Colombia has an extensive documentation process for victims of the armed conflict. Hundreds of institutions, victims’ organizations, and civil society organizations have focused their efforts on recording this information. However, each entity or organization carries out its documentation process within its own technical, logistical, social, and mission-related limitations. No entity or organization is able to document the complete universe of victims, because it is impossible for any one of them to reach every part of the country, find out exactly what happened, and identify all victims. In fact, large differences appear when we compare two major databases: the Single Registry of Victims of the Special Administrative Unit for Comprehensive Care and Reparation for Victims (Registro Único de Víctimas de la Unidad Administrativa Especial para la Atención y Reparación Integral a las Víctimas) and the Observatory of Memory and Conflict of the National Center for Historical Memory (Observatorio de Memoria y Conflicto del Centro Nacional de Memoria Histórica). They differ not only in the magnitude of violence they report, but also in the definitions they use and in the periods and regions they cover.
We included 112 databases in the project. Some data came from state institutions, such as the National Center for Historical Memory, the Colombian Institute for Family Welfare (Instituto Colombiano de Bienestar Familiar), the Agency for Reincorporation and Normalization (Agencia para la Reincorporación y la Normalización), the National Institute of Legal Medicine and Forensic Sciences (Instituto Nacional de Medicina Legal y Ciencias Forenses), the Special Jurisdiction for Peace (Jurisdicción Especial para la Paz), the Office of the Attorney General of the Nation (Procuraduría General de la Nación), the National Police (Policía Nacional), and the Single Registry of Victims, among others. Other data came from civil society organizations, such as the National Association of Peasant Users (Asociación Nacional de Usuarios Campesinos), the Colombia-Europe-United States Coordination (Coordinación Colombia Europa Estados Unidos), the Institute of Studies for Development and Peace (Instituto de Estudios para el Desarrollo y la Paz), the National Indigenous Organization of Colombia (Organización Nacional Indígena de Colombia), and Free Country (País Libre), among others.
2. Why not just use one of the existing databases?
All databases that document human rights violations, even those with a large number of records, have two types of missing data: missing fields and missing records.
Missing fields are limited to the registered data. All the databases have different fields: some are related to the violent event (e.g., year, municipality, department, and alleged perpetrator), while others are related to the victim (e.g., name, age, gender, and ethnicity). However, when organizations or institutions document violence, they do not always have complete information, and it is normal for some fields to be empty for some records. There are many situations in which this can occur. For example, there may be a record of a victim, but the exact day that the violence occurred is not known, or you may know a victim’s full name, but not their ID number or their ethnicity. In addition, fields that are filled in may contain incorrect information due to transcription or spelling errors.
Records are missing when information about a victim is not documented, which can happen for many reasons. For example, a victim or their relatives may not report violence due to fear. In other cases, violence occurs in locations that documentation groups cannot reach, or victims are documented as suffering from one type of violence when they really suffered another. Additionally, bodies that are found may not be identified, and other bodies may have been thrown into rivers or common graves with no record. Unlike the first type of missing information, we do not know how many records are missing entirely.
To address the two types of missing information, it was necessary to use statistical modeling: statistical imputation methods to complete missing fields, and multiple systems estimation (capture-recapture) to estimate the number of missing records.
3. What sources of information were used in the project?
We used 112 databases provided by 44 state institutions, victims’ organizations, and civil society organizations.
4. Who participated in the project?
The project was developed by the Special Jurisdiction for Peace (Jurisdicción Especial para la Paz – JEP), the Commission for the Clarification of Truth, Justice, Reintegration and Non-Repetition (Comisión para el Esclarecimiento de la Verdad, Justicia, Reintegración y No Repetición – CEV) and the Human Rights Data Analysis Group (HRDAG).
5. Is there a document detailing the project methodology?
Yes. The technical appendix presents the methods used in the project.
6. If we put all the databases together, would we be counting more victims when one victim is registered in various sources of information?
It depends. The direct union of all sources without exhaustive data preparation and identification of unique victims can result in double counting. For this reason, we used semi-supervised record linkage techniques, also known as “data deduplication,” to ensure that information was integrated correctly across sources, resulting in a single record for each unique victim.
7. Can’t the registries be joined simply by the victim’s ID number?
No. Not all registries document an ID number and, even when they do, transcription errors are common. Because there is no reliable unique identifier, it is necessary to use machine learning methods to deduplicate the records.
8. What sources of information did you link?
We integrated 112 databases provided by 44 state institutions, victims’ organizations and civil society organizations.
9. Do the integrated information sources contain all kinds of violence?
No. The project is focused on five types of violence: forced disappearance, forced displacement, homicide, child recruitment, and kidnapping.
10. What method did you use to do the record linkage?
We used 112 databases in the project. Since the same victim can be registered in more than one database or multiple times within the same database, it is necessary to link the records across databases. This process is also known as “deduplication” and it results in a single record for each victim. This avoids double counting and identifies every source that documented the victim. Given that we have trillions of possible pairs of records, it is not possible to link the records by hand. Instead, we use a semi-supervised machine learning approach, in which the model learns to link records based on examples given by an expert who manually reviews records. In the case of the project, the expert is known as an “oracle.” The oracle for this project was Michelle Dukich, who has dedicated her career to identifying pairs of records in different languages. For example, the oracle determines that Juan Pérez, murdered on September 1, 2020 according to database A, is the same victim as Juano Peres, murdered on September 1, 2020 according to database B. To do this, the oracle uses their intuition to establish whether or not the pair of records refers to the same person, based on the phonetic similarity of the names or on other criteria. The name alone is not always sufficient to determine whether two records refer to the same victim. For example, there are many cases where two different people have the same name. The oracle is able to identify that they are two different people by comparing, for example, the date or the place of the events. The oracle then analyzes a portion of the records and classifies the pairs of records into two groups: i) those that, given their shared characteristics (identification numbers, names, dates, etc.), surely are the same victim; and ii) those that surely are not the same victim.
The logic used by the oracle to decide whether or not two records correspond to the same person was translated into more than 60 criteria, such as phonetic similarity, differences between the dates of violent events, matches of the municipalities or departments, and names of the victim in a different order, among others. Based on these criteria and the 2,799,671 pairs of records analyzed by the oracle, the model was trained with the aim of imitating the decisions made by the oracle.
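As an illustration only, the kind of pairwise criteria described above can be sketched in a few lines of Python. This is not the project's code: the field names (name, date, municipality), the four features, and the use of a generic string-similarity ratio in place of a true phonetic measure are all assumptions made for the example.

```python
from datetime import date
from difflib import SequenceMatcher
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def pair_features(rec_a: dict, rec_b: dict) -> dict:
    """Compute a few illustrative comparison features for one record pair."""
    name_a, name_b = normalize(rec_a["name"]), normalize(rec_b["name"])
    return {
        # String similarity in [0, 1]; a crude stand-in for phonetic similarity.
        "name_similarity": SequenceMatcher(None, name_a, name_b).ratio(),
        # True when both names contain the same tokens, in any order.
        "same_tokens_any_order": sorted(name_a.split()) == sorted(name_b.split()),
        # Absolute difference between the two event dates, in days.
        "days_between_events": abs((rec_a["date"] - rec_b["date"]).days),
        "same_municipality": rec_a["municipality"] == rec_b["municipality"],
    }


# Hypothetical records modeled on the example in the text.
record_a = {"name": "Juan Pérez", "date": date(2020, 9, 1), "municipality": "X"}
record_b = {"name": "Juano Peres", "date": date(2020, 9, 1), "municipality": "X"}
features = pair_features(record_a, record_b)
print(features)
```

A classifier trained on pairs labeled by the oracle would take a vector of such features as input and output whether the pair is a match.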
11. How do we know that the oracle’s decisions were correct?
A possible concern is that the training data was labeled incorrectly. In December 2021, five members of the Information Analysis Group at the Special Jurisdiction for Peace reviewed the data labeled by the oracle. The five analysts reviewed 230,582, 185,187, 409,301, 197,983, and 354,279 record pairs, respectively. The decisions made by each analyst were compared with those made by the oracle. The results showed that the inter-rater reliability, measured as the proportion of agreement, was greater than 0.9 in four of the cases, while in the fifth case it was 0.76. In analyzing the records in which there were differences between the analysts and the oracle, we found that the oracle had identified pairs of records that referred to the same victim that the analysts had not. The analysts recognized that in these cases the records referred to the same person.
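The proportion of agreement mentioned above is simply the share of record pairs on which two reviewers made the same decision. A minimal sketch, using made-up labels rather than project data:

```python
def proportion_agreement(labels_a, labels_b):
    """Share of items on which two raters gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    agreements = sum(a == b for a, b in zip(labels_a, labels_b))
    return agreements / len(labels_a)


# Hypothetical decisions for ten record pairs (True = same victim).
oracle = [True, True, False, False, True, False, True, False, False, True]
analyst = [True, False, False, False, True, False, True, False, False, True]
agreement = proportion_agreement(oracle, analyst)
print(agreement)
```

Here the two reviewers disagree on one pair out of ten, so the proportion of agreement is 0.9.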
12. Do all sources of information have complete information about the victims, such as age, sex, ethnicity, or the group allegedly responsible for the violence?
No. By linking the records, we show that not all the victims have complete information. This type of missing information is known as “missing fields” and is limited to the registered data. All the databases have different fields: some are related to the violent event (such as the year, municipality, department, and alleged perpetrator), while others are related to the victim (name, surname, age, gender, and ethnicity, among others). However, when organizations or institutions document violence, they do not always have complete information. It is normal for some fields to be empty in some records. There are many situations in which this can occur. For example, there may be a record of a victim, but the exact day that the violence occurred is not known, or you may know a victim’s full name, but not their ID number or their ethnicity. In addition, fields that are filled in may contain incorrect information due to transcription or spelling errors. The fields that we keep in each of the integrated databases are: sex, ethnicity, age, alleged perpetrator, and municipality.
13. Why not just work with the records that have the complete information?
The variables with missing fields are: sex, ethnicity, age, alleged perpetrator, municipality, the indicator for whether a violent event is related to the armed conflict, and the indicator for whether a disappearance was an enforced disappearance. In statistics, the action of “completing” the missing fields is known as imputation. We will refer to this process as “statistical imputation” to avoid confusion with legal terms. There are different strategies for dealing with missing fields. One possibility is to delete records that do not have complete information. However, this approach has at least two problems. First, it assumes that the cases with missing information are similar in their characteristics to those that are not missing information. We can think about the consequences of this with a variable such as the department in the case of homicide. This assumption would imply that homicide victims for whom the department where the events occurred is not known have similar characteristics to those for whom it is known. This is not necessarily true, since it may happen that, for different reasons, homicide victims in one department are reported more than those in others. Second, it implies losing a high percentage of the information available, especially for homicide and enforced disappearance, given the high percentage of records without alleged perpetrators in the databases. Another solution could be to keep the records, but exclude those that do not have complete information for a particular analysis. However, this approach also ignores the possibility that the records that are missing the information are different from those for which the information is not missing. For example, if the alleged perpetrators differed between the records that were missing alleged perpetrator information and those that were not, we would draw different conclusions about the conflict. For these reasons, it is necessary to impute the missing fields using statistical models.
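The bias that comes from discarding incomplete records can be seen in a small worked example. The numbers below are invented for illustration and do not come from the project: two hypothetical perpetrator groups are responsible for 60 and 40 victims, but the perpetrator field is filled in much more often for one group than for the other.

```python
from fractions import Fraction

# Hypothetical true number of victims per alleged perpetrator group.
true_counts = {"group_A": 60, "group_B": 40}
# Hypothetical share of records in which the perpetrator field is filled in.
recorded_share = {"group_A": Fraction(9, 10), "group_B": Fraction(1, 2)}

# Records that survive if we keep only those with a known perpetrator.
observed = {g: true_counts[g] * recorded_share[g] for g in true_counts}

true_share_a = Fraction(true_counts["group_A"], sum(true_counts.values()))
complete_case_share_a = observed["group_A"] / sum(observed.values())

print(float(true_share_a))           # share of group_A among all victims
print(float(complete_case_share_a))  # share of group_A among complete records
```

Keeping only complete records leaves 54 records for group_A and 20 for group_B, so group_A appears responsible for about 73% of victims when its true share is 60%.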
14. How can the missing information be filled in?
The variables with missing fields are: sex, ethnicity, age, alleged perpetrator, municipality, the indicator for whether a violent event is related to the armed conflict, and the indicator for whether a disappearance was an enforced disappearance. In statistics, the action of “completing” the missing fields is known as imputation. We will refer to this process as “statistical imputation” to avoid confusion with legal terms. One might think that the missing values of a variable follow a similar distribution to the observed values of the variable conditioned on all of the observed values in the record. That is, we assume that a record that is missing information about a particular field is probably similar to a record that contains similar values for other characteristics. This assumption is known as “missing at random” (MAR). This is a plausible assumption in our case and indicates that the value for a missing field should be taken from records that are similar on other variables. This implies statistically imputing the missing fields using only the available variables. For example, it would imply statistically imputing the sex of the victim based on the department, age, year and ethnicity of the victim. This would be challenging for the model, so we created “support variables” to provide more information to the imputation model. The original databases include much more information that we could not standardize for the record linkage. Some databases document, for example, the profession of the victim. Others have information about the weapon that was used or the exact location where violence took place. Many databases have free text fields where they describe details about the violent events. All of this additional information could help improve the quality of the statistical imputation results. 
The support variables treat all of the heterogeneous information contained in the original records as a string of text and use a neural network to extract latent information from the original records that is correlated with the variables we want to impute. To statistically impute the missing fields, we used a method known as multiple imputation with a fully conditional specification. This method begins by filling in the missing values with a random value. Then the missing values in one variable are predicted using the information in all the other columns, including the support variables. Each of the other columns is then predicted sequentially using all of the other columns, including those that have been imputed already. The final completed dataset is known as a replicate. Each replicate has a random component, meaning that each replicate is slightly different from the others, reflecting the uncertainty of the imputation. In our case, the donor record from which the value used to fill in the missing information is taken changes for each replicate. This exercise is then repeated multiple times with different initial column orderings. We repeated this process 10 times.
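The loop described above can be illustrated with a toy sketch. This is a deliberately simplified stand-in, not the project's imputation model: it uses three hypothetical categorical variables, no support variables, and a naive rule that picks a donor at random among the records that agree with the incomplete record on the most other variables. It also assumes every variable has at least one observed value.

```python
import random


def impute_replicate(records, variables, seed, n_sweeps=5):
    """Return one completed copy of `records` (list of dicts; None = missing)."""
    rng = random.Random(seed)
    data = [dict(r) for r in records]  # work on a copy; originals stay untouched
    missing = {v: [i for i, r in enumerate(records) if r[v] is None] for v in variables}
    observed = {v: [i for i, r in enumerate(records) if r[v] is not None] for v in variables}

    # Step 1: fill every missing cell with a randomly chosen observed value.
    for v in variables:
        for i in missing[v]:
            data[i][v] = records[rng.choice(observed[v])][v]

    # Step 2: pick a random column ordering for this replicate, then repeatedly
    # re-impute each variable conditional on the current values of the others.
    order = variables[:]
    rng.shuffle(order)
    for _ in range(n_sweeps):
        for v in order:
            others = [u for u in variables if u != v]
            for i in missing[v]:
                def score(j):
                    # Number of other variables on which records i and j agree.
                    return sum(data[i][u] == data[j][u] for u in others)
                best = max(score(j) for j in observed[v])
                donors = [j for j in observed[v] if score(j) == best]
                data[i][v] = data[rng.choice(donors)][v]
    return data


# Hypothetical records with one missing field each in three of them.
records = [
    {"sex": "M", "age_group": "adult", "department": "D1"},
    {"sex": "F", "age_group": "adult", "department": "D2"},
    {"sex": None, "age_group": "adult", "department": "D1"},
    {"sex": "M", "age_group": None, "department": "D1"},
    {"sex": "F", "age_group": "child", "department": None},
    {"sex": "M", "age_group": "child", "department": "D2"},
]
variables = ["sex", "age_group", "department"]
replicates = [impute_replicate(records, variables, seed=s) for s in range(10)]
```

Each of the 10 replicates is a complete dataset. Because donors are drawn at random, the imputed values can differ between replicates, which is how the uncertainty of the imputation is carried forward.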
15. If there are 10 replicates of the statistical imputation, how are their results combined?
We use the standard approach for combining results from multiply imputed datasets, which assumes the results are normally distributed and then uses the laws of total expectation and total variance to derive a point estimate of the mean and the associated approximate 95% credible interval. This approach is outlined in detail in Section 18.2 of Bayesian Data Analysis. Results from multiple systems estimation generally are not normally distributed, but applying a log transformation makes the posterior distributions approximately normal.
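A minimal sketch of this kind of combination follows, under stated assumptions: each replicate is summarized by a mean and a variance on the log scale, the between-replicate variance carries the usual (1 + 1/m) finite-sample factor, and the interval uses the normal quantile 1.96. The function name and the input numbers are hypothetical and are not project results.

```python
import math
from statistics import mean, variance


def combine_replicates(log_means, log_variances, z=1.96):
    """Pool per-replicate log-scale summaries into a point estimate and interval."""
    m = len(log_means)
    if m < 2 or m != len(log_variances):
        raise ValueError("need at least two replicates with matching variances")
    pooled_mean = mean(log_means)          # law of total expectation
    within = mean(log_variances)           # average within-replicate variance
    between = variance(log_means)          # sample variance across replicates
    total = within + (1 + 1 / m) * between  # law of total variance, finite m
    half_width = z * math.sqrt(total)
    # Back-transform from the log scale to the scale of victim counts.
    return (
        math.exp(pooled_mean),
        math.exp(pooled_mean - half_width),
        math.exp(pooled_mean + half_width),
    )


# Hypothetical summaries from three replicates.
log_means = [math.log(x) for x in (950, 1000, 1050)]
log_variances = [0.010, 0.012, 0.011]
point, low, high = combine_replicates(log_means, log_variances)
print(round(point), round(low), round(high))
```

Because the interval is built on the log scale and then exponentiated, it is asymmetric around the point estimate on the original scale.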
16. Do we have the full universe of victims after deduplicating the records and imputing missing record information?
No. There may be victims who have not been documented in any database, for example because the victim or their relatives are afraid of retaliation and do not file a complaint. It is also possible that there is no record of a violent act because the fate of the victim is unknown or because the violence took place in a remote location that documentation groups could not access. This underreporting can be addressed using a statistical technique known as multiple systems estimation (also known as capture-recapture in some disciplines).
17. How do you find the number of victims of the armed conflict?
There is no single number of victims of the armed conflict. Since there may be victims who were never recorded, we cannot be certain of a single number. That is, there is uncertainty. Uncertainty in statistics is called “variance” and implies that the estimates have a range of possible values. The estimates of the number of victims are calculated using a method known as multiple systems estimation, which is a family of statistical models that have been used to study human and animal populations since the early 1780s. The idea behind this method is as follows: imagine two dark rooms. We want to know their sizes, but we can’t see inside them, and the only tool we have for exploring their sizes is a handful of rubber balls. The rubber balls have the special property that they don’t make any sound when they hit the walls, ceiling, or floor, but they do make a small noise (click) when they hit each other. We throw the rubber balls in the first room and hear many clicks. We then take the balls and throw them into the second room with the same force. Now we hear clicks, but less frequently. We conclude that the second room is larger because the rubber balls spread out more and therefore collided less frequently. In the vocabulary of our data, the size of the room is the size of the population of victims and each ball is a different data source. When two or more of the sources document the same victim, it is as if two of the balls collided and made a click. We then use these documentation patterns to estimate the size of the total population of victims of a specific event, even those that were never documented in our sources (underreporting).
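The intuition of the rubber balls can be made concrete with the simplest member of this family of models, the two-list Lincoln-Petersen estimator. It is shown here only as an illustration: it assumes exactly two independent lists and a population in which every victim is equally likely to be documented, and it is not the model used in the project, which is described in the technical appendix. The victim identifiers are invented.

```python
def lincoln_petersen(list_a, list_b):
    """Two-list estimate of total population size: |A| * |B| / |A and B|."""
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)  # victims documented by both lists (the "clicks")
    if overlap == 0:
        raise ValueError("the lists share no victims; the estimate is undefined")
    return len(a) * len(b) / overlap


# Hypothetical deduplicated victim identifiers in two sources.
list_a = range(0, 200)    # 200 victims documented by source A
list_b = range(150, 300)  # 150 victims documented by source B, 50 shared with A
estimate = lincoln_petersen(list_a, list_b)
print(estimate)
```

The two lists together document 300 distinct victims, but the small overlap of 50 suggests a total population of about 600, so roughly half of the victims were never documented by either source. A larger overlap, like more frequent clicks, would imply a smaller total.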
18. What certainty is there regarding the number of victims estimated in the armed conflict?
An estimate based on partial information has a degree of uncertainty (“variance” or “margin of error” in statistical terms). What does uncertainty mean in the context of statistics? It means that we aim to construct a range that includes the truth, rather than estimate a single number. People often mistrust ranges because they do not yield a single answer. However, this mistrust ignores the fact that if we calculated descriptive statistics using only the observed data, we would not learn about patterns of violence, but rather about patterns of documentation. There is no variance in that case, because we know with certainty what has been documented; however, there would be an immeasurable bias. The violence that was documented does not reflect reality, and there would be no way to calculate how different the true violence was from what was documented. Thanks to statistical estimation, it is possible to reduce the uncertainty in the number of undocumented victims from a number of unknown magnitude to a measurable and interpretable range. This range allows us to examine patterns of violence and has a fundamental characteristic: although all the values in the range are possible, the values close to the mean are more probable than those at the extremes, a characteristic known as “the shape or structure of the uncertainty”. For this reason, the mean values of the estimates are presented in the chapter ‘You will not kill’ in the Final Report.
19. How are the estimation results interpreted?
The estimates we present have an approximate 95% credible interval. In addition, the estimates have a characteristic known as “shape or structure of uncertainty”, meaning that values close to the average are more probable than those of the extremes. So while any of the values in the range are possible, the true value is most likely to be the mean.
20. Does the project answer all the questions about the victims of the armed conflict?
No. The project was limited to five types of victimization: forced disappearances, forced displacement, homicides, child recruitment, and kidnapping. In addition, there are a number of limitations. For example, we may analyze by sex, ethnicity, and age of the victim, but not by other characteristics.
21. If another model is used, can different results be obtained?
Yes. As in any research project, it was necessary to make a series of decisions. Each decision we made set aside alternative paths that could have led to different results. We explain other possible paths in the technical appendix, but we believe we made the best decisions possible based on the current state of the science and the knowledge of experts on the conflict.
22. Is it possible to distinguish between combatants and civilians?
Not with the data that was available to us for the project. Very few of the data sources had information about whether the victim was a civilian or a combatant. The only characteristics that were documented across data sources were the victims’ sex, ethnicity, and age. We can also do analyses based on the department and municipality where the violence occurred and the alleged perpetrator.
23. Did you use information from data sources such as the National Police or the National Institute of Legal Medicine and Forensic Sciences?
Yes. We included information from the Office of the Attorney General of the Nation, the National Institute of Legal Medicine and Forensic Sciences and the National Police, which are not limited to documenting victims of the armed conflict. We call this type of database “not specialized in the armed conflict.” For records from these databases, two additional variables are needed: “it belongs to the conflict” and “it is forced disappearance”.