Project Goal

The Human Rights Data Analysis Group (HRDAG), in partnership with the International Truth and Justice Project (ITJP), has spent the past five months collecting documented reports of violence during the war between the Government of Sri Lanka (GoSL) and the Liberation Tigers of Tamil Eelam (LTTE), which lasted from 1983 to 2009. By collecting lists from different sources and analyzing the patterns of overlap among them, we can estimate the number of deaths and disappearances that do not appear on any of the lists.

To date, HRDAG has collected 39,842 records of people known to be dead and those who continue to be disappeared, from 23 sources. The data sources include spreadsheets, news accounts, web archives, and narratives from witnesses collected by journalists, concerned citizens and medical personnel. They also include datasets compiled by non-government organizations (NGOs), news reports in Sri Lankan media, websites, medical institutions, and local hospitals. Additionally, HRDAG has partnered with the ITJP to collect statements from Tamils around the globe about deaths and disappearances of which they have knowledge to make this as complete a database of the victims as possible. This is an ongoing effort.

It is imperative to understand that these figures include duplicate reports of the same victim, within and across the multiple sources. Once all sources have been processed into the database, we will conduct record linkage in order to de-duplicate the data, and use multiple systems estimation to estimate the total numbers of victims, including those whose names are not recorded in any of our data sources. We share these summaries of the data we’ve collected to date in order to outline the type of work we are doing, and to encourage groups inside and outside of Sri Lanka to continue to collect the names of the dead and disappeared. Including records from data we have received but not yet processed, and results from our ongoing community enumeration project, we expect the total number of records to be in the hundreds of thousands by the time we begin record linkage. Keep in mind that these are numbers of records, not numbers of victims. Our prior work estimating conflict deaths has demonstrated that, even when working with large databases, a substantial portion of deaths remain undocumented.
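To illustrate why overlap between lists carries information about undocumented victims, here is a minimal sketch of the simplest two-list capture-recapture estimator (Lincoln-Petersen). It is not the model used in this project: multiple systems estimation works with more than two lists and accounts for dependence between them, while this sketch assumes two independent, already de-duplicated lists. The identifiers and the function name `two_list_estimate` are hypothetical.

```python
def two_list_estimate(list_a, list_b):
    """Lincoln-Petersen estimate from two de-duplicated lists.

    Returns (estimated total, estimated number on neither list).
    Assumes the lists are independent samples of the same population.
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    total = len(a) * len(b) / overlap
    documented = len(a | b)
    return total, total - documented


# Hypothetical victim identifiers after record linkage (not project data).
list_a = {"v01", "v02", "v03", "v04", "v05", "v06"}
list_b = {"v04", "v05", "v06", "v07"}
estimated_total, estimated_undocumented = two_list_estimate(list_a, list_b)
```

The smaller the overlap relative to the list sizes, the larger the estimated number of victims missing from both lists, which is why de-duplication has to come before estimation.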

What comes next

HRDAG will continue collecting, translating, coding, and cleaning additional data sources. Once all sources have been processed into the database, we will use record linkage to de-duplicate the records and multiple systems estimation to estimate the number of victims not reported in any of the data sets. Release of a list of named individuals, including a Tamil version, with victim photos where possible, is a long-range goal of this project. Eventually, the data can be used to create a digital archive of all the victims, where members of the Tamil community can learn what happened to their loved ones and, if they choose, send in a photo if one is not currently on file. The hope of the Tamil survivors is to create a Memorial Archive to remember the victims and demand accountability from the Sri Lankan government.

In the next sections, we summarize the data we have collected so far. These are not meant to be summaries of the conflict – the data is an incomplete record of the conflict, and some places, times, or victims may be better represented than others due to the nature of the sources we’ve been able to find. The summaries may be useful, though, to identify gaps in coverage.

Spatial and temporal distributions of records

The number of records we’ve collected varies by year, reflecting both the dynamics of the conflict and the coverage of the data that we’ve been able to collect:

Number of records collected, by year. These are raw counts from data that includes duplicates

The records are concentrated in the north and east parts of the country. The following maps show the density of records of deaths and disappearances in the data sources (still including duplicates) during the latter part of the war:

Distribution by province of records of deaths and disappearances from 2002-2009. Data is incomplete and includes duplicates

Victim characteristics

69% of the records where victim sex is recorded are of male victims. 12% of records do not have the sex recorded. Furthermore, 42% of records include the victim’s age or date of birth. Of these, 19% are records of victims under the age of 18.

Distribution of age and sex for records where neither variable is missing

Appendix

Data collection

Each type of document had its own issues that needed to be dealt with before it could be included in this effort. Some were books and paper documents that had to be scanned and then OCR'd; others were scraped from websites using Python code; yet others needed to be parsed out of legal testimonies and coded into spreadsheets; some online newspaper sources required manual review and capture into a spreadsheet, while other smaller sets were entered manually. Even if a document was already in electronic form or a spreadsheet, it required careful review and adjustment to get the data into a format that was consistent across the datasets. Each source requires a heavy dose of manual review and coding and is meticulously tracked through each step of the process so that each record in our database can be traced back to its original source, down to the page and line number.

Data standardization and cleaning

Translations

Several data sources were received in Tamil and needed to be translated before being added to the database. HRDAG used Google Translation software and a small team of native speakers to accomplish this. The larger datasets in Tamil were put through the Google Translation software, then reviewed by hand and corrected by a native speaker wherever the software failed to translate accurately. The smaller data sets were translated by hand by native speakers, who entered their translations directly into spreadsheets. All data are retained in their original versions and language for archival purposes.

Data organization

Since each dataset collected had its own fields, they needed to be categorized and aligned to a set of standard fields. Each source was pulled into a spreadsheet so that all files could be merged into one data set. Some data sets have extra fields which HRDAG may not use for matching purposes, but all fields were retained in the archived original version of the data.

`NAME`

There were various issues with victim name fields across the data sources. Some of the projects listed multiple victims from the same family or event date within a single cell of a spreadsheet or table. Each victim needs their own record in the database, so the records were pulled apart and given the corresponding date, place, and violation types associated with the original record.
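The splitting step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes names within a cell are separated by semicolons or line breaks (real sources may use other delimiters), and the example row and the helper name `split_victims` are hypothetical.

```python
import re


def split_victims(record, name_field="NAME"):
    """Return one record per victim named in a multi-name cell.

    Every other field (date, place, violation type) is copied unchanged
    from the original record to each new record.
    """
    parts = re.split(r"[;\n]", record[name_field])
    names = [p.strip() for p in parts if p.strip()]
    return [{**record, name_field: name} for name in names]


# Hypothetical source row listing two victims in one cell.
row = {
    "NAME": "Person A; Person B",
    "VIOL_YEAR": 2009,
    "VIOL_LOC": "Mullaitivu",
    "VIOL_TYPE": "death",
}
records = split_victims(row)
```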

In some cases, the name field included additional information which we parsed out. For example, a title such as Mr., Mrs., Dr., Major, or Lt. could be included with the name. In other cases, a relationship identifier such as wife, child, husband, sister, or daughter was included (especially where family groups of victims were listed in the same cell of the original source). There were also indications of occupation such as driver, farmer, or journalist. These extra details were removed from the name field and used to fill in other standard fields. For example, sex could be assigned if the name included Mr., Mrs., Miss, Ms., son, daughter, wife, etc. Title or role information in the name field, such as Dr., Major, farmer, or driver, was moved into the OCC (occupation) field.
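A minimal sketch of this parsing follows, assuming a simple token-by-token lookup. The hint tables contain only the terms mentioned above, the sex codes "M" and "F" are assumed values, and `parse_name` is a hypothetical helper. Retaining relationship details for later matching is omitted for brevity, and any real implementation would still need manual review for names that happen to contain one of these words.

```python
SEX_HINTS = {
    "mr": "M", "son": "M", "husband": "M",
    "mrs": "F", "miss": "F", "ms": "F",
    "daughter": "F", "wife": "F", "sister": "F",
}
OCC_HINTS = {"dr", "major", "lt", "driver", "farmer", "journalist"}


def parse_name(raw):
    """Split a raw name cell into NAME, SEX, and OCC values."""
    sex, occ, kept = None, None, []
    for token in raw.split():
        key = token.strip(".,()").lower()
        if key in SEX_HINTS:
            sex = SEX_HINTS[key]       # title or relationship implies sex
        elif key in OCC_HINTS:
            occ = key                  # title or role moves to OCC
        else:
            kept.append(token.strip(",()"))
    return {"NAME": " ".join(kept), "SEX": sex, "OCC": occ}
```

For example, `parse_name("Mrs. Jane Doe")` yields the name `Jane Doe` with sex `F`, and `parse_name("Doe, farmer")` yields the name `Doe` with occupation `farmer`.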

We coded records that did not include a given or family name, or at least a nom de guerre, as UNNAMED. A record referencing a victim only by a relationship to someone else is considered an UNNAMED person, but we retained that relationship detail in case it could later be matched to a named record that matched on the other key fields. Some name fields indicated UNNAMED groups or individuals. Groups were coded as UNNAMED and the count value for that group/individual was coded into the count field. These unknown groups cannot be linked or matched to other named records, but they can show the scale of an incident if no named records exist for a particular date and place.

`SEX`

As previously noted, in some cases, sex coding was inferred from information originally in other fields, such as Mr., Mrs., Miss, son, daughter, father, husband, wife, etc. Sex coding was also obtained from pictures of victims when included with original data sources. Finally, a native Tamil speaker was tasked with reviewing the names associated with records that didn’t include sex to determine sex for obviously male or female names. The common use of just a first initial, a single name, or only a nom de guerre made it impossible to code sex for every record. Going forward into the matching/record linkage process, sex will become known as more detailed records are linked and can fill in the missing fields.

`COUNT`

The count field was used in datasets where there were groups of UNNAMED victims. All records are presumed to have a count of one unless specifically coded in this field.

`OCC` (occupation)

This information came from the source material and, in some cases, was implied by the nature of the data source itself. Many lists of victims from the LTTE memorial websites were coded with LTTE as their occupation and included rank and responsibility.

`VIOL_DATE`

Multiple date formats within and across data sources required the dates be parsed out meticulously into separate VIOL_DAY, VIOL_MONTH, and VIOL_YEAR fields to avoid issues with day and month fields being swapped with each other. Having the dates parsed out like this will also assist in the matching process when records lack specific month or day precision or records are missing date values.
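As a minimal sketch of this kind of parsing, the function below splits a date string into separate day, month, and year components and leaves any missing component empty. The accepted formats are assumptions chosen for illustration, and `split_date` is a hypothetical helper. All-numeric dates with slashes are deliberately not guessed at, since day and month order cannot be determined from the string alone; they return no components so they can be reviewed against their source.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}


def split_date(text):
    """Return (day, month, year), with None for any missing component."""
    text = text.strip()
    m = re.fullmatch(r"(\d{4})-(\d{1,2})-(\d{1,2})", text)    # 2009-05-18
    if m:
        return int(m[3]), int(m[2]), int(m[1])
    m = re.fullmatch(r"(\d{1,2}) ([A-Za-z]+) (\d{4})", text)  # 18 May 2009
    if m and m[2].lower() in MONTHS:
        return int(m[1]), MONTHS[m[2].lower()], int(m[3])
    m = re.fullmatch(r"([A-Za-z]+) (\d{4})", text)            # May 2009
    if m and m[1].lower() in MONTHS:
        return None, MONTHS[m[1].lower()], int(m[2])
    if re.fullmatch(r"\d{4}", text):                          # 2009
        return None, None, int(text)
    return None, None, None  # unrecognized or ambiguous: review manually
```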

`AGE`

Where datasets contained date of birth (DOB) information along with a violation date, but no age field, we calculated age at time of the violation.
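The age calculation can be sketched as below, assuming both the date of birth and the violation date are complete (day, month, and year all known). The function name `age_at_violation` is hypothetical, and records with partial dates would need separate handling.

```python
from datetime import date


def age_at_violation(dob, viol_date):
    """Age in completed years on the violation date."""
    years = viol_date.year - dob.year
    # Subtract one if the birthday had not yet occurred that year.
    if (viol_date.month, viol_date.day) < (dob.month, dob.day):
        years -= 1
    return years
```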

`VIOL_LOC`

Information about where a violation occurred was problematic because each data source coded places differently. Some of the location information was gleaned from the source and title of the original material, since some sources were lists for a specific time and place. In some cases, location information was divided into a specific place (such as a road intersection, at home, or at work) and an administrative location such as a village, city, district, or province. Since much of the data did not include the lower-level administrative information, the focus was to get at least district and province level information for each location. This required mapping the lower-level administrative place information to the correct district and province. Complicating the mapping to district and province were the many spelling variations of locations.

HRDAG created a standard place-name mapping list from Fallingrain, a website listing place names around the world which HRDAG has used in the past. The information from Fallingrain included latitude and longitude coordinates for each place, which we used to create maps of the events.
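To show how spelling variants might be mapped to a standard list, here is a minimal sketch using approximate string matching from Python's standard library. The use of fuzzy matching is an assumption for illustration; the report does not say which matching method was applied. The three-entry `GAZETTEER` is a stand-in for the full place list, and `lookup_place` and the 0.8 cutoff are hypothetical choices.

```python
import difflib

# Illustrative stand-in for the standard place list:
# normalized place name -> (district, province)
GAZETTEER = {
    "kilinochchi": ("Kilinochchi", "Northern"),
    "mullaitivu": ("Mullaitivu", "Northern"),
    "batticaloa": ("Batticaloa", "Eastern"),
}


def lookup_place(raw, cutoff=0.8):
    """Return (district, province) for a place name, or None if unmatched."""
    key = " ".join(raw.lower().split())  # normalize case and whitespace
    if key in GAZETTEER:
        return GAZETTEER[key]
    close = difflib.get_close_matches(key, list(GAZETTEER), n=1, cutoff=cutoff)
    return GAZETTEER[close[0]] if close else None
```

Names that return `None` correspond to the unmatched locations that still need to be looked up by hand.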

After mapping locations in the database to the list of places from Fallingrain, we manually looked up over 2,600 unique unmatched violation locations to determine district and province. We used Google search to locate district and province for many of these records. Additionally, we employed AccuWeather and Wikipedia to determine district and province. AccuWeather had many of the smaller villages in their database, and utilizing the Zoom feature on the weather maps, staff were able to code to at least the province level, and by looking up nearby towns, could often determine the district as well.

Approximately 530 place names remained unmapped after using Google Search, Wikipedia, and AccuWeather. These unmapped place names affect 1,176 individual records. Additionally, there are 2,764 records with no location listed. In the past week, HRDAG has been provided with place lists created by Sri Lankan NGOs who had experienced similar issues. These new lists will be reviewed and implemented as HRDAG continues to locate and document new sources and testimonies.

Locations that were indicated to be somewhere in the sea or ocean were all coded as SEA. There were also records with place names that included coastal cities in southern India. Those will need to be investigated to see whether that is where someone died after having been injured in Sri Lanka while fleeing the violence. The VIOL_LOC for the Indian places is coded as INDIA. There were a handful of records with a location outside of the region completely, such as one from Germany – we will review these manually.

`VIOL_TYPE` (Violation type)

Initially, HRDAG focused on records detailing death/killings and disappearances (including missing, arrested, and detained). As the data collection process progressed, the nature of the conflict became clearer, and it was deemed appropriate to include other violation types noted in the data sources. Violation coding was initially taken from the original source material. Later, we combined several violation types into like groups: “killing” was combined with “death”, “missing” was combined with “disappeared”, and “detained” was included with “arrested”. Rape and torture were included in the “injured” violation type unless those violations resulted in death, in which case they were coded as “death” with the original violation type noted in the VIOL_CAUSE field. The relocations of people from hospitals to other locations (IDP camps or so-called no fire zones) were coded as injured, as it appeared they had been in the hospital due to injury before being moved.
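The grouping rules above can be sketched as a lookup table plus one special case. This is an illustration under assumptions: the exact label strings, the `resulted_in_death` flag, and the helper name `code_violation` are hypothetical, and source types outside the table are returned as `None` so they can be reviewed by hand.

```python
GROUPS = {
    "killing": "death", "death": "death",
    "missing": "disappeared", "disappeared": "disappeared",
    "detained": "arrested", "arrested": "arrested",
    "injured": "injured",
}


def code_violation(source_type, resulted_in_death=False):
    """Map a source violation type to VIOL_TYPE and VIOL_CAUSE values."""
    original = source_type.strip().lower()
    if original in ("rape", "torture"):
        # Coded as injured unless fatal; the original coding is retained.
        viol_type = "death" if resulted_in_death else "injured"
        return {"VIOL_TYPE": viol_type, "VIOL_CAUSE": original}
    return {"VIOL_TYPE": GROUPS.get(original), "VIOL_CAUSE": None}
```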

`VIOL_CAUSE`

This field includes additional details about the violation, including the original codings of rape and torture and the party responsible for the violation or the means used (SLA, SLN, kfir, bombing, artillery, shot, etc.). While this information does not have a set value list, it was deemed important to retain for possible use later in the analysis and archival documentation phase of the project.

Audits

HRDAG conducted multiple audits on the various fields to ensure the integrity of the data, and manually reviewed and, when necessary, corrected records after comparing the record to the original source. These audits included:

viol_day: is not greater than 31
viol_month: is not greater than 12
viol_year: falls within the range of the conflict
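The three checks listed above can be sketched as a single function that returns the names of the fields failing an audit. The year range of 1983 through 2009 comes from the conflict dates stated in this report. The lower bounds of 1 for day and month are an added assumption, missing values are skipped rather than flagged, and `audit_record` is a hypothetical name.

```python
CONFLICT_YEARS = range(1983, 2010)  # 1983 through 2009 inclusive


def audit_record(rec):
    """Return the names of date fields that fail their range check."""
    problems = []
    day = rec.get("viol_day")
    month = rec.get("viol_month")
    year = rec.get("viol_year")
    if day is not None and not 1 <= day <= 31:
        problems.append("viol_day")
    if month is not None and not 1 <= month <= 12:
        problems.append("viol_month")
    if year is not None and year not in CONFLICT_YEARS:
        problems.append("viol_year")
    return problems
```

Records with a non-empty result would then be compared against their original source and corrected by hand.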

Records missing fields

Once matching/record linkage is completed, many of the records missing key fields such as VIOL_LOC, VIOL_TYPE, SEX, or AGE may be linked to other records for the same victim that include those missing pieces.

Progress report on enumerating the deaths in the Sri Lankan Civil War

Michelle Dukich

Tarak Shah

15 May 2019
