Training with HRDAG: Rules for Organizing Data and More

Andre Stephens in San Francisco, summer 2016

I had the pleasure of working with Patrick Ball at the HRDAG office in San Francisco for a week during summer 2016. I knew Patrick from two workshops he previously hosted at the University of Washington’s Centre for Human Rights (UWCHR). The workshops were indispensable to us at UWCHR as we worked to publish a number of datasets on human rights violations during the El Salvador Civil War. The training was all the more helpful because the HRDAG team was so familiar with the data. As part of an impressive career which took him from Ethiopia and Kosovo to Haiti and El Salvador, among others, Patrick himself had worked on gathering and analysing many of the datasets we had. In addition, we were fortunate to benefit from the extensive work done by our HRDAG colleague, Amelia Hoover Green, to address some of the more stubborn data processing issues.

My visit to HRDAG turned into one of the more fruitful experiences of my academic training. Patrick imparted a great deal in only a few days. He would often cap off his tutorials with kernels of wisdom like “Don’t run a lawn mower over your foot if you don’t need to.” And while I’d managed up until then to avoid such issues in the world of gardening, those were precisely the kinds of painful gaffes I was making in the world of programming. (Did I really want to try to force those date strings into a datetime format that could not reliably store partial dates?)
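To make the partial-date point concrete, here is a minimal Python sketch of one way to avoid that particular lawn mower. It is not the approach HRDAG or UWCHR actually used. The `PartialDate` class, the `parse_partial_date` helper, and the input convention (a `YYYY-MM-DD` string in which `00` or `??` marks an unknown part) are all assumptions made up for illustration. The idea is simply to keep year, month and day as separate optional fields instead of coercing every string into a full datetime.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """Hypothetical container: any component may be unknown (None)."""
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None

    def precision(self) -> str:
        """Report the finest component that is known, reading left to right."""
        if self.year is None:
            return "unknown"
        if self.month is None:
            return "year"
        if self.day is None:
            return "month"
        return "day"


# Assumed input convention: "YYYY-MM-DD", with "00" or "??" for unknown parts.
_PATTERN = re.compile(r"^(\d{4}|\?{4})-(\d{2}|\?{2})-(\d{2}|\?{2})$")


def parse_partial_date(text: str) -> PartialDate:
    """Parse a date string without inventing values for missing parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised date string: {text!r}")
    year, month, day = (
        None if part.startswith("?") or int(part) == 0 else int(part)
        for part in match.groups()
    )
    return PartialDate(year, month, day)


if __name__ == "__main__":
    for raw in ["1981-05-17", "1981-05-00", "1981-??-??"]:
        parsed = parse_partial_date(raw)
        print(raw, parsed, parsed.precision())
```

A record whose day is unknown stays a month-precision date rather than silently becoming the first of the month, which a plain datetime column would force. The sketch does not check that months and days fall in valid ranges; a real pipeline would need that as well.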

But perhaps what was most valuable about my visit was the entirely new way in which it challenged me to think about the workflow of a data science project. Patrick and the HRDAG team have developed a careful system of rules for organising just about every task that needs to be accomplished. The many advantages of this kind of organisation were apparent. Tasks already implemented upstream would not fail because of revisions downstream, or vice-versa. By maintaining a simple set of rules, a data scientist who was completely new to the project could readily grasp each operation, its dependencies, and how it fits into the wider workflow.

Self-documentation and reproducibility were built into the project at each step. This system had tremendous payoff in providing an efficient data management structure for our work at UWCHR, and would doubtlessly benefit any data science project. In working with Patrick and reviewing some of his publications, I discovered his data science skills, which were honed by analytic principles and experience working with rights violations data in diverse contexts and under all kinds of real-world constraints. I benefitted immensely from the training he offered.

With the framework for the project workflow in place, we needed to turn to the gritty work of cleaning, standardising and combining the dozens of datasets we had. Here again, Patrick and Amelia, along with my UWCHR colleague Phil Neff, provided crucial input. We confronted a number of issues in fixing the data. Having been gathered under difficult conditions, with limited resources and using technology from over two-and-a-half decades ago, the datasets needed considerable preparation before they would be suited for publication and study. A preliminary task was to get the data into less archaic file formats and to begin cleaning and normalising tables after determining their content and relational structure. We also needed to give considerable thought to the integrity of the data. How valid and consistent were they? Is a human rights violation, for instance, recorded as an incident of victimisation or as a larger attack that might contain a number of incidents? Do these measurement definitions vary across data sources?

Another set of challenges related to how to best represent the relational structure of data points and to determine the unit of observation for combined databases. A record might describe a unique victim, but this would collapse any number of violations in the same record if a victim suffered more than one violation. Likewise, a single violation record might contain several victims. Still, we could add another dimension to the problem: each victim and violation might be linked to any number of perpetrators and vice versa. Finally, we needed to consider important ethical questions. While some of the datasets we were using were already in the public domain, others were not. Did we want, for example, to publish records that identify hundreds of victims of sexual violence? While public accountability remains critical to combatting impunity, we had no way of knowing how this information would affect victims in El Salvador.

Before joining this project, I was familiar with the legacy of state violence in El Salvador and with the important work groups like UWCHR and HRDAG do in fighting for accountability and against impunity. I saw parallels in my own work, as a Sociology grad student, on the 2010 Tivoli Massacre which claimed the lives of 73 persons in my native Jamaica. While I was a research assistant with UWCHR, I did archival work on the May 1981 Rio Lempa massacre on the El Salvador-Honduras border. I also closely followed the work of other members of UWCHR’s ‘Unfinished Sentences’ Project as they sued the CIA for the release of documents that could help identify perpetrators; uncovered new information about past violations; and collaborated with several survivors, activists, jurists and academics on the frontlines of the struggle for human rights and justice.

I felt deeply honoured when I received the Ben E. Linder Fellowship, which supported my work on the datasets and my visit to San Francisco. Ben Linder, a University of Washington alum, moved to Nicaragua in 1983 after earning his degree in mechanical engineering. There, he worked on projects to electrify parts of the country. Tragically, he was ambushed and killed by the Contras in 1987 while on one of these projects. Ben’s example reminds us of the commitments and sacrifices that so many make in struggles against injustice, and of the courage and selflessness with which they do so.

Forgoing more lucrative opportunities in industry, Ben put his technical skills toward improving the welfare of others. For this reason, I believe that Ben would have been encouraged by the collaboration between UWCHR and HRDAG. Using the tools of data science, we were able to offer new and more systematic insight into crimes perpetrated during the El Salvador Civil War. Of course, raw numbers never quite reflect the horrors that took place. They do, however, help to tell a part of the story. These datasets are important because they allow us to see patterns that might identify perpetrators yet to face justice. They also memorialise victims by telling the truth of their experience and helping loved ones piece together what happened.