In Pursuit of Excellent Data Processing
When I started working at the Invisible Institute, the issue of data processing organization quickly came to the forefront. I was tasked with creating the data backbone of the Citizens Police Data Project, while ensuring it would be scalable, transparent, and easy to maintain. Before I began this project, I did not have a strict framework—but then serendipitously I met Patrick Ball and subsequently learned about Principled Data Processing. After many emails and a few phone conversations, he invited me out to San Francisco to spend a week doing code review and training at HRDAG. During my visit, not only did I get to know the wonderful people at HRDAG and acquire a healthy addiction to sparkling water with lemon, but I also gained a significantly better understanding of technical methods in data processing, how to be a better coder in general, and how to empirically test the assumptions I make while processing data.
On my first day, we laid out my goals for the week, and we made sure each one had been addressed by the end of my trip. We began by going over the method I had developed for iterative pairwise matching of individual officers who appear in distinct datasets with some overlapping information. Patrick offered a more rigorous and cogent way of describing the method; I did not have such a succinct description upon arrival. He also had me identify the assumptions I was implicitly making, and he demonstrated ways to test them empirically, since those assumptions determined which method I chose. This exercise, in particular, changed how I approach cleaning opaque administrative datasets: I committed to identifying my implicit assumptions and to thinking more analytically about problems with ambiguous solutions.
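To make the idea concrete, here is a minimal sketch of what testing a matching assumption empirically might look like. It is not the code reviewed at HRDAG; the function names (`duplicate_keys`, `iterative_match`), the field names, and the toy records are all hypothetical. The assumption under test is that a given set of fields identifies at most one officer within a dataset, and the matching loop is one possible reading of "iterative pairwise matching": try the strictest set of fields first, accept only unambiguous matches, and then retry the leftovers on a looser set.

```python
from collections import Counter


def duplicate_keys(records, fields):
    """Return each key (tuple of field values) held by more than one record.

    An empty result supports the assumption that `fields` identifies
    at most one record in this dataset.
    """
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {k: n for k, n in counts.items() if n > 1}


def iterative_match(left, right, key_sets):
    """Match records pass by pass, strictest key set first.

    In each pass a pair is accepted only if its key is unique on both
    sides; matched records are removed before the next pass.
    Caveat: missing values (None) compare equal to each other here,
    so real data would need explicit handling of missing fields.
    """
    matches = []
    left, right = list(left), list(right)
    for fields in key_sets:
        def key(rec):
            return tuple(rec[f] for f in fields)

        left_dupes = duplicate_keys(left, fields)
        right_dupes = duplicate_keys(right, fields)
        # Index only the right-hand records whose key is unambiguous.
        right_by_key = {key(r): r for r in right if key(r) not in right_dupes}

        still_left = []
        for rec in left:
            k = key(rec)
            if k not in left_dupes and k in right_by_key:
                matches.append((rec, right_by_key.pop(k), tuple(fields)))
            else:
                still_left.append(rec)
        left = still_left
        # Keep ambiguous records and unique records that found no partner.
        right = [r for r in right
                 if key(r) in right_dupes or key(r) in right_by_key]
    return matches


# Hypothetical toy data, invented purely for illustration.
roster = [
    {"id": "A1", "first": "ANN", "last": "LEE", "birth_year": 1970},
    {"id": "A2", "first": "ANN", "last": "LEE", "birth_year": 1982},
    {"id": "A3", "first": "BOB", "last": "RAY", "birth_year": 1965},
]
complaints = [
    {"id": "B1", "first": "ANN", "last": "LEE", "birth_year": 1970},
    {"id": "B2", "first": "BOB", "last": "RAY", "birth_year": None},
]

if __name__ == "__main__":
    # Is (first, last) alone a safe key for the roster? Inspect before matching.
    print(duplicate_keys(roster, ("first", "last")))
    passes = [("first", "last", "birth_year"), ("first", "last")]
    for l, r, fields in iterative_match(roster, complaints, passes):
        print(l["id"], r["id"], fields)
```

In this toy roster, two officers share a first and last name, so the check flags name alone as an unsafe key. The matching loop therefore leaves the second Ann Lee unmatched rather than guessing.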
Another takeaway from this trip was that tests are better than comments. Patrick’s assertion that documentation is almost never maintained became painfully obvious when I found multiple comments that were at odds with my own code. Rather than arduously maintaining line-by-line comments that have little bearing on actual functionality, Patrick demonstrated a pragmatic approach: tests embedded in docstrings for simple functions that need little setup, and test_ files for functions that require setup and teardown. Writing tests not only shows readers the precise functionality of the code, but it also allows me to ensure that my functions will not unexpectedly return undesirable results if I extend their application beyond my original intent.
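As an illustration of tests embedded in docstrings, here is a small sketch that uses Python's standard `doctest` module. The function `clean_name` and its behavior are hypothetical, chosen only because name cleaning is a typical small step in record processing; the post does not say which functions were actually tested this way.

```python
import doctest


def clean_name(raw):
    """Collapse repeated whitespace and uppercase a name string.

    The examples below are both documentation and tests: doctest runs
    each `>>>` line and compares the result with the line that follows.

    >>> clean_name("  o'brien,  JOHN ")
    "O'BRIEN, JOHN"
    >>> clean_name("")
    ''
    """
    return " ".join(raw.split()).upper()


if __name__ == "__main__":
    # Runs every docstring example in this module and reports failures.
    doctest.testmod()
```

Unlike a comment, the docstring example fails loudly if the function's behavior drifts away from what the example claims. For functions that need fixtures or temporary files, the same checks would move into a separate test_ file run by a test runner.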
Beyond the code review and training with Patrick, I had the pleasure of interacting with other HRDAG staff members, and I got to meet Patrick’s dog, who took little interest in me, which, due to our similarity in size and weight, was probably for the best. I spent time upping my Vim game and getting an up-close look at their project construction with Agrima. With Kristian, I had the chance to talk with someone who has worked extensively on criminal justice data, and she offered ample advice on how to detect underreporting in administrative datasets and tips on working with dynamic social networks. Executive director Megan Price walked me through their method of matching records in their Syria project. In my many conversations with Suzanne, I learned the most scenic routes back to my hotel, how to spend my downtime in SF, some delectable vegetarian recipes, and the crucial difference between a regular and a super burrito. With their precocious summer intern, Gus, I learned how to set up snippets for easy file creation and YouCompleteMe with Neovim.
While my trip was too short (four days of Mission burritos and Hetch Hetchy water is never enough), I learned (and relearned) a significant amount about data processing and best practices, and I picked up a few nifty tools/Vim plugins along the way. But perhaps most unexpected was the admiration I gained for the people at HRDAG and the work they do. It was inspiring to meet a group of intelligent people dedicated to rigorous data analysis in pursuit of a more just world.
Editor’s note: At HRDAG we feel that some of our most important work involves training the next generation of data scientists working in the human rights field. To read more about what our interns learn while at HRDAG, check out these posts by Phil Neff (2016), Andre Stephens (2016) and Gus Brocchini (2017).