Learning a Modular, Auditable and Reproducible Workflow

When I first learned about Dr. Patrick Ball’s work contributing statistical evidence to the conviction of Slobodan Milosevic, I had been taking statistics courses in my graduate program at the Rollins School of Public Health at Emory University. I was immediately inspired. I mentioned Patrick’s work and that he had co-founded HRDAG with Dr. Megan Price to a mentor at Emory and found out how small the world is. My own mentor had also been one of Megan’s. I have had the pleasure of chatting with both Megan and Patrick over the last year and I was thrilled when Patrick asked if I would assist on an HRDAG project.

Having formally learned SAS for only a couple of years and dabbled in Python and R for a handful of months, I initially found HRDAG’s principled data processing workflow intimidating. I admit my R code looked very much like the amateur examples in Tarak’s recent post.

With the weight of what our answer could mean for our client’s activism in mind, and with a great deal of Patrick’s guidance, I shifted toward working mainly in the command line, learning Vim, and thinking of scripts in terms of a larger project architecture. For this project I used RStudio to prototype new ideas and Microsoft Visual Studio Code with a Vim extension to make code production-ready and push it to the project’s GitHub repository. As I worked, Patrick also encouraged me to keep notes on my process. The modular nature of the workflow and our use of Git allowed us to work on different parts of the project from across the country without worrying about out-of-date bits of code from an email chain. The following post summarizes those notes and the work that went into answering our client’s question.

At the risk of falling into the caveat fallacy, any conclusions we came to as a result of our analyses are contingent on these data being true and come with the caveat that we don’t know what we don’t know.

We began the project with a task. The file structure of each task is a microcosm of the entire project: inputs enter the input folder, are acted upon by the code in the src folder, and are written to the output folder. In the same way, projects start with inputs, which are acted upon by the tasks in the workflow and emerge as the overall results. Breaking work into tasks facilitates the team’s collaboration across time zones and programming languages: one task could be performed in Python and another in R, and neither analyst would even have to know. The task structure also makes projects auditable and reproducible. Anyone looking at the code can determine exactly how the output of a given task was produced, and anyone working on a task at any time can pick it up and recreate the same output. This project included the following tasks: import/, clean/, test/, and write/. Each has its own Makefile, an input folder for files coming into the task, an output folder for files produced by the task, and a src folder for the source code that does the work.
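The task layout described above can be sketched with a few shell commands. This is a minimal illustration, not HRDAG’s actual setup script; the task names come from this project, and everything else is an assumption:

```shell
# Sketch of the task structure: each task is a self-contained unit
# with input/ (files coming in), src/ (code), and output/ (results).
for task in import clean test write; do
  mkdir -p "$task/input" "$task/src" "$task/output"
  touch "$task/Makefile"   # every task gets its own Makefile
done
```

Because each task reads only from its own input folder and writes only to its own output folder, the output of one task can become the input of the next without any task needing to know how the others work internally.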

To make the goals of each task clear, each task has a Makefile. The Makefile serves as an outline of the task, not only for our benefit but mainly for the computer to read and execute. Each task can be run from a command-line terminal simply by calling “make” from within the task’s directory. Imagine being able to work on a group project with team members around the world and still have an auditable trail of each person’s work; the Makefile acts as clear and traceable scaffolding. For more information on Makefiles, check out Mike Bostock’s post and Jenny Bryan’s post. Writing and using Makefiles for the first time was probably one of the most challenging aspects of learning this workflow, but it was extremely rewarding to see the whole project run with (practically) one click.
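For the curious, here is a minimal sketch of what one task’s Makefile might look like. The file names and the choice of R are hypothetical, not taken from the project itself:

```make
# Hypothetical Makefile for a clean/ task; file names are illustrative.
.PHONY: all clean

all: output/cleaned.csv

# The output depends on the script and the input data:
# if either changes, "make" re-runs the step.
output/cleaned.csv: src/clean.R input/raw.csv
	cd src && Rscript clean.R

clean:
	rm -f output/*
```

The dependency line is what makes the trail auditable: the Makefile records exactly which inputs and which code produced each output, and make only re-runs a step when something it depends on has changed.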

Read the full post here: From Scripts to Projects: Learning a Modular, Auditable, and Reproducible Workflow

Acknowledgements

Collaborating with HRDAG, specifically Dr. Patrick Ball, on this project has been an incredible learning opportunity. Throughout this project I began learning how to look at data analysis from the very macro (“How many tasks do I have?”) to the very micro (“Can I pipe this line?”) levels. I began to learn how to use the distance inherent in the job of looking at data points, and the modularity of the workflow, to help me stay focused. This has resulted in a fundamental shift in the way I work, and I will continue to learn and improve in using this process as I move forward in my career. As a statistician I am used to having a comfortable amount of psychological distance from the people my work is impacting. Rather than seeing this distance as being in opposition to the ability to do strong work, I now see it as necessary to doing the best work technically possible. Working on this project was like a coding bootcamp and professional trial by fire in the best way possible. I learned more than I ever imagined about both the technical and personal aspects of statistics in human rights work, and my dearest hope is that the conclusions we came to were useful in supporting our client’s work on the ground.

The project architecture discussed in this post, with discrete tasks using Makefiles, was developed by Drs. Scott Weikart and Jeff Klingner and formalized around 2007. I am incredibly grateful for Tarak Shah’s motivating post on the dangers of the .Rproj style of project management, for Dr. Amelia Hoover Green’s detailed post on what it means to clean and canonicalize data, for Dr. Megan Price’s encouragement to work in this field, and for Dr. Patrick Ball’s invitation to collaborate, his patient explanations, and his detailed feedback throughout this project.