Note: some of the advice in this post echoes that of Jenny Bryan in this excellent article. One difference is that we treat each discrete task as a self-contained piece, whereas she treats the entire project as one unit. Another distinction is that for us, a given project usually isn’t all written in R. Members of our team prefer to code in different languages. This post is written from the perspective of a (primarily) R user on a multilingual team.
At HRDAG, we organize projects as collections of self-contained tasks to be performed in order, together constituting a data analysis pipeline. For instance, a project structure might look like this:
.
├── import
│   ├── Makefile
│   ├── input
│   ├── output
│   └── src
├── clean
│   ├── Makefile
│   ├── input
│   ├── output
│   └── src
├── model
│   ├── Makefile
│   ├── input
│   ├── output
│   └── src
└── write
    ├── Makefile
    ├── input
    ├── output
    └── src
Each task is its own little mini-project, with information about dependencies and instructions for producing output recorded in the Makefile. Project structures may become more complicated as tasks come to depend on multiple upstream tasks as inputs, or are broken up into discrete sub-tasks, and so on.
This project style, largely developed by Drs. Scott Weikart and Jeff Klingner around 2007, is not only self-documenting and easy to maintain, but also facilitates collaboration for us. We are a multilingual shop, with much of our code pretty evenly split between R and Python, along with a dollop of bash and smatterings of Julia, SQL, C, markdown languages, and various others. With tasks split this way, someone developing a downstream task can update the upstream tasks without even having to know what language they are written in, just by running make. For more on our project structure, see Patrick Ball’s blogpost The task is a quantum of workflow.
For more on the usefulness of Makefiles in data analysis projects, see this post from Mike Bostock.
Two of our highest priorities when organizing project code are for the code to be reproducible and auditable.
We should start by defining our terms
The definitions of reproducibility and auditability will vary depending on context. For us, at a minimum, we’d like to make sure that the project can be run on our in-house Linux server via its Makefiles – e.g. cd import && make, etc. Given that we often develop code in an interactive way via a REPL or a Jupyter notebook, this post will focus on some of the ways that code developed interactively, and meant as a prototype, falls short of being fully reproducible and auditable.
For HRDAG, production does not mean that code will be rolled out to millions of users. Rather, production code is code that we can stand behind, having assured ourselves through unit tests and several rounds of editing that results and conclusions are rigorously defensible.
As stated above, a concrete definition for “production” for us can be whether the code can be run on our in-house Linux server via its Makefiles. If that represents the letter of the law, then the spirit can be captured by the concept of empathy for the person tasked with reproducing or auditing the code. This empathy is valuable to cultivate because, if for no other reason, the person tasked with reproducing or auditing your code is often your future self.
Whenever code reads in some data from disk, or writes anything out to disk, we face a reproducibility-related challenge:
1. Code that relies on a relative filename and works in an interactive session, such as read.csv("mydata.csv"), depends on what the working directory is at the time it is run. That working directory may be set automatically based on how the R session was initiated, meaning that you can run the exact same code on the exact same machine and sometimes it will produce the correct output and sometimes it won’t, with the only difference being where you started your R session – which fails any reasonable definition of reproducibility. Alternatively, the working directory might be set explicitly using setwd, which is not only inelegant, but also runs into problem (2).
2. Code that relies on an absolute filename, such as read.csv("/Users/tshah/projects/myproject/import/input/mydata.csv"), depends on where the project happens to be on my machine. If I share the project with anyone else, the code that works for me will result in a “No such file or directory” error for everyone else, again failing most definitions of reproducibility.
Many RStudio users have taken to the .Rproj solution – share the project including the .Rproj file, and then as long as anyone wanting to run the script does so by opening the .Rproj file (which starts RStudio and sets the working directory to the location of the .Rproj file), everything should work. In our case, since each task is its own self-contained project, perhaps we can extend that practice, and have an RStudio project replete with .Rproj file for every task?
It is easy to forget, as an RStudio user, that not all of your collaborators use RStudio, or even code in R. The .Rproj solution is specific to RStudio users. Furthermore, given that our own definition of reproducibility included being able to run the task via make, we haven’t met our standard for reproducibility. In fact, this solution makes RStudio a project dependency. That is a high cost to pay, especially since, as mentioned above, the person trying to execute the task may not even be an R user. With the here package, it might seem like we can have all the benefits of .Rproj without the downside of requiring interactive execution and/or RStudio. But we’d like to have a solution that works regardless of programming language, and it feels like a bit much to ask a Python developer to include calls to R from inside their code just to get the paths right. That’s one reason I’ve been using git rev-parse --show-toplevel. For example, here is a recent Makefile of mine. The goal of this task is to take some simulated “ground truth” data generated in an earlier task, and sample from it multiple times – mimicking the process of different groups on the ground collecting evidence of human rights abuses:
# Authors: TS
# Maintainers: TS
# Copyright: 2019, HRDAG, GPL v2 or later
# =========================================
# squareland/generate-reports/Makefile
HERE := $(shell git rev-parse --show-toplevel)
TRUTH := $(HERE)/generate-ground-truth/output/truth.feather
.PHONY: all clean

all: output/reports.feather

clean:
	-rm -r output/*

output/reports.feather: \
		src/observe.R \
		$(TRUTH)
	Rscript --vanilla $< \
		--input=$(TRUTH) \
		--seed=19481210 \
		--nsource=4 \
		--reports=$@

# done.
I’m able to effectively specify an absolute path for the input file without tying my script to my own personal directory structure.
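The details of src/observe.R don’t matter here, but to make the wiring concrete, here is a minimal sketch of how such a script might pick up those flags. The flag names come from the Makefile above; the choice of optparse for argument parsing and arrow for reading and writing feather files is just one reasonable option, not necessarily what the real script uses:
library(optparse)
library(arrow)

# flags passed in from the Makefile
option_list <- list(
    make_option("--input", type = "character"),
    make_option("--seed", type = "integer"),
    make_option("--nsource", type = "integer"),
    make_option("--reports", type = "character")
)
args <- parse_args(OptionParser(option_list = option_list))

set.seed(args$seed)
truth <- read_feather(args$input)

# (the real sampling logic goes here; as a placeholder, pass the data through)
reports <- truth

write_feather(reports, args$reports)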
Besides the language thing, there are some other differences between here::here() and git rev-parse --show-toplevel that are worth knowing about. here::here() is more flexible: git rev-parse requires the project to be a git repository, while here::here() can determine a project root for git repositories as well as RStudio projects. Furthermore, if you add an empty file called .here to the root of your project, here::here() will use that to determine the project root, without needing git or RStudio. Given that we always manage our code in git repositories, the git rev-parse solution felt like the best fit. The solution you choose will depend on your own context.
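To make the comparison concrete, here is a minimal sketch of the two approaches side by side in an R script (the path import/input/mydata.csv is just a placeholder):
# option 1: here::here() walks up from the working directory until it finds
# a project root (a .here file, an .Rproj file, a .git directory, ...)
library(here)
mydata <- read.csv(here("import", "input", "mydata.csv"))

# option 2: ask git for the repository root; the same one-liner works from
# R, Python, bash, or a Makefile, as long as the project is a git repository
root <- system("git rev-parse --show-toplevel", intern = TRUE)
mydata <- read.csv(file.path(root, "import", "input", "mydata.csv"))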
Save workspace image? [y/n/c]:
Working on a given task may take days or weeks. Especially if some step in a task takes a long time – such as reading in a large text file, or performing some expensive string calculations on every data element – it’s tempting to save your workspace as you go, so that every time you return to work on the project, you’re able to start up where you last left off, without having to re-run some time-consuming code.
But an R workspace, saved in the course of an interactive session, may include a number of objects for which no code has been saved. When I’m working with an editor and an interactive R console, I often drop into the console to directly experiment with some code that I’m not sure will work. If that experimentation results in an object being saved in my workspace, and I continue saving and re-loading my workspace to continue working on a task, I may end up relying on an object that, if the script were run from a fresh R session, would not exist.
This sort of hidden state makes work unreproducible and can be the source of serious bugs that are difficult to track down and fix. It affects not just REPL-based developers, but also Jupyter users – in the linked example, they describe a developer running code from a later cell that updates a variable’s value, then hopping back up to run code in an earlier cell. The results of that earlier cell now depend on code that does not appear until further down.
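As a hypothetical illustration (the recode object and the column names are invented for the example): suppose that while experimenting I define a small helper directly in the console, then write script code that uses it, and keep saving and re-loading my workspace:
# typed directly into the console while experimenting -- never saved to the script
recode <- c(a = "Group A", b = "Group B")

# later added to the script, which now silently depends on `recode`
mydata$group <- recode[mydata$code]
The script keeps working as long as the saved workspace is re-loaded, but run from a fresh session it fails with "object 'recode' not found" – or, worse, quietly uses a stale version of recode with different values.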
Rather than saving your workspace, start fresh R sessions. When testing that code works, run it via the Makefile (I’ll usually do a round of make clean, which removes any output from the task, and then make, effectively re-building the task from scratch). If some piece of the script takes a long time to run, which is slowing down your ability to develop, spin that piece into its own task. For instance, say your script starts by reading in a large text file and you are tempted to save your workspace in order to avoid waiting for the file to load every time you update your code. Create a new task that reads in the text file, gets all of the column types right, and then writes the output in a binary format that is fast to read/write. The next task can start by reading in the binary file.
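A minimal sketch of what that new task’s script might look like – the file names and column specification are placeholders, and feather is just one convenient binary format that both R and Python can read:
# import/src/import.R (hypothetical)
library(readr)
library(arrow)

# the slow part: parse the large text file once, getting column types right
raw <- read_csv("input/big-file.csv",
                col_types = cols(id     = col_character(),
                                 date   = col_date(format = "%Y-%m-%d"),
                                 amount = col_double()))

# the fast part: write a binary file that downstream tasks can load in seconds
write_feather(raw, "output/big-file.feather")
The next task then begins by reading the feather file, so editing and re-running downstream code no longer means waiting for the text file to parse.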
R code developed interactively will sometimes contain code that looks like this:
cleaned_data <- process(raw_data)
cleaned_data
table(cleaned_data$category_code)
This code makes sense from the standpoint of interactive development. We’ll usually include lines like that because we’re checking that process still works as expected, or that our logic to standardize category codes captured all of the variations that appeared in raw_data. But code like that in a script that is called in batch mode via Rscript, for instance from a Makefile, will just print output to the screen without stopping for you to review. It adds nothing to the script, and if for some reason process is not working as expected, you may not find out unless/until you notice some weird results downstream.
To make this code safe for production, we can step back and review why those two lines of code, which print output to the screen but don’t produce or modify anything, exist. In fact, they exist as informal unit tests. Interactively, I am checking that the number of rows and columns in cleaned_data are as expected, or that category_code only takes the expected values. Knowing that, I’ll replace this bit of prototyping code with something like:
cleaned_data <- process(raw_data)
stopifnot(nrow(cleaned_data) == 7926)
stopifnot(ncol(cleaned_data) == 14)
stopifnot(setequal(cleaned_data$category_code, c("A", "B", "C")))
Now, every time this task produces some output, I’ll know that cleaned_data looks the way I expect it to, and I won’t have any surprises downstream. If any processing steps do not go as planned, the task will not produce any output, which means that no downstream task will be able to do anything until we’ve debugged. This is great! Having a loud error crash everything and refuse to continue until we’ve debugged is infinitely preferable to having code that runs without errors and silently produces something incorrect. If you can further craft your error messages in a way that helps the reader more easily identify the problem, even better! My error messages often look something like this:
expected <- 7926
actual <- nrow(cleaned_data)
if (expected != actual) {
    stop("Expected ", expected, " rows in cleaned_data, but got ", actual)
}
That way, if there is a problem when I run make, I’ll see a message like:
Error: Expected 7926 rows in cleaned_data, but got 9134.
That gives me some starting points for debugging.
It works! Now quickly save and close before I screw anything up.
A lot of us data analysts came to coding after first spending time using predominantly GUI tools, such as Excel and SPSS. Sometimes our scripts can betray that history, as they come to look a lot like text-based recordings of every single “button” we pushed in the course of producing the analysis:
library(dplyr)
library(stringr)
library(somePackageIHeardAbout)
mydata <- read.csv("input/mydata.csv")
yrs <- unique(mydata$year)
mydata <- mydata %>% filter(type %in% c("A", "C")) %>%
    select(name, month, year, address, type) %>%
    mutate(name = str_to_upper(name), name = str_trim(name)) %>%
    mutate(name = str_to_lower(name), name = str_replace_all(name, "\\s+", " "),
           address = str_to_lower(address), address = str_replace_all(address, "\\s+", " "))

# TODO: check if any other months are misspelled in the original data
mydata <- mydata %>%
    mutate(year = as.integer(year), month = ifelse(month == "Agust", "August", month),
           month = ifelse(month == "Dec", "December", month))
saveRDS(mydata, "output/cleandata.rds")
Now, this code may satisfy our definition of reproducibility. But it is going to be very hard to audit. At the end of the process, is name lower- or upper-case? What is the purpose of the yrs object? Does somePackageIHeardAbout ever get used? And as a script like this one gets longer and longer, it becomes increasingly difficult to see clearly what the script is doing, as everything blurs into a jumble of string-manipulations and one-off fixes.
As with any writing, good readable code will often require you to re-visit and re-organize and ruthlessly edit. The goal is not a video recording of every button you pressed to get some output, it is to describe in a human-readable way how the input becomes the output, and to provide instructions for the machine to produce that transformation. So after reviewing the above code with an editor’s eye, I might end up with:
library(dplyr)
library(stringr)

####
mydata <- read.csv("input/mydata.csv")

OK_TYPES <- c("A", "C")

####
####
clean_string <- function(string) {
    out <- str_trim(string)
    out <- str_to_lower(out)
    out <- str_replace_all(out, "\\s+", " ")
    out
}

fix_month <- function(month) {
    GOOD_MONTHS <- c("January", "February", "March",
                     "April", "May", "June", "July",
                     "August", "September", "October",
                     "November", "December")

    fixed_month <- case_when(month == "Agust" ~ "August",
                             month == "Dec" ~ "December",
                             TRUE ~ month)

    stopifnot(all(fixed_month %in% GOOD_MONTHS))
    fixed_month
}

####
mydata <- mydata %>%
    filter(type %in% OK_TYPES) %>%
    select(name, month, year, address, type) %>%
    mutate(name = clean_string(name),
           address = clean_string(address),
           year = as.integer(year),
           month = fix_month(month))

saveRDS(mydata, "output/cleandata.rds")
Note that the goal of editing wasn’t necessarily to make the script shorter, or even to improve its performance. It was just to make it easier for a human being to follow. As the saying goes, programs must be written for people to read, and only incidentally for machines to execute.
This post is built from HRDAG’s collective accumulated wisdom, picked up over decades of managing complex, long-lived, collaborative data analysis projects. It reflects that experience and countless conversations. The project structure discussed in this post, with discrete tasks using Makefiles, was developed by Drs. Scott Weikart and Jeff Klingner and formalized around 2007. I’m particularly grateful to Dr. Patrick Ball, who motivated the post, helped me think through the concepts, and provided valuable feedback.