I’m a high school student who has been working with Patrick Ball here at HRDAG as an intern over the summer. I arrived here after asking my parents about what jobs existed that used data analysis to help people. Through our friend and human rights lawyer Kathy Roberts at the CJA, I was introduced to Patrick and have been lucky enough to spend my time here learning what data analysis looks like as an actual job. Here’s an overview of the tools and methods I have learned so far at HRDAG.
Prior to this summer, I didn’t realize why data science is so hard. Because of the number of people who work on our projects and the significance of the work we do, one of the first things I learned was the importance of guarding against mistakes. It turns out that when you are dealing with a lot of sensitive data, there are a lot of ways things can go wrong. We can accidentally delete data files, we can find bugs in code that’s 10 years old, we can put things in the wrong places, and we can make any number of other missteps. Because of this, HRDAG has created and adopted systems that help keep us from messing up. In this post I’m going to share what I think is great about four of our key tools: our File Structure, our Editors, Version Control, and Eleanor (our server).
Ancient Romans organized all their military camps in exactly the same way. That way, soldiers from all over the empire could march into any camp and know exactly where to go and how to get to work. At HRDAG, we take the same approach to our file structure.
We organize everything into tasks: small segments of work that do something specific (for more on this see Patrick Ball’s post from last year, “The Task is a Quantum of Workflow”).
All tasks have these three directories: input, output, and src (src is how source is spelled in Unix). The input and output directories hold the data, and the src directory holds the code. This makes it super easy to find whatever you need.
There are other directories that aren’t used for every task, but have specified purposes as well:
- hand, which holds hand-written but machine-read files
- doc, which holds documentation
- frozen, which holds data files that have been processed without code (These are usually files with very old formats that our current tools don’t support.)
- note, which holds prototype code. Prototype code usually runs interactively so that we can see outputs after every block of code.
This standard structure makes it easy to find whatever we need. But more importantly, it simplifies our thought processes. When we don’t need to worry about where things go or where they are, we can worry about the real work. We make fewer careless errors and we don’t have to think about where to write that or where this came from.
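To make the layout concrete, here is a minimal sketch in Python of setting up a task skeleton. This is my own illustration, not HRDAG’s actual tooling: the function name `make_task` and the task name `clean` are made up, and the only thing taken from the description above is the list of directory names.

```python
from pathlib import Path
import tempfile

# Directories every task has, per the description above.
REQUIRED_DIRS = ("input", "output", "src")
# Directories that only some tasks use.
OPTIONAL_DIRS = ("hand", "doc", "frozen", "note")


def make_task(root, name, extras=()):
    """Create a task directory with the three required subdirectories,
    plus any requested optional ones. Returns the task path.

    Hypothetical helper for illustration only.
    """
    unknown = set(extras) - set(OPTIONAL_DIRS)
    if unknown:
        raise ValueError(f"unknown task directories: {sorted(unknown)}")
    task = Path(root) / name
    for sub in (*REQUIRED_DIRS, *extras):
        (task / sub).mkdir(parents=True, exist_ok=True)
    return task


if __name__ == "__main__":
    # Build a throwaway task in a temporary directory and list what was made.
    with tempfile.TemporaryDirectory() as tmp:
        task = make_task(tmp, "clean", extras=["hand", "note"])
        print(sorted(p.name for p in task.iterdir()))
        # prints ['hand', 'input', 'note', 'output', 'src']
```

Rejecting unknown directory names is a design choice in this sketch: the point of a standard structure is that nobody has to guess, so a typo like `inputs` should fail loudly instead of quietly creating a new convention.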
We also have a higher-level file structure: individual directories hold all the data cleanup, match directories match records, MSE directories contain the Multiple Systems Estimation tasks, and more. This higher-level organization is incredibly useful for finding things in other projects or old code from current projects. I was amazed when Patrick found code in seconds from a 2008 project he hadn’t touched in years.
We also use symlinks. Symlinks are a way of making the same file appear in multiple places without copying it. You start with the original file, and then you can symlink it into a new directory. The symlink will tell you where it points to, so it’s easy to tell what the file is. Our most common use for this is symlinking data files from the output of one task into the input of another. This way we can both see where the data comes from and keep it up to date in case we change something upstream.
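Here is a small sketch of that output-to-input pattern in Python, assuming a Unix-like system. The task names `clean` and `match`, the file `records.csv`, and the helper `link_output_to_input` are all made up for the example, and using a relative link target is my own assumption, not something I know about how HRDAG sets up its links.

```python
import os
import tempfile
from pathlib import Path


def link_output_to_input(upstream_file, downstream_input_dir):
    """Symlink a file from one task's output into another task's input.

    Hypothetical helper. The link target is relative (an assumption of
    this sketch) so the link still works if the whole project moves.
    """
    upstream_file = Path(upstream_file)
    downstream_input_dir = Path(downstream_input_dir)
    link = downstream_input_dir / upstream_file.name
    target = os.path.relpath(upstream_file, start=downstream_input_dir)
    link.symlink_to(target)
    return link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        out_dir = root / "clean" / "output"
        in_dir = root / "match" / "input"
        out_dir.mkdir(parents=True)
        in_dir.mkdir(parents=True)

        data = out_dir / "records.csv"
        data.write_text("id,name\n1,A\n")

        link = link_output_to_input(data, in_dir)
        print(os.readlink(link))   # the link reports where it points

        # Change the upstream file; the link sees the new contents.
        data.write_text("id,name\n1,A\n2,B\n")
        print(link.read_text())
```

Because the link is only a pointer, there is never a second copy of the data to fall out of date.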
One thing I learned this summer is that among programmers, there is a bit of a war about which editor is best. The two big contenders are emacs and vim. I won’t get into which is objectively better (vim), but I will say that which editor you use is incredibly important. Tools like TextEdit don’t work because they don’t allow easy movement and reformatting. Editors are all about reducing friction between ideas and code. When you have an idea for a way to make your code better, it often involves multiple modifications in multiple places, and the sooner you can get it down on screen, the less chance something slips your mind.
We also need a way to keep track of our changes and keep everyone up to date. For this we use two different systems: git and snap.
We use git to store code, documentation, remapping keys, and everything else that isn’t data. We don’t keep data in git for two reasons:
- We compress our data into raw binary, so to git a change in one byte looks like a change to the whole file. This results in us having to store two copies of almost the same file, instead of the original file and a small record of what has changed, and
- We can’t put sensitive data on GitHub.
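The first reason can be illustrated with a short Python experiment. This is only a sketch: it uses `zlib` as a stand-in for whatever compression our data files actually use, and the records are made up. It changes a single byte in some uncompressed data and then counts how many bytes differ before and after compression.

```python
import zlib


def count_differing_bytes(a, b):
    """Count positions where two byte strings differ, plus any length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Made-up records standing in for a real data file.
original = "".join(
    f"{i},person-{i},2008-01-{i % 28 + 1:02d}\n" for i in range(2000)
).encode()

# Change exactly one byte, at the very start of the file.
edited = b"X" + original[1:]

raw_diff = count_differing_bytes(original, edited)
compressed_diff = count_differing_bytes(
    zlib.compress(original), zlib.compress(edited)
)

print("bytes changed before compression:", raw_diff)
print("bytes changed after compression: ", compressed_diff)
```

A one-byte edit to the plain data stays a one-byte difference, but after compression the difference is spread over more of the file, which is why a tool that stores changes between versions has little to work with.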
Of these two issues, the second is the bigger problem. So we use a tool called snap, which was written by Scott Weikart and Patrick. snap solves both of these problems. If you want to learn more about snap, you can go to the repo on GitHub. snap is hosted on Eleanor, our local server, so the data never leave our sight. Which brings me to the fourth tool…
At HRDAG, we each have our own computer, but most of the time we’re actually working on Eleanor, our server named after Anna Eleanor Roosevelt, the single person most responsible for the Universal Declaration of Human Rights. Eleanor (the computer) has:
- Tons of fast storage, which means that all of our read and write operations are done in a few seconds.
- Lots of computing power, which means that the complex simulations we occasionally run take hours, not days like they would on one of our desktop computers.
- snap and our data files, which means that all the snap operations are as fast as a disk write.
It may seem that these tools are far removed from the important work that HRDAG does. But I’ve learned a lot about how complex problems can be broken down into manageable tasks. As a young person, I’m starting to think that I understand the challenges of the world, but I don’t always know how to find solutions for them. Seeing how the team at HRDAG takes a systematic approach to human rights has made me think about problems I see in my school and community and how I can apply similar tactics and tools to effect change.
That’s my newcomer’s summary of how the HRDAG tools make our work faster and safer. We have a great standard file structure, we use the best editor, we use separate version control systems for code and data, and we have a great server.