How to Become a Data Scientist: My Lessons at HRDAG

I met Patrick Ball almost three years ago at a conference about transitional justice. At that time, I was finishing my master’s degree in economics and working as a researcher for Dejusticia, an NGO based in Colombia. Colombia had signed the Peace Agreement just a few months earlier, and the media were covering a new kind of violence: the killings of social movement leaders. The big question everyone was trying to answer was: how many were being killed? I didn’t know it at the time, but sharing my thoughts about this with Patrick would be the start of a life-changing adventure.

When I went back to Colombia, HRDAG and Dejusticia partnered in search of an answer to my question. Patrick explained the statistical method we were going to use, and I was in charge of collecting the data and giving him a database to work with. We published our first report one year after meeting. That experience introduced me to a completely new world: programming. I had to figure out how to use git and RMarkdown to collaborate with Patrick. Even though I didn’t get to write code for analyzing the information, I had access to every git “commit” he made, so I could see all his work. And after publishing, I wanted to learn more.

One year later I became a visiting analyst at HRDAG. I left my work in Colombia for three months to live in San Francisco and continue learning from Patrick. At the time, I didn’t know I was going to meet such incredible people, who would help me through the journey. On my first day, Megan Price, Camille Fassett, and Patrick introduced me to HRDAG. We spent about an hour talking about workflow. At the time it did not seem very efficient to have multiple tasks, especially in cases where they could be really short, but now I get it. HRDAG’s workflow allows them to work according to their four principles: reproducibility, scalability, transparency, and auditability. Having small tasks makes it easier to find mistakes and to ensure that our results are correct. It also allows us, as Gus Brocchini said in a marvelous speech, to go back to projects that are many years old and comprehend them.

The rest of my day was dedicated to what they call “basic setup.” And I smile at the word “basic,” because for me it was not basic at all. Camille spent the whole evening helping me get my computer ready. The basic toolkit consists of four tools: SSH, vim, the terminal, and git. Out of those four, I only knew git (and I only knew how to use it through the web page, so what I actually knew was GitHub).

The next day we set a goal for me: to have an update of our previous report in the next three months. It sounded easy, but it was not. Patrick, Camille, and I sat down every morning for “code reviews,” a space where we would project the code each of us had written the day before and Patrick would help us make it better. In the beginning it was a frustrating process. I spent hours coming up with code for a particular task and then Patrick would ask me to do it all over again using only tools from the tidyverse meta-package. Again, I didn’t get it at the time, but now I do. A data scientist should know her toolbox and should improve her tools. If we specialize in a collection of packages, such as the tidyverse, we will feel comfortable using them and learn how to exploit them. All of this happened while I was trying to learn how to leave the RStudio environment and use only the command line, which seemed really expensive in terms of time and effort. Remember, I only had three months.

HRDAG’s environment made it clear to me that I did not have to make the transition away from my mix of packages and from the RStudio environment on my own. I relied on Tarak Shah for this. I would bombard him with questions, and he always had time to answer them. He ended up being one of the greatest teachers I have ever had. But he was not the only one. Megan was also available and would find time in her schedule to sit down and read my code, as Patrick did. I knew all of them had tons of work, but I never felt ashamed to ask.

In the end, I made it. I was able to finish the update of our previous report. It was not easy, but it was worth it. Now, I know that I want to continue on this path and that I will keep learning to become a data scientist. During the process I learned at least five lessons that I hope will be useful for other people beginning this journey:

  1. Read more code! Patrick said this to me at least every two days. Learning how to program is like learning a language. There is no better way than practicing by reading (and writing).
  2. Find a mentor (if you are lucky enough, you will end up with more than one). It is useless to read code on Stack Overflow. The thing with code is that you can accomplish what you are trying to do in many different ways. Therefore, you need guidance on what good code looks like. Having Patrick and Tarak share their code and sit down every day with me to improve mine taught me how to write cleaner code and how to improve my skills by specializing in particular packages.
  3. Don’t be afraid to ask. Patrick would always say that in the very beginning, I should spend five minutes trying to figure something out. As I learned more, I needed to spend 10 minutes, and later, 15. Why? Because in the beginning what can take me up to three hours will probably take someone like Megan 30 seconds. Ask, so that your learning process is more efficient. But don’t ask immediately: do some research first.
  4. Use HRDAG’s workflow. When you have small tasks, you will not only have the benefits I explained before, but you will also feel fulfilled. You will feel that every time you finish a task you are getting closer to your goal. It will encourage you.
  5. Find a friend. Camille and I write in different languages, so we were not able to help each other with our programming skills. Learning how to program can be really frustrating and you will most likely feel really dumb many times. Find someone to walk with you through this journey.

I look forward to becoming the best data scientist I can be. And I am sure that my path will continue crossing with the amazing individuals I met at HRDAG. Now, I will be a data scientist at the Truth Commission in Colombia. I will use the skills and culture I learned from HRDAG’s team to contribute to understanding how the conflict has affected the people in my country. Through data, and with an interdisciplinary team, I will contribute to truth.

Image: Scott Stevens from the Transitional Justice Working Group, Patrick Ball, and Valentina Rozo Ángel at a TJWG conference in Seoul.
