Protecting the Privacy of Whistle-Blowers: The Staten Island Files

When Tarak Shah began to dig into a project that would identify police officers accused of misconduct, he realized that he needed a better understanding of privacy. The project, known as the Staten Island Files, involved the online publication of hundreds of complaints naming New York City Police Department officers and their alleged offenses. The complaints, however, contained sensitive information that could be used to identify the people reporting the offenses. A breach of privacy could present a substantial risk, as some of the people who reported offenses lived in the same neighborhoods patrolled by the accused officers, and they feared reprisals if their identities were discovered.

Protecting the witnesses’ privacy was paramount. Not only did the files—upwards of 5,000 scanned PDFs—contain the names of people reporting offenses, they also contained court case numbers and names of defense attorneys, all of which could potentially link to the person making the complaint. 

“It’s insufficient to just talk about privacy violations,” says Tarak, a statistician at HRDAG. “People who make allegations against police are at real risk of harm, in a not-abstract way. This is not the same kind of privacy issue as someone getting my credit card data.”

With the stakes so high, Tarak spent a lot of time listening to partners to learn what they needed before he created the tools for processing the thousands of files. “I wasn’t super fluent in the ways that privacy can be broken, and I learned a lot,” he says.

He became involved in the Staten Island Files project when Ford Foundation tech fellow Cynthia Conti-Cook connected him with investigative journalist George Joseph, who needed help preparing the files for publication. Tarak conferred with Joseph and his associate Luca Powell, a data journalist co-writing the report, to find out specifically what they needed help with. It turned out that they had acquired the files through a FOIA request, and the files had already been redacted, but Joseph and Powell wanted to prep the documents with an even higher standard of redaction to protect sensitive information.

“This was not a legal liability issue,” says Tarak. “This was about social responsibility and ethical liability.”

Tarak used machine learning for the first round of redactions, building an automatic redaction tool that could handle the thousands of documents. But he was concerned about what’s considered an “acceptable error rate” in machine learning tools. “There was not an acceptable number of errors in this work,” he says.

The second round, after the machine-learning pass, was a human review of every page slated for publication. In addition to the automatic redaction tool, Tarak built a named-entity extractor, which allowed him to index the documents by the names of the accused officers.

In the work he does at HRDAG, Tarak understands that there’s a certain cachet that comes with accessing and then publishing raw data. But when working with any partner who wants to publish “raw data,” he urges caution. Without proper processing or context, the documents may not prove useful. In addition to the risk of violating privacy, if proper context is not provided, publishing a set of files could misinform, feed into the wrong narrative, or even be subject to a hostile interpretation.

“We have to take responsibility for how raw data get used,” he says.

Illustration by David Peters.
