How Machine Learning Protects Whistle-Blowers in Staten Island

When investigative journalist George Joseph needed help preparing for the online publication of hundreds of complaints naming New York City Police Department officers and their alleged offenses, Ford Foundation tech fellow Cynthia Conti-Cook connected him to HRDAG and Tarak Shah. The complaints that Joseph wanted to publish contained sensitive information that could be used to identify the people reporting the offenses. A breach of privacy could present a substantial risk, as some of the people who reported offenses lived in the same neighborhoods patrolled by the accused officers, and they feared reprisals if their identities were discovered. 

Tarak collaborated with Joseph and his associate Luca Powell, a data journalist co-writing the report, to better understand the data challenge. It turned out that they had acquired the files through a FOIA request, and the files had already been redacted—but Joseph and Powell wanted to prep the documents with an even higher standard of redaction, to protect sensitive information. 

Tarak used machine learning for the first round of redactions, building an automatic redaction tool that could handle the thousands of documents. But he was wary of the idea, common in machine learning, that a tool can have an “acceptable error rate.” “There was not an acceptable number of errors in this work,” he says.

“This was not a legal liability issue,” says Tarak. “This was about social responsibility and ethical liability.”

In Joseph and Powell’s published story in Gothamist/WNYC’s Race & Justice Unit, this editor’s note is included:

Editor’s Note: In publishing the Freedom of Information Law documents from the Staten Island District Attorney’s Office, Gothamist/WNYC relied on pro-bono technical assistance from the Human Rights Data Analysis Group in making redactions. “At HRDAG, we seek to promote accountability for human rights violations, and we believe truth and transparency are required for accountability,” said Tarak Shah, a data scientist with the organization. “Document collections like this one are an important tool in the quest for truth and accountability, but require rigorous processing to make them suitable for those purposes. For this project, we built a named-entity extractor to index each document by officer name, and an automatic redaction tool to protect the sensitive information of private citizens.”
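
To give a rough sense of what the two tools named in the editor's note do, here is a minimal sketch in Python. It is not HRDAG's code, and it does not use machine learning: it swaps in simple hand-written patterns as a stand-in, purely for illustration. The patterns, the function names `redact` and `index_by_officer`, and the sample documents (with placeholder names and a fictional phone number) are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical pattern for one kind of sensitive detail (US-style phone numbers).
# A real redaction tool would need to cover names, addresses, and much more.
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

# Hypothetical pattern: a rank or title followed by one or two capitalized words.
# This is a crude stand-in for a trained named-entity extractor.
OFFICER = re.compile(
    r"\b(?:Officer|Sgt\.|Det\.|Lt\.)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)"
)


def redact(text):
    """Replace every phone-number match with a fixed placeholder."""
    return PHONE.sub("[REDACTED]", text)


def index_by_officer(docs):
    """Map each extracted officer name to the sorted IDs of documents naming them."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for name in OFFICER.findall(text):
            index[name].add(doc_id)
    return {name: sorted(ids) for name, ids in index.items()}


# Made-up sample documents; the names and number are placeholders.
docs = {
    "doc1": "Complainant at (718) 555-0100 reported Officer Jane Doe.",
    "doc2": "Officer Jane Doe and Det. John Roe responded.",
}

redacted = {doc_id: redact(text) for doc_id, text in docs.items()}
officer_index = index_by_officer(docs)
```

Even in this toy version, anything the patterns fail to match passes through untouched, which is why a zero-tolerance standard for errors means automated output cannot simply be trusted as is.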

Further reading

Gothamist. George Joseph and Luca Powell. 17 February, 2021.
The Staten Island Files: Explore Hundreds of Previously Secret NYPD Misconduct Findings

Related publications

HRDAG. Christine Grillo. 9 March, 2021.
Protecting the Privacy of Whistle-Blowers: The Staten Island Files

Acknowledgments

This work was supported by the Ford Foundation, MacArthur Foundation, and Open Society Foundations.

Image: NASA. 



