Millions of Pages of Police Use-of-Force Files Available through New Searchable Database

A new, public database will bring more oversight to police abuses in California—and may serve as a model for police accountability for other states across the country.

HRDAG was part of a coalition behind the recently launched Police Records Access Project. The new searchable database includes millions of pages of documents about tens of thousands of police use-of-force and misconduct cases. The database includes documents related to three types of law enforcement incidents in California: instances in which a police weapon was discharged, instances in which officers used force that resulted in serious injury or death, and instances in which an agency determined an officer violated certain department rules (like lying or sexual abuse).

HRDAG’s data scientist Tarak Shah, working on a contract with the UC Berkeley Institute for Data Science, was a key leader of the project, acting as a program manager to help ensure the successful launch of the new database. As Tarak said: “This database project will allow the public to better understand police violence in California. I hope that our research can be a resource for communities seeking justice and police accountability.” 

Getting Records Was Half the Battle

The Police Records Access Project began with a law passed in California, the 2018 Right to Know Act, which granted the public the right to access certain records related to police misconduct and use of force.  

There are nearly a hundred thousand documents in the database, but it wasn’t easy to get them. One problem was that these documents are held by hundreds of different agencies throughout California, including local police and sheriff’s departments, correctional facilities, university police, transit police, highway patrol, local probation offices, the California Department of Justice, and many others. The coalition had to request records from each of the nearly 700 entities.

Then the negotiations began. Police departments might not respond to requests swiftly or might send incomplete documents. Solving that problem took creativity. As Tarak explained, “We developed software to help our records requesters analyze the records we received and flag disclosures that were likely incomplete so that we could follow up with specific agencies that sent only partial records.” 

One important way of flagging incomplete records was to cross-check results against other known databases, especially the Use of Force Incident Reporting database, the Deaths in Custody and Arrest-related Deaths database, and the Civilians’ Complaints Against Peace Officers database, all published by the California Department of Justice. Cross-checking data across multiple databases is a core method for human rights data scientists and a cornerstone of detecting incomplete data.
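The cross-check described above can be sketched in a few lines of Python. This is only an illustration, not the coalition's software: the function name `find_missing_incidents`, the idea of matching on an (agency, incident date) pair, and the sample records are all assumptions made for the example.

```python
def find_missing_incidents(reference_incidents, disclosed_cases):
    """List reference incidents with no matching disclosed case, by agency.

    Both arguments are iterables of (agency, incident_date) pairs.
    Matching on this pair is a simplifying assumption for illustration.
    """
    disclosed = set(disclosed_cases)
    missing = {}
    for agency, incident_date in sorted(set(reference_incidents)):
        if (agency, incident_date) not in disclosed:
            missing.setdefault(agency, []).append(incident_date)
    return missing


# Hypothetical data: incidents listed in a state database vs. records received.
reference = [
    ("Agency A", "2019-03-02"),
    ("Agency A", "2020-07-15"),
    ("Agency B", "2021-01-09"),
]
disclosed = [
    ("Agency A", "2019-03-02"),
    ("Agency B", "2021-01-09"),
]

missing = find_missing_incidents(reference, disclosed)
print(missing)  # {'Agency A': ['2020-07-15']}
```

An agency that appears in the result is a candidate for a follow-up request. Real records rarely match this cleanly, so a person would still need to review each flag.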

Our coalition often sent repeat requests for information to law enforcement agencies, waiting months for replies. 

Another challenge we faced was how police departments understood their obligations under the law. While they are obliged to share records related to force that caused “great bodily injury” or death, there wasn’t a clear understanding of what counted as great bodily injury. “Sometimes, important records were initially withheld due to an overly conservative interpretation of what counts as great bodily injury,” Tarak explained. He also pointed to a lawsuit in which journalists won access to a large swath of records initially withheld by the police. As the San Jose Mercury News reported, the records revealed that the Richmond police used force that caused significant injuries 122 times over a six-year period, and more than half of those incidents involved police canines.

LLMs for Good

Like many of the large-scale data projects HRDAG works on these days, this project used Large Language Models (LLMs) to complete tasks that would have been impossible, or much more expensive, in an earlier era.

Records arrived disconnected and without helpful metadata or documentation. The team used LLMs to extract key facts which, with continuous manual supervision, were used to cluster documents into “cases” about the same incident or investigation. From there, LLM extraction was used again to extract case information required to compare disclosed cases to other state databases to measure completeness. LLMs also enabled extraction of structured fields such as dates and incident types that facilitate sifting through the cases in the database.
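To make the clustering step concrete, here is a minimal sketch of grouping documents into cases once key facts have been extracted. It assumes each document already carries hypothetical `agency` and `incident_date` fields and uses exact matching on that pair. The project's actual clustering logic is not described in this article and was supervised by people, so treat this as an illustration of the idea only.

```python
from collections import defaultdict


def group_into_cases(documents):
    """Group documents that share the same extracted agency and incident date.

    Documents missing either field are returned separately so that a person
    can review them. Field names are hypothetical.
    """
    cases = defaultdict(list)
    needs_review = []
    for doc in documents:
        agency = doc.get("agency")
        incident_date = doc.get("incident_date")
        if agency and incident_date:
            cases[(agency, incident_date)].append(doc["doc_id"])
        else:
            needs_review.append(doc["doc_id"])
    return dict(cases), needs_review


# Hypothetical extracted facts for four documents.
documents = [
    {"doc_id": "d1", "agency": "Agency A", "incident_date": "2019-03-02"},
    {"doc_id": "d2", "agency": "Agency A", "incident_date": "2019-03-02"},
    {"doc_id": "d3", "agency": "Agency B", "incident_date": "2021-01-09"},
    {"doc_id": "d4", "agency": "Agency B", "incident_date": None},
]

cases, needs_review = group_into_cases(documents)
print(cases)
print(needs_review)
```

Here the first two documents land in one case, the third in another, and the fourth is set aside because its date could not be extracted.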

Within the search tool itself, LLMs are used to rank and sort search results, which has led to more relevant results appearing first, a better search experience, and less time spent finding specific records. Finally, LLMs were key to removing information that might be seen as sensitive, such as Social Security numbers, the names and addresses of civilians, and medical details.

It was important to check the accuracy of the machine learning models. Tarak and the team did this by sampling the data, manually labeling the sample, and comparing these hand-labeled “ground truth” values to the extracted values. Given the high cost associated with publishing incorrect facts, they set high thresholds for precision for published extracted fields, and iterated on the extraction logic, collecting more labeled data as required, in order to meet those thresholds. “It’s important to have clearly defined criteria for ‘correct’ or ‘incorrect’, and clear and calculatable metrics for quality that are relevant to perceived risks and needs, and then we can analyze the degree of accuracy for our LLM,” Tarak explained.
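The precision check described above can be illustrated with a short Python sketch. Precision here means the share of extracted values that match the hand label. The function names, the sample data, and the 0.95 threshold are assumptions made for the example; the article does not state the thresholds the team actually used.

```python
def precision(predictions, ground_truth):
    """Fraction of extracted values that match the hand-labeled value.

    Only documents with both a non-empty extraction and a hand label are
    counted. Returns None when there is nothing to score.
    """
    scored = [
        (doc_id, value)
        for doc_id, value in predictions.items()
        if value is not None and doc_id in ground_truth
    ]
    if not scored:
        return None
    correct = sum(1 for doc_id, value in scored if ground_truth[doc_id] == value)
    return correct / len(scored)


def meets_threshold(predictions, ground_truth, threshold):
    """True if measured precision reaches the (hypothetical) threshold."""
    score = precision(predictions, ground_truth)
    return score is not None and score >= threshold


# Hypothetical hand labels and model extractions for an incident-date field.
ground_truth = {
    "d1": "2019-03-02",
    "d2": "2020-07-15",
    "d3": "2021-01-09",
    "d4": "2018-11-30",
}
predictions = {
    "d1": "2019-03-02",
    "d2": "2020-07-15",
    "d3": "2021-01-10",  # wrong by one day
    "d4": None,          # nothing extracted, so not counted against precision
}

score = precision(predictions, ground_truth)
print(round(score, 3))                                     # 0.667
print(meets_threshold(predictions, ground_truth, 0.95))    # False
```

A field that falls short of its threshold would go back for another round of iteration on the extraction logic, possibly with more labeled data, rather than being published.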

“This type of data tagging and analysis would have been very difficult before the latest wave of LLMs. Human annotation does not scale up that well. Traditional machine learning techniques help a lot, but would struggle with this project’s combination of heterogeneity in the types of documents in the collection, along with granularity of extraction requirements.”

This project relied on grants of compute credits on Microsoft Azure to handle the massive data storage and computing needs for managing and extracting data from such a large collection (approximately 23 TB and counting). The team also continues to iterate on open source models for handling data processing and data extraction. “Even with fine-tuned open source models and pipelines, right now the accountability community does not collectively have access to the amount of compute it would require to fully liberate ourselves from reliance on corporate generosity. But over time, that’s definitely something we’re interested in,” said Tarak.

Early Insights

The records that are available through the Police Records Access Project will help researchers understand police violence for many years to come, but already some patterns are showing up in the data. Tarak was struck by how many deaths reported in the database had a cause of death listed as something other than the law enforcement officer’s action. The pattern looks something like this: officers engaged a civilian and used some form of force, such as a prone choke hold, and then the civilian died. Instead of police homicide, the medical examiner’s report might list a cause of death such as “acute drug toxicity.” This observation echoes recent reporting.

Tarak noted that in California, unlike in many other states, medical examiners can be part of the sheriff’s department.

How to Use the Database

The Police Records Access Project database (also known as the CLEAN database) can be found at any of the following links:

https://clean.calmatters.org/

https://policerecords.kqed.org/

https://clean.latimes.com/

https://clean.sfchronicle.com/ 

https://policerecords.laist.com/

Anyone can begin searching the database and downloading documents. For example, searching for terms like “spit hood” (a head covering used by police on arrestees), “canine,” or “pepper spray” can surface many records in which those tools were used.

Going Forward

Maintaining and updating the Police Records Access Project is an ongoing commitment. To ensure it remains relevant and accurate, coalition partners will continue submitting public records requests to local agencies, pushing for thorough responses, and organizing and tagging data to include in the database. They’re also working to add more structured data to the database.

The hope is that this will continue to be a vital resource for reporters, defense attorneys, community organizers, and those who have lost loved ones to state violence. It can also serve as a tool for accountability: police officers who use inappropriate force in one jurisdiction cannot as easily move to another jurisdiction in California without leaving a publicly accessible data trail. With improved monitoring of police abuses across California, advocates have a better chance of getting abusive officers prosecuted or decertified (losing their certification as a law enforcement officer). It can also support the families of victims seeking justice through civil lawsuits.

This type of oversight can also fuel reform efforts. Tarak pointed to the case of Berkeley, CA resident Kayla Moore, a Black transgender woman who was killed by suffocation in her bed by Berkeley police after a friend called police to report concerns about Moore’s mental condition. Moore’s survivors and the broader community of Berkeley have honored Moore’s memory by advocating for compassionate responses to mental health crises.

The impact doesn’t end with California. Other states looking to improve police accountability can adopt their own versions of the Right to Know Act and create databases of records like the Police Records Access Project.  “I’ve already started meeting with researchers in other states that are looking to build similar projects,” Tarak confirmed. 

Want to hear about data projects like this? Subscribe to our Substack newsletter for in-depth articles about our data analysis in support of human rights.

A Coalition Effort

Funding for the project was provided by the State of California, the Sony Foundation, and Roc Nation. Our partners on this project were:  

ACLU of Southern California

ACLU of Northern California

Bay Area News Group / Southern California News Group

Berkeley Institute for Data Science

Big Local News at Stanford University

CapRadio

EPIC Data Lab

Human Rights Data Analysis Group

Innocence Project

The Investigative Reporting Program at the UC Berkeley School of Journalism

KPCC + LAist

KQED

Los Angeles Times

National Association of Criminal Defense Lawyers

UC Berkeley School of Law

UC Irvine School of Law

 

Image credit: CC-BY-NC 2.0 Image from Thomas Hawk

