How LLMs Made the Police Records Access Project Possible
In 2018, California passed the Right to Know Act, a police transparency law intended to increase police accountability by requiring law enforcement agencies to release records, when requested, related to officer misconduct, use of force, sexual assault, and more.
On the heels of this legislation, a coalition formed to put the new law into practice: the Community Law Enforcement Accountability Network (CLEAN), which built the Police Records Access Project. The coalition is a nationwide collaborative effort involving journalists, data scientists, public defenders, community advocates, human rights defenders, and First Amendment lawyers, and its mission is to obtain and distribute law enforcement records to the public. It comprises 18 organizations, among them the Berkeley Institute for Data Science, the Innocence Project, the National Association of Criminal Defense Lawyers, and the Human Rights Data Analysis Group (HRDAG).
HRDAG collaborated with CLEAN to launch the Police Records Access Project, which went live in August 2025. It's a searchable database with millions of pages of documents related to tens of thousands of police use-of-force and misconduct cases. The records available through the Police Records Access Project will help researchers document patterns in police violence and build legal cases for many years to come.
Helping to build the database was a heavy lift. There are millions of documents in the database, disclosed by hundreds of different agencies throughout California, including local police and sheriff's departments, correctional facilities, university police, transit police, highway patrol, local probation offices, the California Department of Justice, and many others. The coalition requests records from each of the nearly 700 entities every year.
Records arrived disconnected and without helpful metadata or documentation. The team used large language models (LLMs) to extract key facts, which, with continuous manual supervision, were used to cluster documents into "cases" about the same incident or investigation. From there, the team used LLMs again to pull out the case details needed to compare disclosed cases against other state databases and measure how complete the disclosures were. LLMs also extracted structured fields, such as dates and incident types, that make it easier to sift through the cases in the database.
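As a rough illustration of how this kind of extraction and grouping can work, the sketch below asks a model to return a few structured fields as JSON and then groups pages that share an agency and a case number (or incident date) into candidate cases. The prompt wording, the field names, and the generic `complete` callable are assumptions made for illustration; they are not the project's actual pipeline.

```python
import json
from collections import defaultdict
from typing import Callable

EXTRACTION_PROMPT = """Extract the following fields from the police record excerpt below.
Respond with JSON only, using null for any field not stated in the text.
Fields: agency, incident_date (YYYY-MM-DD), incident_type, case_number.

Excerpt:
{text}
"""

def extract_fields(text: str, complete: Callable[[str], str]) -> dict:
    """Ask an LLM for structured fields; `complete` is any prompt-in, text-out function."""
    raw = complete(EXTRACTION_PROMPT.format(text=text))
    try:
        fields = json.loads(raw)
    except json.JSONDecodeError:
        fields = {}  # malformed model output gets routed to manual review
    return {k: fields.get(k) for k in ("agency", "incident_date", "incident_type", "case_number")}

def group_into_cases(pages: list[dict]) -> dict:
    """Group pages that share an agency and a case number (or, failing that, an
    incident date) into candidate 'cases' for a human reviewer to confirm or split."""
    cases = defaultdict(list)
    for page in pages:
        fields = page["fields"]
        key = (fields.get("agency"), fields.get("case_number") or fields.get("incident_date"))
        cases[key].append(page)
    return dict(cases)
```

In practice, as noted above, groupings like these were continuously checked by people; an automated pass only produces candidates for review.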
Within the search tool itself, LLMs are used to rank and sort search results, surfacing the most relevant documents first, improving the search experience, and cutting the time spent finding specific records. Finally, LLMs were key to identifying information that might be sensitive, such as Social Security numbers, the names and addresses of civilians, and medical details, and to flagging those records to be withheld from the public site.
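To make the ranking step concrete, here is a minimal sketch of an LLM re-ranker, assuming a generic `complete` function that sends a prompt to a model and returns its text; the prompt wording, the 0-to-10 scale, and the result fields are illustrative assumptions, not the project's actual implementation.

```python
from typing import Callable

RANK_PROMPT = """On a scale from 0 to 10, how relevant is this record summary to the
search query? Reply with a single number and nothing else.

Query: {query}
Summary: {summary}
"""

def rerank(query: str, results: list[dict], complete: Callable[[str], str]) -> list[dict]:
    """Score each candidate result with the LLM and return the list sorted
    best-first; unparseable scores default to 0 and sink to the bottom."""
    def score(result: dict) -> float:
        raw = complete(RANK_PROMPT.format(query=query, summary=result["summary"]))
        try:
            return float(raw.strip())
        except ValueError:
            return 0.0
    return sorted(results, key=score, reverse=True)
```

The sensitive-information step could look something like the sketch below: a classification prompt asks whether a page contains Social Security numbers, civilian names or addresses, or medical details, and any flagged page is withheld pending review. The labels and prompt are again illustrative assumptions.

```python
import json
from typing import Callable

SENSITIVE_LABELS = ("ssn", "civilian_name_or_address", "medical_detail")

FLAG_PROMPT = """Does the excerpt below contain any of the following: Social Security
numbers, names or addresses of civilians, or medical details? Reply with JSON like
{{"ssn": false, "civilian_name_or_address": true, "medical_detail": false}}.

Excerpt:
{text}
"""

def flag_sensitive(text: str, complete: Callable[[str], str]) -> dict:
    """Return per-category flags; output that cannot be parsed is treated as
    sensitive so that doubtful pages default to being withheld."""
    raw = complete(FLAG_PROMPT.format(text=text))
    try:
        flags = json.loads(raw)
    except json.JSONDecodeError:
        return {label: True for label in SENSITIVE_LABELS}
    return {label: bool(flags.get(label, False)) for label in SENSITIVE_LABELS}

def should_withhold(text: str, complete: Callable[[str], str]) -> bool:
    """A record is flagged to be withheld from the public site if any category trips."""
    return any(flag_sensitive(text, complete).values())
```

Flagged records can then be routed to reviewers rather than published, matching the workflow described above of withholding potentially sensitive records from the public site.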
Further reading
Prison Policy Initiative. Wendy Sawyer + Emily Widra. 26 January, 2026.
Data spotlight: Data projects tracking police misconduct, use of force, and employment histories.
Stanford Report. 9 September, 2025.
New database makes once-secret police records accessible to the public
Los Angeles Times. 4 August, 2025.
Police misconduct cases and investigations into police shootings in California are now available online.
San Francisco Chronicle. Megan Cassidy. 4 August, 2025.
Thousands of once-secret California police files made public in searchable database
KQED. Sukey Lewis + Mike Kessler. 4 August, 2025.
Thousands of Once-Secret Police Records Are Now Public. Here’s How You Can Use Them
Related publications
HRDAG. Tarak Shah. 30 September, 2025.
Pulling Back the Curtain on LLMs and Policing Data (Structural Zero 04).
HRDAG. Rainey Reitman. 14 August, 2025.
Millions of Pages of Police Use-of-Force Files Available Through New Searchable Database
Related videos and podcasts
HRDAG. 3 February, 2026.
Watch Now: The Wandering Officer—New Databases for Police Accountability
Acknowledgments
This work was supported by the Filecoin Foundation for the Decentralized Web, Ford Foundation, Heising-Simons Foundation, Hewlett Foundation, and MacArthur Foundation.
Image: David Peters.
