
Pulling Back the Curtain on LLMs & Policing Data

Structural Zero Issue 04

September 30, 2025

Artificial intelligence is transforming how we work with information. At HRDAG, that changes how I do my job every day. In my most recent project, I used LLMs to explore and parse vast quantities of data about police abuses in California.

In this newsletter, I’ll pull back the curtain on that work. I’ll describe how a diverse coalition gathered more than a million pages of documents about police misconduct in California and how LLMs helped us make sense of them in ways that wouldn’t have been possible before the advent of this technology.

Beyond explaining my work, I hope this deep dive will help you better interpret the capabilities of LLMs in general—because while these are powerful tools, they are far from infallible.

But first, we need a shared vocabulary. Many of these terms—AI, machine learning, natural language processing, LLMs—are used interchangeably in popular media, yet they mean distinct things. Understanding those differences is essential to understanding our project and, more broadly, to interpreting what today’s AI systems can and cannot do.

(If you’re already comfortable with these concepts, feel free to jump ahead to the Hunting for Records section. Otherwise, think of what follows as a mini reference guide you can return to in future issues.)

Building a Shared Vocabulary

  • Artificial Intelligence (AI)
    AI is the broad field of creating computer systems that can perform tasks we usually associate with human intelligence—finding patterns in huge datasets, learning from past information, solving problems, or recognizing images and speech. For example, an AI system might scan millions of medical images and identify subtle patterns a human radiologist could easily miss.
  • Machine Learning (ML)
    Machine learning is a subset of AI focused on algorithms that improve at a task as they process more data, without being explicitly programmed step-by-step. For example, an ML model trained on thousands of photographs can learn to recognize “peaceful” scenes—without anyone ever defining “peaceful” in code.
  • Natural Language Processing (NLP)
    NLP is the branch of AI that helps computers process and generate everyday human language, whether spoken or written. It’s what enables email spam filters, real-time translation, or chatbots that understand plain English. (Note that while the vast majority of this development is in English, there are research projects centering other languages.) Language models, which describe the likelihood of the next word in a sentence, are an important area of research and practice within NLP.
  • Large Language Models (LLMs)
    Recent innovations have facilitated the training of much larger language models than had previously been possible. These large language models (LLMs) are trained on vast collections of text and learn statistical patterns that allow them to generate fluent, natural-sounding language and to answer questions, summarize, or draft text. What they share with their smaller predecessors is that they predict words and sentences based on patterns in their training data.
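
To make “predicting the next word” concrete, here is a minimal sketch using the open-source Hugging Face transformers library and the small GPT-2 model. Both are assumptions chosen purely for illustration, not tools used in this project; the snippet asks the model which words are most likely to come next after a short phrase.

```python
# A minimal sketch of next-word prediction with a small language model.
# GPT-2 and the Hugging Face `transformers` library are assumptions chosen
# for illustration; they are not part of the project described here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The public records request was sent to the police"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, number_of_tokens, vocabulary_size)

# Turn the scores at the final position into probabilities for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob:.3f}")
```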

For readers who want to explore further: check out Loyola Marymount’s AI Key Terms, Ball State University’s Generative AI Glossary, and Georgetown’s CSET overview. Want to go deeper? Read AI Snake Oil by Arvind Narayanan and Sayash Kapoor.

Hunting for Records

In 2018, California passed an innovative law called the Right to Know Act. This granted the public the right to access certain records related to police misconduct and use of force. So if police officers are caught lying, if they shoot someone, if they injure someone badly during an arrest—all of that must be made publicly available upon request.

But the “upon request” part is important. In order to receive these documents, you must first ask for them. And there are hundreds of agencies in California that might be covered by this law—everything from local police departments to correctional facilities to university police to the California Department of Justice.

So our first step was collecting all of these records, or as many as we could. HRDAG was part of a coalition of advocacy groups, journalistic organizations, academic groups, researchers, and journalists who dedicated resources to sending out record requests to nearly 700 agencies. I worked on a specific contract with the UC Berkeley Institute for Data Science to provide project management and data analysis for this effort.

Just sending in a request for the records wasn’t enough. Police departments would often drag their heels in responding to our requests, and we’d need to send repeated follow-up requests. Or they might send partial results, and we’d have to follow up to get additional information. These negotiations were typically conducted over email, and we often had hundreds of emails going out per week to request records or follow up about partial records.

Thanks to these efforts, we were able to collect and publish over a hundred thousand documents. Those documents are now viewable to the public and researchers on CalMatters, KQED, LA Times, SF Chronicle, and LAist.

Structuring Unruly Data with LLMs

Flagging incomplete responses

LLMs proved instrumental from the very beginning of this project, first by helping us identify when the documents we received were likely incomplete.

One key way we detected that only partial records had been turned over was by cross-checking our results against other known databases, especially the California Department of Justice’s Use of Force Incident Reporting database, Deaths in Custody and Arrest-related Deaths database, and Civilians’ Complaints Against Peace Officers database. By looking at multiple datasets, we can often ascertain whether a specific incident is missing from our documents. Importantly, comparing overlapping datasets is also a key way that we extrapolate how many additional incidents might have gone unreported. (For an example of how this works, see Violence in Blue, our examination of the underreporting of police-related deaths.)
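
As a simplified illustration of that cross-checking step, the sketch below matches incidents on agency and date and flags anything in a reference database that never appears in the documents we received. The file and column names are hypothetical, and any real comparison has to tolerate near-duplicates and inconsistent identifiers.

```python
# Sketch: flag incidents that appear in a reference database but are missing
# from the documents an agency turned over. The CSV paths and column names
# are hypothetical; real matching has to handle near-duplicate records.
import pandas as pd

doj = pd.read_csv("doj_use_of_force.csv")          # reference database
received = pd.read_csv("documents_received.csv")   # what the agency sent us

# Build a simple comparable key for each incident in both datasets.
doj["key"] = doj["agency"].str.lower() + "|" + doj["incident_date"]
received["key"] = received["agency"].str.lower() + "|" + received["incident_date"]

missing = doj[~doj["key"].isin(received["key"])]
print(f"{len(missing)} incidents appear in the DOJ database "
      "but not in the documents we received")
```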

Structuring the unstructured

Once we had the responses, we dealt with a lot of raw or “unstructured” data. Unstructured data poses a unique challenge for researchers focused on policing—something several colleagues and I explored in our 2024 article for Chance magazine.

Digital documents typically arrived as unstructured blobs from which it is difficult to extract and classify information. We used LLMs to bring structure to the unstructured: analyzing the content of the documents, organizing the information, and shedding light on what is buried within them.

For example, many documents have a date somewhere on them but the format of that date can differ dramatically (“February 4, 2025” versus “Feb 4 25” versus “02042025,” etc). The location of the date can also vary—it might be stamped on the top of multiple pages, scrawled in handwriting along the bottom, or something else entirely. We used LLMs to recognize dates even when the format, style, and location of the date varied dramatically across documents, so that researchers can query the database for “all documents within this specified date range.”
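
Here is a rough sketch of what that date normalization can look like. The ask_llm helper is a hypothetical stand-in for whatever LLM client is in use, and the prompt is simplified; the point is that the model, rather than a hand-written rule for every format, maps the messy text to one standard form.

```python
# Sketch: normalize wildly varying date strings to a single format with an LLM.
# `ask_llm` is a hypothetical placeholder for whatever LLM client you use.
import json

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("wire this up to your LLM client of choice")

PROMPT = """Extract the incident date from the document text below.
Respond with only JSON of the form {{"date": "YYYY-MM-DD"}}, or {{"date": null}}
if no date is present.

Document text:
{text}
"""

def extract_date(document_text: str):
    response = ask_llm(PROMPT.format(text=document_text))
    return json.loads(response)["date"]

# Once wired to a real model, all three of these should normalize to "2025-02-04".
for raw in ["February 4, 2025", "Feb 4 25", "02042025"]:
    print(extract_date(f"Report stamped {raw} by the reviewing officer."))
```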

This doesn’t just apply to dates. The same technique could be used on each type of data that might be referenced in the documents, such as specific police tools (like tasers or canines), specific officer names, locations of incidents, victim information, policing techniques (like chokeholds or takedowns), or reasons for a police engagement (traffic violations, etc.). We used LLMs to make sense of the data even when there were misspellings, differences in word choice, or documents organized haphazardly.

Identifying related incidents

After parsing hundreds of thousands of documents with LLMs, we trained our models to group related documents into incidents. A single police use-of-force incident can generate several different documents that all refer to the same event, especially if a death is involved. Our models analyzed data points like names, locations, dates, incident descriptions, and other factors to detect when there were multiple descriptions of the same event—like a police report, an internal investigation, and a coroner’s report. We then grouped the documents into over 15,000 distinct incidents.
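
The sketch below shows the flavor of that grouping step under heavy simplification: documents that share an agency, a date, and a subject name get the same incident identifier. The data and column names are invented, and the real pipeline relied on LLM-extracted fields plus fuzzier matching rather than exact string equality.

```python
# Sketch: group documents into incidents by shared extracted fields.
# In the real pipeline the fields come from LLM extraction and the matching
# tolerates misspellings and partial information; this exact-match version
# only shows the shape of the problem. All values below are invented.
import pandas as pd

docs = pd.DataFrame({
    "doc_id":   ["d1", "d2", "d3", "d4"],
    "doc_type": ["police report", "internal investigation", "police report", "coroner report"],
    "agency":   ["Agency A", "Agency A", "Agency B", "Agency A"],
    "date":     ["2021-03-14", "2021-03-14", "2021-03-14", "2020-07-02"],
    "subject":  ["J. Smith", "J. Smith", "A. Lopez", "J. Smith"],
})

# Documents that share an agency, a date, and a subject get the same incident id.
docs["incident_id"] = docs.groupby(["agency", "date", "subject"]).ngroup()
print(docs[["doc_id", "doc_type", "incident_id"]])
```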

Re-ranking search results

Once documents had been grouped into incidents, we used LLMs to generate a summary of each incident, and we used those summaries to rank search results in the database. This proved key to helping researchers find relevant information efficiently.

For example, a researcher searching for “pepper spray” is likely looking for incidents in which pepper spray was deployed by an officer. Because officers list all standard gear—including pepper spray—in nearly every incident report, a simple keyword search returns many irrelevant results. By re-ranking results around LLM-generated summaries, researchers can zero in on incidents where pepper spray was actually used.
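
One common way to implement that kind of re-ranking is to embed the query and each incident summary and sort by similarity, so a summary that actually describes pepper spray being used scores higher than a report that merely lists it as standard gear. Below is a minimal sketch; the summaries are invented, and the sentence-transformers model is one reasonable choice rather than the project’s actual tooling.

```python
# Sketch: re-rank incidents by semantic similarity between a search query and
# LLM-generated incident summaries. The summaries are invented, and the
# sentence-transformers model is one reasonable choice, not necessarily the
# one used in the project.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

summaries = {
    "incident_101": "Officer deployed pepper spray on a handcuffed detainee.",
    "incident_102": "Traffic stop; standard issue gear listed, no force used.",
    "incident_103": "Foot pursuit ending in a takedown; no chemical agents used.",
}

query = "pepper spray"
query_vec = model.encode(query, convert_to_tensor=True)
summary_vecs = model.encode(list(summaries.values()), convert_to_tensor=True)

# Higher cosine similarity means the summary is more relevant to the query.
scores = util.cos_sim(query_vec, summary_vecs)[0]
ranked = sorted(zip(summaries, scores.tolist()), key=lambda pair: -pair[1])
for incident_id, score in ranked:
    print(f"{incident_id}: {score:.2f}")
```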

Redacting sensitive data

LLMs were also key to protecting the privacy of people involved. Law enforcement agencies are supposed to strip away important information about victims before documents are shared with requesters, but we found this was often not done or done incompletely. We received documents with sensitive personal information, like the names of survivors of sexual assault, home addresses of victims or witnesses, and Social Security numbers. While these are public records, our coalition wanted to take steps to reduce any potential harm. We used LLMs to identify personally identifiable information and redact it before publishing it in the database.
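
A stripped-down sketch of that redaction step follows, reusing the same hypothetical ask_llm placeholder as the date example: ask the model to list the personally identifying spans, then blank them out before the text is published. Any real redaction pipeline should still include human review rather than trusting the model alone.

```python
# Sketch: use an LLM to identify personally identifiable information (PII)
# and redact it before publication. `ask_llm` is the same hypothetical
# placeholder as in the date sketch above; any real redaction pipeline
# should still include human review.
import json

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("wire this up to your LLM client of choice")

PII_PROMPT = """List every piece of personally identifiable information in the
text below, such as names of victims or witnesses, home addresses, and Social
Security numbers. Respond with only a JSON list of the exact strings to redact.

Text:
{text}
"""

def redact(text: str) -> str:
    spans = json.loads(ask_llm(PII_PROMPT.format(text=text)))
    for span in spans:
        text = text.replace(span, "[REDACTED]")
    return text
```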

What is “Accurate”?

One of the most critical questions when assessing LLM outputs is accuracy—but accuracy isn’t a single, simple measure. In practice it has many dimensions.

For example:

  • Completeness: Does the model include all the relevant results?
  • Relevance: Does it avoid including irrelevant material?
  • Fidelity: Does it refrain from introducing information that wasn’t in the source data?
  • Interpretation: Does it correctly classify and summarize the material it analyzes?

Each of these is a different way a result can be accurate—or inaccurate.

For the Police Records Access Project, we used human annotation to test and validate the results generated by LLMs. Whenever we wanted to group incidents or produce other structured outputs, human researchers manually reviewed a sample of original documents. They annotated the dates, case types, and other details that we needed our models to extract at scale. This gave us a quantifiable, verifiable benchmark: the “ground truth” against which we could measure model outputs.

Given how important the Police Records Access Project is, we set a deliberately high threshold for precision: the proportion of the time an extracted classification is correct. We prioritized precision over recall, the proportion of all records in a given category that we correctly label as such, because of the high cost we placed on incorrect outputs. We iterated on our models repeatedly until we could meet those high standards.
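
To make those two measures concrete, here is a minimal sketch that scores hypothetical model labels against human annotations. The labels are invented, and scikit-learn is an assumption for this example, not necessarily the tooling we used.

```python
# Sketch: score model outputs against human-annotated "ground truth" labels.
# The labels below are invented; scikit-learn is an assumption for this
# example, not necessarily what the project used.
from sklearn.metrics import precision_score, recall_score

human_labels = [1, 1, 0, 1, 0, 0, 1, 0]  # annotators' judgment for a sample of documents
model_labels = [1, 0, 0, 1, 0, 1, 1, 0]  # the model's classification of the same documents

# Precision: of the records the model labeled positive, how many really are.
print("precision:", precision_score(human_labels, model_labels))
# Recall: of the records that really are positive, how many the model found.
print("recall:", recall_score(human_labels, model_labels))
```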

Data Science in the Age of AI

Artificial intelligence is fundamentally reshaping the field of data science—which means it’s an incredibly exciting time to be working in this field.

But excitement must be matched with responsibility. At HRDAG, that means offering clarity in our findings, protecting sensitive personal information, and being transparent about how we test and validate our results.

Scientific rigor and a clear-eyed understanding of uncertainty aren’t add-ons for us; they are part of our organizational DNA. We believe that nuance is essential not only for fellow scientists who want to use AI responsibly, but also for the broader public who rely on data-driven insights to understand the world.

Thank you for supporting HRDAG’s work to bring careful, transparent science to the fight for human rights.

— TS

This article was written by Tarak Shah, Data Scientist for the Human Rights Data Analysis Group (HRDAG), a nonprofit organization using scientific data analysis to shed light on human rights violations.

Structural Zero is a free monthly newsletter that explores what scientific and mathematical concepts can teach us about the past and the present. Appropriate for scientists as well as anyone who is curious about how statistics can help us understand the world, Structural Zero is edited by Rainey Reitman and written by five data scientists who use their skills in support of human rights. Subscribe today to get our next installment. You can also follow us on Bluesky, Mastodon, LinkedIn, and Threads.

If you get value out of these articles, please support us by subscribing, telling your friends about the newsletter, and recommending Structural Zero to others.

Our work has been used by truth commissions, international criminal tribunals, and non-governmental human rights organizations. We have worked with partners on projects on five continents.

Donate