Exoneration documents, the records acquired during legal processes that seek to rectify miscarriages of justice, offer invaluable insights into wrongful conviction cases. In particular, they illuminate the roles and actions of law enforcement personnel. Yet the sheer volume and lack of structure of these documents pose challenges for researchers, lawyers, and advocates dedicated to transparency and justice.
In 2022, the Innocence Project New Orleans (IPNO) launched the Louisiana Law Enforcement Accountability Database (LLEAD), a consolidation of data from over 500 law enforcement agencies in Louisiana. To date, LLEAD hosts details of over 40,000 allegations of misconduct spanning 194 agencies across 48 of Louisiana's 64 parishes, making it the first state-wide database of its kind. LLEAD is already an essential tool for exoneration work, and including wrongful conviction information in the database would make it even more useful. For example, in Orleans Parish, Louisiana, 78% of wrongful convictions have been linked to law enforcement's failure to share exculpatory evidence with the defense, a rate more than double the national average.
Given this backdrop, we seek to make these collections searchable and useful for lawyers, advocates, and community members to better investigate patterns of police misconduct and corruption. To do so, we rely on a multi-stage process:
Metadata Compilation: We started by compiling a comprehensive CSV index. This structured approach forms the foundation of our file management system, enabling file retrieval and basic deduplication. The metadata we organize in this step includes:
Page classification: The documents in the collection are varied, representing all documents produced or acquired in the course of an exoneration case, with case timelines going back decades. After some internal review and discussions with the IPNO case management team, we narrowed our focus to three types of documents: reports, transcripts, and testimonies.
Page classification involves building a classification model to categorize files (or page sequences within files) into these different types of documents. One approach is to fine-tune a pretrained convolutional neural network to label thumbnail images of document pages. Using thumbnails is advantageous because they are smaller files, resulting in faster processing and reduced computational resource consumption. This makes them an effective approach for retrieving specific types of documents from disorganized collections, as described in Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval. In order to use this technique, we needed training data and a pretrained model. To quickly assemble a training data set for our page classifier, we started by noticing that in many cases the file name indicated the document type. These documents were scanned by many people at different times, so we could not rely on this heuristic for comprehensive categorization of documents, but there was more than enough there to jumpstart our training process. We collected our initial training data by probing filenames for specific search terms, and reviewing and confirming that we had inferred the correct document types from the filenames. Once we had training data, we used FastAI to fine-tune the ResNet34
architecture, pretrained on ImageNet, to identify reports, transcripts, and testimonies based on page thumbnails. With the trained classifier, we were able to measure generalization performance on documents that couldn't be classified via filename, and we were also better able to target additional training data, for example by reviewing pages where the classifier had low confidence about its prediction.
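As a rough illustration of this step, the sketch below fine-tunes a ResNet34 on page thumbnails with fastai (v2); the directory layout, label names, and training schedule are assumptions rather than our exact configuration.
from fastai.vision.all import *

def train_page_classifier(thumbnail_dir="data/thumbnails"):
    # Assumes thumbnails are organized into one folder per document type,
    # e.g. report/, transcript/, testimony/ (hypothetical layout)
    dls = ImageDataLoaders.from_folder(
        thumbnail_dir, valid_pct=0.2, seed=42, item_tfms=Resize(224)
    )
    # Fine-tune a ResNet34 pretrained on ImageNet to label page thumbnails
    learn = vision_learner(dls, resnet34, metrics=accuracy)
    learn.fine_tune(3)
    return learn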
Information Extraction: Currently, we're engaged in extracting structured information from the documents we've identified, and that work is the focus of the current post. Our goal is to extract structured information related to each police officer or prosecutor mentioned in the documents, such as their names, ranks, and roles in the wrongful conviction.
Deduplication: The previous step leaves us with many distinct mentions, but some individuals are mentioned many times, within the same case or across cases. Here we rely on HRDAG's extensive experience with database deduplication to create a unique index of officers and prosecutors involved in wrongful convictions, along with a record of the role or roles each played in the wrongful conviction.
Cross-referencing: In the final stage, we'll cross-reference the officer names and roles we've extracted with the Louisiana Law Enforcement Accountability Database (LLEAD.co). This stage will assist us in identifying other individuals linked with the implicated officers, such as their partners, those co-accused in misconduct complaints, or those co-involved in use-of-force incidents. The list of officers associated with previous wrongful conviction cases can then be cross-referenced with the IPNO's internal data on potential wrongful convictions with the aim of uncovering new instances of wrongful convictions.
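As a preview of what that cross-referencing might look like, here is a minimal sketch; the file names, column names, and exact-match join are placeholders, and the production matching will build on the deduplication work described above.
import pandas as pd

def cross_reference_with_llead(extracted_path="output/extracted_officers.csv",
                               llead_path="data/llead_personnel.csv"):
    # Both file names and column names here are hypothetical
    extracted = pd.read_csv(extracted_path)   # e.g. officer_name, role, case_id
    llead = pd.read_csv(llead_path)           # e.g. officer_name, agency, uid
    # Normalize names before joining; real matching would be fuzzier than this
    extracted["name_key"] = extracted["officer_name"].str.lower().str.strip()
    llead["name_key"] = llead["officer_name"].str.lower().str.strip()
    return extracted.merge(llead, on="name_key", how="left", suffixes=("", "_llead"))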
A primary task in our process is extracting officer information from documents – specifically, the officer's name and the role the officer played in the wrongful conviction. The extraction of such information is crucial for understanding the dynamics and potential lapses that led to the conviction. Given the importance of this task, it's essential to approach it with a methodology that ensures accuracy and comprehensiveness.
We initially considered a regex-based solution for this extraction task. Regular expressions, or regexes, are powerful tools for pattern matching within text. However, as we delved deeper into our data, we realized that the complexity and variability of the content rendered regex less than ideal. While regex excels at identifying specific patterns within text, it often struggles with variations in language and nuances that commonly appear in natural language texts, such as police reports and court transcripts.
Consider the text string from a court transcript reading, "Sergeant Ruiz was mentioned as being involved in the joint investigation with Detective Martin Venezia regarding the Seafood City burglary and the murder of Kathy Ulfers." Such a sentence poses challenges for regex due to its inability to capture semantic context. Without understanding the broader narrative, regex cannot infer that Sergeant Ruiz acted as a lead detective in Kathy Ulfers' murder case.
To further highlight the limitations of regex in handling such tasks, we designed a simple baseline model. Instead of attempting to capture the full scope of officer information extraction, this model focuses solely on extracting officer names as a starting point. This choice was intentional; by narrowing down the task, we hoped to provide a clear example of the strengths and weaknesses of regex in the context of real-world data.
import re

# Baseline: match a law-enforcement title followed by one or two capitalized words
pattern = re.compile(
    r"(detective|sergeant|lieutenant|captain|corporal|deputy|criminalist|technician|investigator"
    r"|det\.|sgt\.|lt\.|cpt\.|cpl\.|dty\.|tech\.|dr\.)\s+([A-Z][A-Za-z]*(\s[A-Z][A-Za-z]*)?)",
    re.IGNORECASE,
)
After implementing our baseline model, we tested its performance on two different sets of data: police reports and court transcripts.
Police Reports Results:
Precision (0.845): Among the instances our model predicted as officer names, 84.5% were indeed officer names. High precision suggests the model is quite reliable in its positive predictions.
Recall (0.518): Our model identified only 51.8% of the actual officer names present in the police reports. Lower recall means that while our predictions are accurate, we are missing a significant number of true officer names.
F1 score (0.614): The F1 score harmonizes precision and recall, giving us a balanced view of the model's performance. At 0.614, it suggests there is room for improvement, especially in capturing more true positives without sacrificing precision.
F-beta score (0.549): This score is a weighted harmonic mean of precision and recall that gives more weight to recall. A score of 0.549 further emphasizes the model's challenges in identifying all true positives.
Court Transcripts Results:
Precision: Similar to the police reports, our model displayed high precision on court transcripts, indicating its reliability in positive predictions.
Recall: However, recall is notably lower on the court transcripts, meaning our model missed more than half of the actual officer names present in these documents.
F1 score: The F1 score for court transcripts is lower than that for police reports, suggesting a more pronounced trade-off between precision and recall in this dataset.
F-beta score: Once again, the F-beta score underscores the need to improve recall without compromising precision.
While our regex-based baseline model exhibits high precision on both datasets, it struggles notably with recall. This indicates that while the names it identifies as officers are likely correct, it misses a substantial number of actual officer names present in the documents. These findings further emphasize the challenges of using regex alone for such a complex task and underscore the need for more advanced techniques that can capture the nuances and variations in language.
An alternative approach is to prompt a generative language model with the document text along with a query describing our required output. One challenge with this approach is that the documents we're processing may be hundreds of pages long, whereas generative models limit the length of the prompt you can supply. We needed a way to pull out of each document just the chunks of text where the relevant officer information appears, so we could provide a more helpful prompt.
We split the problem into two steps: identifying the relevant chunks of text content, and then extracting structured officer information from those chunks. We use LangChain, a framework for composing language-model pipelines, to manage this process, with OpenAI's GPT-3.5-Turbo-16k as the language model powering the pipeline.
For the first step, identifying the relevant chunks of text within the larger document, we used the approach outlined in Precise Zero-Shot Dense Retrieval without Relevance Labels. This approach splits our information retrieval task into multiple steps:
Here is the method we use to generate hypothetical embeddings. The resulting object can be used to embed chunks of text, enabling efficient similarity search over them.
# LangChain imports used throughout this post (2023-era LangChain API)
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain, HypotheticalDocumentEmbedder
from langchain.embeddings import OpenAIEmbeddings

PROMPT_TEMPLATE_HYDE = PromptTemplate(
input_variables=["question"],
template="""
You're an AI assistant specializing in criminal justice research.
Your main focus is on identifying the names and providing detailed context of mention for each law enforcement personnel.
This includes police officers, detectives, deputies, lieutenants, sergeants, captains, technicians, coroners, investigators, patrolmen, and criminalists,
as described in court transcripts.
Be aware that the titles "Detective" and "Officer" might be used interchangeably.
Be aware that the titles "Technician" and "Tech" might be used interchangeably.
Question: {question}
Roles and Responses:""",
)
def generate_hypothetical_embeddings():
    # LLM that writes the hypothetical documents for HyDE
    llm = OpenAI()
    prompt = PROMPT_TEMPLATE_HYDE
    llm_chain = LLMChain(llm=llm, prompt=prompt)

    # Base embeddings applied to the hypothetical documents
    base_embeddings = OpenAIEmbeddings()

    embeddings = HypotheticalDocumentEmbedder(
        llm_chain=llm_chain, base_embeddings=base_embeddings
    )
    return embeddings
The process_single_document function converts an input document into a vector database of chunks. It uses LangChain's RecursiveCharacterTextSplitter to split documents into chunks of 500 characters, maintaining an overlap of 250 characters to ensure contextual continuity.
There are times when the model might inadvertently identify names without clear ties to law enforcement personnel. By cross-referencing the model's output with LLEAD, we believe we will be able to filter out many such false positives. (Some law enforcement personnel mentioned in the documents will be absent from LLEAD, but our current focus is on officers we can track using LLEAD.) On the other hand, we have no way of recovering officer mentions that are not picked up by our extraction process. In light of this, when evaluating the model we are more interested in maximizing recall, ensuring we identify as many genuine law enforcement mentions as we can. To quantify this focus on recall, we employ the F-beta score (with β=2), which weighs recall twice as heavily as precision.
We tested the model using chunk sizes of 2000, 1000, and 500, with corresponding overlaps of 1000, 500, and 250. Based on our evaluations, the optimal configuration is a chunk size of 500 with an overlap of 250. After segmentation, the text is transformed into a high-dimensional space using precomputed embeddings from our hypothetical document embedder. The FAISS.from_documents function aids in this transformation, constructing an indexed document database designed for similarity searches.
import logging

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

logger = logging.getLogger(__name__)

def process_single_document(file_path, embeddings):
    logger.info(f"Processing document: {file_path}")

    # JSONLoader reads the OCR'd document text
    loader = JSONLoader(file_path)
    text = loader.load()
    logger.info(f"Text loaded from document: {file_path}")

    # Split the text into overlapping chunks and index them for similarity search
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=250)
    docs = text_splitter.split_documents(text)
    db = FAISS.from_documents(docs, embeddings)
    return db
In the following sections, we define the core function get_response_from_query(db, query). This function serves as the backbone of our information extraction process, taking in a document database and a query, and returning the system's response to the query.
The process begins by setting up the relevant parameters. We use a prompt template to guide the query and a role template to define the roles we're interested in. We set the temperature parameter to 0 to maximize the determinism of our responses. The k parameter is set to 20, a decision guided by the F-beta score results from our testing phase, instructing the system to select and concatenate the top 20 relevant text chunks from the document corpus. These chunks are then reordered by similarity score to maximize the model's performance: as suggested in the paper Lost in the Middle: How Language Models Use Long Contexts, current language models perform best on retrieval tasks when the relevant data sits at the beginning or end of the context window rather than in the middle.
The relevant chunks of text are then passed to the LLMChain class of the LangChain module as part of the 'run' method. In addition to relevant chunks, the 'run' method also receives the PromptTemplate, RoleTemplate, and the original query.
The LLMChain processes these inputs and generates a structured response to the initial query.
PROMPT_TEMPLATE_MODEL = PromptTemplate(
input_variables=["roles" ,"question", "docs"],
template="""
As an AI assistant, my role is to meticulously analyze court transcripts, consider traditional officer roles, and extract information about law enforcement personnel.
Query: {question}
Transcripts: {docs}
Roles: {roles}
The response will contain:
1) The name of an officer, detective, deputy, lieutenant,
sergeant, captain, coroner, investigator, criminalist, patrolman, or technician -
if an individual's name is not associated with one of these titles, they do not work in law enforcement.
Please prefix the name with "Officer Name: ".
For example, "Officer Name: John Smith".
2) If available, provide an in-depth description of the context of their mention.
If the context induces ambiguity regarding the individual's employment in law enforcement,
remove the individual.
Please prefix this information with "Officer Context: ".
3) Review the context to discern the role of the officer.
Please prefix this information with "Officer Role: "
For example, the column "Officer Role: Lead Detective" will be filled with a value of 1 for officers who were the lead detective.
""",
)
ROLE_TEMPLATE = """
US-IPNO-Exonerations: Model Evaluation Guide
Roles:
Lead Detective
• Coordinates with other detectives and law enforcement officers on the case.
• Liaises with the prosecutor's office, contributing to legal strategy and court proceedings.
Crime Lab Analyst:
• Analyzes various types of evidence gathered during an investigation, including but not limited to DNA, fingerprints, blood samples, and drug substances.
• Prepares detailed reports outlining the findings of their analyses.
"""
from langchain.chat_models import ChatOpenAI

def get_response_from_query(db, query):
    # Set up the parameters
    prompt = PROMPT_TEMPLATE_MODEL
    roles = ROLE_TEMPLATE
    temperature = 0
    k = 20

    # Perform the similarity search
    doc_list = db.similarity_search_with_score(query, k=k)

    # Sort chunks by relevance score, then reorder them so the most relevant chunks
    # appear at the beginning of the context and the least relevant in the middle,
    # following "Lost in the Middle"
    docs = sorted(doc_list, key=lambda x: x[1], reverse=True)
    third = len(docs) // 3
    highest_third = docs[:third]
    middle_third = docs[third:2*third]
    lowest_third = docs[2*third:]
    highest_third = sorted(highest_third, key=lambda x: x[1], reverse=True)
    middle_third = sorted(middle_third, key=lambda x: x[1], reverse=True)
    lowest_third = sorted(lowest_third, key=lambda x: x[1], reverse=True)
    docs = highest_third + lowest_third + middle_third
    docs_page_content = " ".join([d[0].page_content for d in docs])

    # Create an instance of the OpenAI chat model, using the temperature set above
    llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=temperature)

    # Create an instance of the LLMChain
    chain = LLMChain(llm=llm, prompt=prompt)

    # Run the LLMChain and print the response
    response = chain.run(roles=roles, question=query, docs=docs_page_content)
    print(response)

    # Return the response
    return response
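To tie these pieces together, here is a hedged end-to-end sketch using the example query discussed below; the file path is illustrative rather than an actual path from our collection.
embeddings = generate_hypothetical_embeddings()
db = process_single_document("data/convictions/transcripts/example.json", embeddings)

query = (
    "Identify individuals, by name, with the specific titles of officers, sergeants, "
    "lieutenants, captains, detectives, homicide officers, and crime lab personnel in "
    "the transcript. Specifically, provide the context of their mention related to "
    "key events in the case, if available."
)
response = get_response_from_query(db, query)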
For additional context, see the following inputs and outputs:
Query
"Identify individuals, by name, with the specific titles of officers, sergeants, lieutenants, captains, detectives, homicide officers, and crime lab personnel in the transcript. Specifically, provide the context of their mention related to key events in the case, if available."
Relevant Document
(1 of 20 documents identified by the FAISS similarity search as relevant)
Martin Venezia, New Orleans police sergeant. A 16 .01 Sergeant Venezia, where are you assigned now? : - A Second Police District. 13 . And in October, September of 1979 and in Q 19 September and October of 1980, where were you assigned? :1 Homicide division. A. And how long have you been on the police department right now? Thirteen and a half years. A Officer Venezia, when did you or did you ever take over the investigation of ... Cathy Ulfers' murder? A", metadata={'source': '../../data/convictions/transcripts/iterative\(C) Det. Martin Venezia Testimony - Trial One.docx'
Response from the Model
Officer Name: Sergeant Martin Venezia
Officer Context: Sergeant Martin Venezia, formerly assigned to the Homicide Division, took over the investigation of Cather Ulfers murder.
Officer Role: Lead Detective
In our effort to optimize the model's capability to extract officer names from documents, we evaluated it on various parameters. The following tests were run using GPT-4.
Preprocessing Parameters:
Model-specific Parameters:
For evaluating our model's performance, we utilized the F-beta score as our primary metric. Unlike the F1 score, which gives equal weight to precision (correctness) and recall (completeness), the F-beta score allows for differential weighting. We designed our score to weigh recall twice as much as precision, reflecting the importance of accurately spotting relevant information, even if it means occasionally flagging some irrelevant content.
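For reference, this is the score we use; the example values below come from the best police-report configuration in the summary table that follows.
def f_beta(precision, recall, beta=2.0):
    # Weighted harmonic mean of precision and recall; beta=2 counts recall
    # twice as heavily as precision
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

f_beta(0.981308, 0.840000)  # ~0.865, matching the top police-report row below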
Based on our evaluations, our model performed best with a chunk size of 500, an overlap of 250, k = 20, HyDE embeddings enabled, and a temperature of 1.
For police reports, the F-beta score reached 0.864909, while for transcripts, the F-beta score peaked at 0.813397.
Although larger chunk sizes, such as 1000 and 2000, might offer advantages for certain applications, they resulted in lower F-beta scores in our tests. Similarly, larger overlaps of 500 and 1000 reduced performance, despite the potential for more context. The consistent advantage of incorporating HyDE embeddings was evident, underscoring their value to our model.
Another key observation concerned the temperature parameter, which controls the model's level of randomness. With the temperature set to 1, we generally saw higher F-beta scores, especially for identifying officer names in police reports. As we move to the next phase, extracting detailed context about each officer's role within the document, the handling of this parameter will be crucial, because a high temperature can skew results or generate "hallucinated" content.
import pandas as pd

def read_summary():
    # Load the parameter-sweep results and sort by F-beta score
    summary = pd.read_excel("data/overall-summary-with-F1-Fbeta.xlsx")
    summary = summary.sort_values("F_beta", ascending=False)
    return summary
read_summary()
| | chunk_size | chunk_overlap | temperature | k | hyde | filetype | FN | FP | TP | n_files | precision | recall | F1 | F_beta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 500 | 250 | 1 | 20 | 1 | police-report | 20 | 2 | 105 | 5 | 0.981308 | 0.840000 | 0.905172 | 0.864909 |
| 2 | 2000 | 1000 | 1 | 5 | 0 | police-report | 12 | 32 | 71 | 5 | 0.689320 | 0.855422 | 0.763441 | 0.816092 |
| 0 | 500 | 250 | 1 | 20 | 1 | transcript | 3 | 27 | 34 | 4 | 0.557377 | 0.918919 | 0.693878 | 0.813397 |
| 1 | 500 | 250 | 0 | 20 | 1 | police-report | 6 | 56 | 60 | 5 | 0.517241 | 0.909091 | 0.659341 | 0.789474 |
| 8 | 2000 | 1000 | 0 | 5 | 0 | police-report | 15 | 13 | 54 | 5 | 0.805970 | 0.782609 | 0.794118 | 0.787172 |
| 3 | 2000 | 1000 | 1 | 5 | 1 | transcript | 3 | 11 | 17 | 3 | 0.607143 | 0.850000 | 0.708333 | 0.787037 |
| 6 | 1000 | 500 | 0 | 10 | 1 | transcript | 15 | 31 | 57 | 6 | 0.647727 | 0.791667 | 0.712500 | 0.757979 |
| 10 | 2000 | 1000 | 0 | 5 | 1 | transcript | 22 | 18 | 60 | 7 | 0.769231 | 0.731707 | 0.750000 | 0.738916 |
| 7 | 2000 | 1000 | 0 | 5 | 1 | police-report | 13 | 37 | 49 | 5 | 0.569767 | 0.790323 | 0.662162 | 0.733533 |
| 12 | 1000 | 500 | 1 | 10 | 1 | police-report | 34 | 10 | 78 | 5 | 0.886364 | 0.696429 | 0.780000 | 0.727612 |
| 11 | 2000 | 1000 | 1 | 5 | 1 | police-report | 37 | 19 | 86 | 5 | 0.819048 | 0.699187 | 0.754386 | 0.720268 |
| 9 | 500 | 250 | 0 | 20 | 1 | transcript | 19 | 29 | 53 | 6 | 0.646341 | 0.736111 | 0.688312 | 0.716216 |
| 5 | 1000 | 500 | 0 | 10 | 1 | police-report | 13 | 70 | 61 | 5 | 0.465649 | 0.824324 | 0.595122 | 0.714286 |
| 14 | 2000 | 1000 | 0 | 5 | 0 | transcript | 44 | 36 | 50 | 9 | 0.581395 | 0.531915 | 0.555556 | 0.541126 |
| 13 | 1000 | 500 | 1 | 10 | 1 | transcript | 16 | 32 | 19 | 4 | 0.372549 | 0.542857 | 0.441860 | 0.497382 |
After evaluating the model across these parameters, the next phase delved into understanding the model's behavior over iterative runs. Due to the stochastic nature of generative text models, a single document can yield diverse outputs when processed multiple times with the same parameters. This highlighted the challenge of identifying an optimal number of iterations, a balance that ensures comprehensive extraction while remaining cost-efficient. In the interest of cost efficiency, the following tests were run using GPT-3.5-Turbo-16K; the decline in performance relative to the results above can be attributed to this change in model.
To address this, we employed two distinct query strategies:
Multiple Queries Approach: This strategy used six unique queries, each crafted to extract specific facets of the required information. The queries are as follows:
Query 1: Identify individuals, by name, with the specific titles of officers, sergeants, lieutenants, captains, detectives, homicide officers, and crime lab personnel in the transcript. Specifically, provide the context of their mention related to key events in the case, if available.
Query 2: List individuals, by name, directly titled as officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel mentioned in the transcript. Provide the context of their mention in terms of any significant decisions they made or actions they took.
Query 3: Locate individuals, by name, directly referred to as officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel in the transcript. Explain the context of their mention in relation to their interactions with other individuals in the case.
Query 4: Highlight individuals, by name, directly titled as officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel in the transcript. Describe the context of their mention, specifically noting any roles or responsibilities they held in the case.
Query 5: Outline individuals, by name, directly identified as officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel in the transcript. Specify the context of their mention in terms of any noteworthy outcomes or results they achieved.
Query 6: Pinpoint individuals, by name, directly labeled as officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel in the transcript. Provide the context of their mention, particularly emphasizing any significant incidents or episodes they were involved in.
Singular Query Approach: This method employed a single comprehensive query, designed to holistically capture all the desired information facets. We ran the document through the same query repeatedly, which produced slightly different responses each time, and then collected the results together. The query is:
In the context of police reports, a detailed analysis of the singular query method showed that its performance improved with each iteration up to the 4th, with a marked increase in the F-beta score at each step. Past that point, however, the incremental gains diminished, indicating we had hit the point of diminishing returns.
The analysis of court transcripts, on the other hand, offered a more nuanced picture: both the singular and the six-query methods exhibited an upward trend in their performance metrics through the 6th iteration.
Analyzing the results from both police reports and court transcripts gave us confidence in the singular query method. It consistently balanced performance against computational and cost demands, with diminishing gains beyond the 4th iteration in both datasets. Based on this analysis, we selected the singular query strategy, run over 4 iterations, for all document types.
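For concreteness, here is a rough sketch of the singular query strategy over several iterations; it assumes the get_response_from_query function above plus a hypothetical parse_officer_names helper that pulls the "Officer Name:" lines out of a response.
def run_singular_query(db, query, n_iterations=4):
    # The same query yields slightly different responses on each run,
    # so we union the extracted officer names across iterations
    officers = set()
    for _ in range(n_iterations):
        response = get_response_from_query(db, query)
        officers.update(parse_officer_names(response))  # parse_officer_names is hypothetical
    return officers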
Currently, GPT-4's pricing is $0.03 per 1K tokens for inputs and $0.06 per 1K tokens for outputs. In contrast, GPT-3.5-Turbo-4K is priced at $0.0015 per 1K tokens for inputs and $0.002 per 1K tokens for outputs, which makes GPT-4 roughly 20x as expensive for inputs and 30x as expensive for outputs. Given these cost considerations, coupled with the challenges our existing GPT-3.5-Turbo-16K model faced in extracting officer details from documents in our FAISS similarity database, we've pivoted our focus toward the GPT-3.5-Turbo-4K model. While GPT-3.5-Turbo-4K itself isn't new, the capability to fine-tune it was introduced by OpenAI in August 2023, offering a promising avenue for improvement.
In order to address our model's shortcomings, we found ways to efficiently generate additional training data. Using document samples, we extracted details about individual law enforcement officers, their contexts, and roles. Recognizing the potential of GPT-4, we leveraged its capabilities to craft training documents that closely resembled our real-world challenges. We provided GPT-4 with sample documents based on authentic data, enabling it to produce outputs with the stylistic nuances we often encounter—like poor OCR quality, fragmented sentences, inconsistent capitalization, and syntactic inconsistencies.
Here's an example of the training data we generated using GPT-4:
{
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user", "content": "..." },
    { "role": "assistant", "content": "..." }
  ]
}
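For context, here is a hedged sketch of submitting such examples for fine-tuning with the 2023-era openai Python SDK; the JSONL file name is a placeholder.
import openai

# Upload a JSONL file in which each line is a {"messages": [...]} record like the one above
training_file = openai.File.create(
    file=open("data/fine_tune_training.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a fine-tuning job on top of GPT-3.5-Turbo
job = openai.FineTuningJob.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)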
In our fine-tuning experiments, we worked with four dataset sizes: 25, 50, 75, and 100 examples. Analyzing the outcomes, a clear trend emerged: as we increased the dataset size, the model's performance improved incrementally. Even with the constraints of the 4k token limit, which led us to adjust our K parameter from 25 to 15, our model exhibited differentiated performance across document types. It surpassed the GPT-3.5-Turbo-16k model when processing court transcripts and matched its efficiency for police reports. However, as promising as these strides are, they haven't yet reached the capabilities of GPT-4 (See appendix for GPT-4 results).
The results from our current fine-tuning experiments with GPT models provide valuable insights into the potential and limitations of AI in data extraction tasks. Our observations underscore the significance of dataset size and quality, as well as the implications of token constraints on model performance. As we move forward:
We will delve deeper into the interplay between token limits and extraction accuracy, particularly in documents with varying complexities. We'll further investigate the optimal balance between training data volume and model efficiency, exploring potential diminishing returns or inflection points. Given the differentiated performance across document types, our research will also focus on domain-specific fine-tuning to optimize extraction from court transcripts, police reports, and other legal documents.