from IPython.display import Image
Image(filename='../imgs/banner.png')
October 21, 2024
Improving LLM-driven information extraction from the Innocence Project New Orleans' wrongful conviction case files
- Ayyub Ibrahim, Director of Research for LLEAD, Innocence Project of New Orleans
- Bailey Passmore, Data Scientist, Human Rights Data Analysis Group
Introduction
Exoneration documents, the records acquired during the legal processes that seek to rectify miscarriages of justice, offer invaluable insights into wrongful conviction cases. In particular, they illuminate the roles and actions of law enforcement personnel. Yet the sheer volume and lack of structure of these documents pose challenges for researchers, lawyers, and advocates dedicated to transparency and justice.
In this follow-up chapter, we explore how recent advancements in large language model ("LLM") technology, including expanded context windows and more cost-effective high-performing models, impact our extraction strategies. We reexamine our information extraction pipeline by evaluating the performance of proprietary models and open-source alternatives in extracting police officers' names and roles from wrongful conviction case files. By concentrating on the specific task of identifying police officers' names in court documents, we can assess LLM performance in a constrained legal context that is more amenable to large-scale evaluation due to its relative simplicity. This targeted approach differs from more complex tasks, such as writing legal briefs, attempted by generative AI legal research tools like Lexis+ AI and Casetext, which recent studies have shown to be susceptible to high rates of hallucination [1]. Unlike generative tasks where correctness can be subjective, identifying specific entities allows for clear, binary assessments of performance (correct identification vs. incorrect or missed identification), enabling more robust, quantifiable, and familiar evaluation metrics.
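To make that binary assessment concrete, here is a minimal scoring sketch using invented names (not data from the case files); real evaluation would also normalize spelling, rank abbreviations, and partial names before matching:

```python
def score_extractions(predicted, gold):
    """Binary scoring for entity extraction: each true name is either
    found or missed; each predicted name is either correct or a false
    positive. Matching here is exact-string for simplicity."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    return {
        "recall": true_positives / len(gold) if gold else 0.0,
        "precision": true_positives / len(predicted) if predicted else 0.0,
    }

# Hypothetical example: two of three true officers found, no false positives.
scores = score_extractions(
    predicted={"Det. Smith", "Ofc. Jones"},
    gold={"Det. Smith", "Ofc. Jones", "Sgt. Brown"},
)
```

Because every extraction is simply right or wrong, recall and precision summarize performance without any subjective grading.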
In the previous chapter, we explored the application of large language models (LLMs) for structured information extraction from wrongful conviction case files using retrieval augmented generation (RAG). However, recent advancements in LLM technology have necessitated a re-evaluation of our information extraction pipeline. Models like Claude 3 Opus/Sonnet/Haiku with a 200k context window now allow entire documents to fit within a single context window, potentially eliminating the need for retrieving specific document pages. Furthermore, the emergence of cost-effective yet high-performing models like Claude Haiku makes it feasible to iterate over every page in a document, rather than attempting to extract only the most relevant pieces for analysis.
Three main approaches were evaluated:
- Full Context: This processes the entire document at once, utilizing the full context window capabilities of advanced models.
- All Pages: This method processes every page in the document individually.
- Named Entity Recognition (NER) Based Filtering: This uses an NER model to preprocess the document, then analyzes only the pages with the highest concentration of named entities.
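The three strategies can be sketched as follows, assuming a hypothetical `llm_extract(text)` wrapper around the model API and a `count_entities(text)` function from an NER library such as spaCy (neither shown here):

```python
def full_context(pages, llm_extract):
    # One call over the whole document; relies on a large context window.
    return llm_extract("\n".join(pages))

def all_pages(pages, llm_extract):
    # One call per page; feasible with inexpensive models like Claude 3 Haiku.
    names = []
    for page in pages:
        names.extend(llm_extract(page))
    return names

def ner_filtered(pages, llm_extract, count_entities, top_k=5):
    # Rank pages by NER entity density, then query only the densest pages.
    ranked = sorted(pages, key=count_entities, reverse=True)
    return all_pages(ranked[:top_k], llm_extract)
```

The trade-off is between cost and coverage: `full_context` makes one expensive call, `all_pages` makes many cheap calls but misses nothing, and `ner_filtered` bounds the number of calls at the risk of skipping pages the NER model scores poorly.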
Parts in this chapter

This chapter has two main parts.
Part 1 is a review of the methods, metrics, and results. It includes comments on the design of the approach and the role of prompt engineering in mitigating false positives, as well as the financial cost and practical significance of the findings.
Part 2 is a deep dive into two input documents with disparate recall scores. It explores how the characteristics of an underlying document or entity mention could influence the true positive extraction rate and inform opportunities for refining the approach.
Key Findings

Part 1
- The all pages processing approach consistently outperformed other methods, with Claude 3.5 Sonnet achieving the highest recall score of 0.93. Claude 3 Haiku demonstrated impressive performance (0.91 recall) at a significantly lower cost.
- Open-source Mixtral 8x7b and 8x22b models, while scoring lower (0.76 and 0.69 respectively) than the Claude models in the all pages processing approach, offer significant advantages in terms of local execution and data privacy, as they can be run on local hardware without sending sensitive information to external servers.
- The full context approach showed significant limitations in practice despite its theoretical potential.
- Vision-capable models were also tested, with Claude 3.5 Sonnet Vision achieving a recall of 0.67 and Claude 3 Haiku Vision scoring 0.65 in the all pages approach, suggesting potential for vision models in document analysis tasks.
Part 2
- The quality of the text extraction played a meaningful role in how each LLM performed on the entity extraction task for a given document. When more of the original sentence structure and context survived in the extracted text, the LLM had a better opportunity to parse and identify information relevant to the prompt.
- Smaller models (Mixtral 8x7b, Claude Haiku) take a more "fast and loose" approach to entity identification that can lead to hundreds of false positive extractions, but this can also allow them to outperform their larger counterparts by capturing incomplete references at a comparable or lower cost.
Future research

In the time it took for us to research and write this chapter, there were further advancements in LLM technology, and our thinking evolved on how these tools can be incorporated into a principled data processing stream.
- How might iterative LLM-driven summarization tasks aid in large-scale information extraction?
- What would processing documents in non-English languages like Spanish, Tamil, and Sinhala look like? What are the benefits of prompting in the same language as the text?
- In extracting officer names, how does more granular title or rank information affect entity recognition accuracy?
- What tasks are best suited for a larger, more expensive model? What is the relationship between this decision and characteristics such as document length, structure, and prompt design?
Resources
[1] The Future of Computational Law. Cross-disciplinary Research in Computational Law (CRCL), 2(2) (https://journalcrcl.org/crcl/article/view/62/28).
Acknowledgements
Big thank you to HRDAG's Tarak Shah, Dr. Megan Price, and Dr. Patrick Ball for their contributions to this chapter.