Introduction and background

On the left: viewing a scanned PDF acquired through public records requests.
On the right: viewing extracted data from the same document as a JSON file (in Vim).

Administrative paperwork generated by abusive regimes has long been an integral component of human rights investigations. For an international example, see HRDAG’s 2010 report State Coordinated Violence in Chad under Hissène Habré. In that report, investigators analyzed documents found in an abandoned state security force (DDS) headquarters in N’Djamena to demonstrate systemic human rights violations in DDS prisons and to attribute command responsibility to President Hissène Habré.

In the U.S. context, in conjunction with other data sources, documents obtained through public records law can provide important evidence of state violence as well as the paper trail to demonstrate responsibility for human rights violations. Among the many formats such records can take, scanned and redacted documents saved as PDF files are common, and present unique challenges for data processing. In this post, I review the methods and software tools we use to process such collections and get them into an analysis-ready format.

Running examples

  • [UW-CHR] The University of Washington Center for Human Rights sued under the Freedom of Information Act for every form I-213 (recording apprehensions of people) produced by ICE or CBP in the Washington area for a number of years as part of the Center’s Human Rights at Home research regarding immigrant rights.

  • [ACLU-Massachusetts] The ACLU of Massachusetts is reviewing Boston PD SWAT reports (these are the reports filled out before and after tactical and warrant service operations) made public under the 17F order.

  • [II-CPDP] The Invisible Institute has been processing hundreds of thousands of pages of complaints against police and associated investigative files released through the court case Green v. Chicago, for publication through the Citizens Police Data Project and for a detailed investigation into gender-based violence called Beneath the Surface.

Project structure

At HRDAG, we organize our projects as collections of self-contained tasks, as described by my colleague Dr. Patrick Ball in this blogpost and in this lecture. Here is a task breakdown from a recent project (with names and ordering modified for clarity), which we’ll use as a running outline:

individual/foia-xyz
├── import
│   ├── index
│   ├── ocr
│   └── export
├── classify-pages
│   ├── import
│   ├── sample
│   ├── features
│   ├── train
│   ├── classify
│   └── export
├── recover-page-structure
│   ├── import
│   ├── sample
│   ├── segment
│   ├── features
│   ├── train
│   ├── classify
│   └── export
├── extract
│   ├── import
│   ├── parse
│   ├── clean
│   ├── calculated-fields
│   ├── topic-models
│   └── export
└── ...

Each data processing step is framed as a classification problem, and the classification tasks are well-suited to machine learning solutions.

Within tasks that require labeled data, the import step concatenates and pre-processes any hand-labeled data, in addition to bringing in the output(s) from the previous step in the pipeline (the latter can be as minimal as creating a symlink).
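As a minimal sketch of what such an import step can look like (the paths and file names here are illustrative, not from a real project):

pacman::p_load(fs, readr, dplyr, purrr)

# bring in the upstream task's output as a symlink rather than a copy
if (!file_exists("input/corpus.parquet"))
    link_create("../../classify-pages/export/output/corpus.parquet",
                "input/corpus.parquet")

# stack every batch of hand-labeled data into one table
labels <- dir_ls("input/labels", glob = "*.csv") %>%
    map(read_csv, col_types = cols(.default = "c")) %>%
    bind_rows(.id = "labelfile")

write_delim(labels, "output/labels.psv", delim = "|")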

Sample, review, label

At each step in the processing pipeline, we design a way to sample from the document collection. Randomly sampling and reviewing is a disciplined way to learn about the collection, as opposed to manually flipping through a few pages or only looking at the first few pages of each file. At the start, “review” can be informal. But to act on the information we learn through review, we need to convert our notes to a controlled vocabulary – this is the labeling step.

This approach is not unique to digitized collections. See HRDAG’s work with the Guatemalan National Police Archive Project for an example of sampling physical documents laid out in piles strewn throughout a building. In a paper from 2013, Price et al. describe coding the sampled documents for the presence of each of 11 different acts of interest: detentions, deaths, denunciations, physical and psychological abuse, intimidation, disappearances, writs of habeas corpus, interrogations, entering private property, sexual abuse, and surveillance. That list illustrates that the correct set of codes to use for a given project is not inherent to the document collection, but rather depends on the analytical questions we hope to address by analyzing the collection.

That example demonstrates the use of the sample-review-label pattern to understand the content of extracted data. As we’ll see in this post, we can use the same pattern to inform structured information extraction from scanned pages.

Though our sampling relies on randomness, we’re not necessarily trying to collect a representative sample of the entire database. Instead, we’re trying to maximize the variety and informativeness of our overall labeled dataset – we will be using our labeled samples to train machine learning models, and we want the models to have seen the full variety of possible examples. This focus on variety is especially useful with wide-net FOIA request data, where we are searching through large collections for incidents that represent a relatively small proportion of the documents but that have high investigative importance. As we create additional labels, we become better able to target records for review that would provide the most information value – for instance by sampling just records for which our ML classifier scores between .4 and .6 or sampling in a way that over-selects the minority class in order to guarantee a healthy mix of positive and negative examples. For more about targeting training samples, see this blogpost on “active learning”.
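As a sketch of what that targeting can look like in code, assuming a hypothetical scored table holding the current classifier’s score for each unlabeled record:

pacman::p_load(dplyr)

# records the current model is least certain about
uncertain <- scored %>%
    filter(score > 0.4, score < 0.6) %>%
    slice_sample(n = 50)

# or over-select the predicted minority class to balance the next batch
balanced <- scored %>%
    mutate(predicted = score > 0.5) %>%
    group_by(predicted) %>%
    slice_sample(n = 25) %>%
    ungroup()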

Heuristic and machine learning approaches complement each other

The approach outlined here is meant to facilitate the use of machine learning solutions for each stage of data processing. Sometimes, you can accomplish the classification task more readily using hand-rolled logic (aka a “heuristic” classifier). In such cases, it still helps to organize code as I have here, to reduce complexity and make it easier to iterate and improve the pipeline. And even though heuristic models don’t require explicit training data, it’s still helpful to collect a sample of hand-labeled examples to use as acceptance tests.

Indeed, a pattern that has worked often for me is to start out with a heuristic classifier before I have any training data. Take this simplified example (notice there are some OCR errors, which can complicate parsing logic):

GENERAL INFORMATION

Date: 2017-05-09
Time: 4:28pm
Location: 01D
Notes: On 2nd and Main Street, in front of the bank's
ATM

INCIDENT DETAiLS

Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et
dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum.
Stet clita kasd gubergren, no sea takimata sanctus est
Lorem ipsum dolor sit amet.

Assume we want to output something like:

[
  {
    "title": "GENERAL INFORMATION",
    "content": "Date: 2017..."
  },
  {
    "title": "INCIDENT DETAiLS",
    "content": "Lorem ipsum..."
  }
]

As a first pass, it seems like we could just split on all-caps lines. So our starting heuristic model would label every all-caps line as title, and all other lines as content. I use the labels generated by the heuristic model to seed some training data. Note that some of the labels are wrong:

text                                                      predicted
GENERAL INFORMATION                                       title
Date: 2017-05-09                                          content
Time: 4:28pm                                              content
Location: 01D                                             content
Notes: On 2nd and Main Street, in front of the bank's     content
ATM                                                       title
INCIDENT DETAiLS                                          content
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,  content
sed diam nonumy eirmod tempor invidunt ut labore et       content
dolore magna aliquyam erat, sed diam voluptua. At         content
vero eos et accusam et justo duo dolores et ea rebum.     content
Stet clita kasd gubergren, no sea takimata sanctus est    content
Lorem ipsum dolor sit amet.                               content
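A starter heuristic like that is only a couple of lines of code, for example (a sketch, assuming the OCR output sits in a data frame called lines_df with a text column):

pacman::p_load(dplyr, stringr)

# label all-caps lines as titles, everything else as content
lines_df <- lines_df %>%
    mutate(predicted = if_else(str_detect(text, "^[A-Z ]+$"),
                               "title", "content"))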

Now collecting ground-truth labels reduces to correcting the outputs of the starter classifier, which is quicker than starting from scratch:

text                                                      predicted  actual
GENERAL INFORMATION                                       title      title
Date: 2017-05-09                                          content    content
Time: 4:28pm                                              content    content
Location: 01D                                             content    content
Notes: On 2nd and Main Street, in front of the bank's     content    content
ATM                                                       title      content
INCIDENT DETAiLS                                          content    title
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,  content    content
sed diam nonumy eirmod tempor invidunt ut labore et       content    content
dolore magna aliquyam erat, sed diam voluptua. At         content    content
vero eos et accusam et justo duo dolores et ea rebum.     content    content
Stet clita kasd gubergren, no sea takimata sanctus est    content    content
Lorem ipsum dolor sit amet.                               content    content

It’s helpful to continue “pre-filling” new training data with the output of the best current classifier in this way to speed up additional data collection as your classifier improves. For a more principled approach to utilizing heuristics to produce training data, check out Snorkel.

Initial Steps

...
├── import
│   ├── index
│   ├── ocr
│   └── export
...

Creating an index

Before processing any files, we always create a metadata table along with a stable identifier for each file. We use a sha1 hash to confirm file identity and truncate it for a conveniently sized file identifier:

pacman::p_load(digest, qpdf, tidyverse)

inputfiles <- list.files(INPUTDIR, full.names = TRUE, pattern = '\\.pdf$')
index <- tibble(filename = inputfiles) %>%
    mutate(filesha1 = map_chr(filename, digest, algo = 'sha1', file = TRUE),
           fileid   = str_sub(filesha1, 1, 8),   # short, human-friendly file identifier
           n_pages  = map_int(filename, pdf_length))

stopifnot(
    "files should be unique"   = length(unique(index$filesha1)) == nrow(index),
    "fileids should be unique" = length(unique(index$fileid)) == nrow(index)
)

write_delim(index, OUTPUTNAME, delim = '|')

The index should include any additional metadata available about the files. For instance, if files arrived in batches, we include the batch number. Often metadata is encoded in filenames (for example, files all named ‘cases-yyyymm.pdf’), in which case we extract that metadata into explicit columns in the index.
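For the ‘cases-yyyymm.pdf’ example, that extraction might look something like this sketch:

index <- index %>%
    mutate(yyyymm = str_match(basename(filename), "cases-(\\d{6})\\.pdf$")[, 2])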

OCR

We use Tesseract for OCR, often via the tesseract R package. For mixed-format documents, we use the position and size of words on the page, along with their content, to process them accurately. For that reason, we use Tesseract’s option to output data in the hOCR format.

The tesseract documentation goes into some detail about what image pre-processing is and isn’t already performed by Tesseract. For the types of documents we’re looking at, the most helpful pre-processing steps involve removing lines from forms and painting over redaction rectangles with white space (both of which improve text recognition).
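Here is a rough sketch of that kind of pre-processing using the magick package; the redaction coordinates are made up for illustration, and in practice would come from an upstream detection step:

pacman::p_load(magick, tesseract)

# render one page of the scanned PDF as an image
page <- image_read_pdf("input/doc.pdf", pages = 1, density = 300)

# paint over a (hypothetical) redaction rectangle with white
page <- image_draw(page)
rect(100, 200, 400, 260, col = "white", border = "white")
dev.off()

# grayscale conversion helps with noisy scans
page <- image_convert(page, type = "Grayscale")

# OCR with positional output in the hOCR format
hocr <- ocr(page, engine = tesseract("eng"), HOCR = TRUE)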

Page classification

...
├── classify-pages
│   ├── import
│   ├── sample
│   ├── features
│   ├── train
│   ├── classify
│   └── export
...

Examples

Recently, the Invisible Institute’s CPDP documents processing team faced a deluge of documents released through the Green v. Chicago lawsuit. Consisting of investigations of Chicago Police misconduct, the collection contains hundreds of thousands of pages of allegation forms, memos, various police administrative forms, interviews and testimonies, pictures, and even embedded audio files.

One of the CPDP team’s goals was to publish the documents on cpdp.co. However, many of the documents contain identifying information about survivors and witnesses. In order to publish responsibly, the team sought to first organize the documents into document types, and then identify document types that are known to be free of identifying information. For instance, each complaint is associated with a summary “Face Sheet” that is safe to publish. A CPDP volunteer from the Lucy Parsons Labs put together a neural network based classifier that took in both page images and OCR text and classified each document as one among a sizable universe of known document types.

At the other end of the spectrum, the University of Washington Center for Human Rights had sued under the FOIA for every form I-213 (recording apprehensions of people) produced by ICE or CBP in the Washington area for a number of years as part of the Center’s Human Rights at Home research regarding immigrant rights. Here all documents were of the same type, but they arrived in a handful of multi-thousand page PDFs. The I-213 varies in page length depending on the length of the written narrative, so we didn’t know where each document began and ended. We wrote code to classify each page as “front” or “continuation.” It was only after we’d done page classification that we could even answer a question as basic as “how many documents did we receive?” (a continuing insight from working at HRDAG is that counting things is hard).

A basic page sampler

Between qpdf::pdf_subset and qpdf::pdf_combine, we can sample pages from a variety of files and combine them into one PDF for convenient review:

pacman::p_load(qpdf, tidyverse)

# assumes a page-level index: one row per fileid and page number (pg)
sampled <- index %>%
    distinct(fileid, filename, pg) %>%
    nest(data   = -fileid) %>%
    mutate(data = map(data, ~sample_n(.x, min(SAMPSIZE, nrow(.x))))) %>%
    unnest(data)

partials <- sampled %>%
    mutate(input  = filename,
           output = paste0(OUTPUTDIR, '/', fileid, '.pdf')) %>%
    group_by(fileid, input, output) %>%
    summarise(pg      = list(pg),
              .groups = 'drop') %>%
    # pdf_subset() takes input, pages, and output; pass just those, by name
    mutate(done = pmap_chr(list(input = input, pages = pg, output = output),
                           pdf_subset))

outfile <- paste0(OUTPUTDIR, "/combined-sample.pdf")
combined <- qpdf::pdf_combine(partials$done, outfile)

This code uses stratified sampling, sampling per file. We expect some of the variation in the look and content of pages to be captured by differences in file, and this way of sampling allows us to sample equally from longer and shorter files. Again, if we have access to additional metadata that we think would be useful in categorizing pages, we can add logic to our sampling code to ensure a diverse sample.

This is enough to start to get a feel for an unwieldy collection. Every time I’ve done this, I’ve found something that surprised me: random handwritten notes tucked between documents, rotated or upside down pages, unexpected landscape pages, document types that weren’t part of the request, etc.

Designing a classifier: labels

Once you’ve seen enough to be able to identify one or more page types that you want to label, you can modify the above script to also output a spreadsheet with the original fileid and page number plus the page number in the sampled pdf. That makes collecting ground-truth data about page types a breeze:

demo: page labeling

In this example, we are labeling “1” for front-pages, and “0” for pages that are continuations. This allows us to split aggregated PDFs into distinct documents.
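Here is a sketch of that bookkeeping, building on the sampled table from the sampler above; it assumes the partial PDFs are combined in fileid order, which is how the grouped summarise above arranges them:

labelsheet <- sampled %>%
    arrange(fileid) %>%              # matches the page order of combined-sample.pdf
    mutate(sample_pg = row_number(), # page number within the combined sample
           label     = "")           # filled in by hand: 1 = front, 0 = continuation
write_delim(labelsheet, paste0(OUTPUTDIR, "/sample-labels.psv"), delim = "|")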

You can use the collected data for acceptance tests for heuristic classifiers or as training data for ML classifiers.

Unless the types of documents in the collection are known to be constrained to one or two types (as in the UW-CHR example), sampling in this way reveals a bewildering heterogeneity. Start out by focusing on a handful of page and document types that are known to be of investigative interest, and focus on labeling pages as one of those types, or “other.” That can be valuable in itself, and by shaving off the more common and easily recognized types, you gain better visibility into the more (formally) exceptional pages in the collection.

In creating the page classifier for the Green documents, it was difficult to even figure out what the universe of document types was – each new sample produced new document types that hadn’t been seen before. However, the team was able to identify and label a couple dozen known document types and train a classifier. From there, a team member noticed that he could do clustering using the document-type scores as measures – documents that got clustered together were always of the same administrative document type, even if that type had not been seen or labeled before! In this way, he was able to efficiently target new training samples for labeling, resulting in a quickly improving classifier that covered more and more of the documents in the collection.

The docid

By the end of page classification, we are able to assign a new kind of identifier. A docid uniquely identifies a document, which can span multiple pages across one or more files. To count how many documents are in the collection, we count unique docids.
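For the front/continuation case, assigning docids reduces to a cumulative sum over the page classifications. A sketch, assuming a pages table with fileid, pg, and a logical is_front column produced by the classifier:

pacman::p_load(tidyverse)

pages <- pages %>%
    arrange(fileid, pg) %>%
    group_by(fileid) %>%
    mutate(docid = paste(fileid, cumsum(is_front), sep = "-")) %>%
    ungroup()

n_distinct(pages$docid)   # "how many documents did we receive?"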

From here on out, we’ll sample entire documents rather than individual pages. The document is a much more meaningful unit of analysis for non-technical subject matter experts, and so this is a good time to produce some samples for them to review.

Recovering structure from document layouts

...
├── recover-page-structure
│   ├── import
│   ├── sample
│   ├── segment
│   ├── features
│   ├── train
│   ├── classify
│   └── export
...

Administrative documents can mix many different types of content, including form elements, tables, lists, photographs, handwriting, and free-text narratives.

In order to extract useful data, we need to be able to zero in on where that data is physically on the page. Tools like layout-parser and Amazon Textract may prove helpful here.

Examples

Investigators with the Invisible Institute’s ongoing Beneath the Surface investigation into gender violence (see the report “Shrouded in Silence” for more context) have been reading written reports associated with Chicago Police misconduct investigations in order to identify incidents of assault that are not otherwise visible in structured data (because of the way such incidents are coded). They had been doing this research by FOIA-ing for documents one incident at a time until the Green v. CPD data provided them, for the first time, with a collection of tens of thousands of such reports.

We already had access to structured data associated with these investigations in analyzable formats through cpdp.co. So for Beneath the Surface we focused on just extracting written descriptions of allegations from the documents, rather than a more exhaustive scrape. Utilizing the aforementioned page classifications, we identified a number of report types that include descriptions of allegations, and then created heuristic classifiers to classify each line of text as “allegation description” or “other.”

In contrast, the ACLU of Massachusetts has been reviewing Boston PD SWAT after action incident reports made public under the 17F order. For this project, we needed to support a wide variety of queries and investigations, and so sought to extract a large number of data elements from different document types and combine them into a standard schema. To facilitate this extraction, we classified pre-segmented regions of text using categories such as “section header”, “table header”, “table row”, “form element” and so on (there were between 12 and 46 such categories for different document types).

Segmenting the page into homogeneous regions

In terms of page structure, ideally we’d like a classifier to label each word on the page with an element identifier (that groups together words that belong to the same element, such as a multi-word form label) and the category associated with that element (“table heading”, “form element”, etc.). In practice, it’s a lot easier to use some layout analysis to pre-group words into elements and design a classifier to work on those pre-grouped elements.

Often, horizontal lines of text spanning the width of the page are an appropriate unit for labeling. In fact, for some documents where layout does not convey that much information (often the case with court transcripts and letters, for example), we can just work with lines of text without ever considering bounding boxes.

Other times, it’s useful to cluster words based on their position. For instance, notice how we’re able to simplify our task through clustering for this example from the ACLU-Massachusetts; the green boxes on the right side show the groupings that resulted from clustering.

We went on to classify each group of words as one of “administrative info,” “warrant info,” and so on. It’s much easier to classify groupings like this, creating features based on presence or absence of specific words in the group, than to work at the level of word or line. For more on working with layout data, see the Appendix to this post.

Modeling tools for sequence labeling

An important insight about the classification task for this stage of the pipeline is that both the inputs and the outputs are sequential: there are significant correlations between adjacent labels (think of how a “table header” line is nearly always followed by a “table row”). We get much better results when using modeling tools that can represent this kind of dependency structure, such as Conditional Random Fields and Long Short-Term Memory networks.

Given the specifics of our situation – small amounts of training data, rich handcrafted features, and predictors that combine both layout/positional information and text content – Conditional Random Fields models have become our go-to choice for this step in data processing. Hannah Wallach’s website contains a number of useful references about CRF models; I particularly appreciated the papers about structured extraction from documents using CRFs.

Turning explorations into usable features

At the start of a line-labeling task, I’ll often do a lot of interactive exploration. For instance, if it looks like a lot of section headings are in all caps, I’ll do something like this in an R console to see what pops up:

lines_df %>%
    filter(str_detect(text, "^[A-Z ]+$")) %>%
    sample_n(15)

# or, if trying to enumerate the universe of headings:
lines_df %>%
    filter(str_detect(text, "^[A-Z ]+$")) %>%
    count(text, sort = TRUE)

Those exploration sessions are valuable for developing intuitions and catching surprises. By keeping a running yaml file of useful searches, you can easily leverage your exploring to generate high quality features for a classifier. For example:

features
├── Makefile
├── hand
│   └── regexes.yaml
├── output
│   ├── corpus-features.parquet
│   └── training-features.parquet
└── src
    └── feats.R

Here hand/regexes.yaml contains your saved exploratory searches:

re_caps: "^[A-Z ]+$"
re_lows: "^[a-z ]+$"
re_numb: "^[0-9]\\."
re_romnum: "^[MDCLXVI]+\\W"
...

Now, just a few lines of code in src/feats.R can calculate a batch of useful features, encoding pattern matches for a window of lines:

pacman::p_load(arrow, dplyr, purrr, stringr, yaml)

docs    <- read_parquet(args$docs)
regexes <- read_yaml(args$regexes)

out <- docs %>%
    arrange(docid, pgno, lineno) %>%
    # one logical column per saved regex (re_caps, re_lows, ...)
    mutate(map_dfc(regexes, ~str_detect(text, .x))) %>%
    # add led/lagged copies so each line "sees" its neighbors
    group_by(docid) %>%
    mutate(across(starts_with("re_"),
                  .fns   = list(nxt = lead, prv = lag),
                  .names = "{.fn}_{.col}")) %>%
    ungroup()

write_parquet(out, args$output)

The resulting table includes columns re_caps, prv_re_caps, nxt_re_caps, re_lows, and so on. It’s helpful, as you explore the collection, to be thinking of tricks like this to transform intuitions into machine-friendly features.

Leave-one-document-out cross-validation

Though we’re labeling at the level of individual lines or other pre-segmented groups of words, our samples are at the level of entire documents. Furthermore, it’s helpful to start with small batches of training data (I’ll start with about a dozen labeled documents), and target new samples (for instance by selecting for labeling those documents where the classifier has low certainty about its outputs) as necessary based on performance on held-out data. To make the most of small amounts of training data, during training and parameter tuning I use Leave One Document Out (LODO) cross-validation. Especially early on, it’s helpful to review the worst-case performance – this can help to inform the next batch of training samples. The ability to work with nested list-columns in R facilitates this kind of grouped sampling. The following code can be extended to iterate over a grid of parameter values to identify the best ones:

pacman::p_load(crfsuite, tidyverse, rsample)

cv <- nest(trainfeats, data = -docid) %>%
    loo_cv() %>%
    mutate(train = map(splits, training),
           test  = map(splits, testing))

results <- cv %>%
    mutate(model      = map(train, crf_fit),
           test_preds = map2(model, test, crf_predict),
           metrics    = map2(test, test_preds, crf_metrics))

The crf_* functions are helpers: crf_fit and crf_predict are wrappers for the associated functions in the crfsuite package, whereas crf_metrics takes ground-truth and predicted labels and tallies the four quadrants of the confusion matrix, which are then used to produce evaluation statistics for each set of candidate parameters.
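For illustration, the tallying inside something like crf_metrics can be as simple as the sketch below, written here to take two vectors of line-level labels (the actual helper works on the nested test sets shown above):

crf_metrics <- function(truth, predicted) {
    tibble(label = sort(unique(truth))) %>%
        mutate(tp = map_int(label, ~sum(truth == .x & predicted == .x)),
               fp = map_int(label, ~sum(truth != .x & predicted == .x)),
               fn = map_int(label, ~sum(truth == .x & predicted != .x)),
               tn = map_int(label, ~sum(truth != .x & predicted != .x)),
               precision = tp / (tp + fp),
               recall    = tp / (tp + fn))
}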

Data extraction from segmented documents

Having segmented a collection of scanned pages into documents, and divided documents into their constituent sections, we move to pulling data into a standardized schema. In the parsing step, we dispatch tables to table-extraction code (see appendix on working with layout data), convert detected form elements into key-value pairs, process lists and unstructured text, and so on. These extracted data become the building blocks for an analysis-ready dataset, but getting there still requires additional work.

...
├── extract
│   ├── import
│   ├── parse
│   ├── clean
│   ├── calculated-fields
│   ├── topic-models
│   └── export
...

Examples

The Boston PD SWAT reports came in multiple layouts, and we designed a software tool that exhaustively pulled out and parsed all data from each document into a separate JSON file. From that point, we needed a lot of custom logic in order to produce a dataset that could drive a human rights investigation.

For example, given the large number of firearms and other pieces of military equipment that the Boston PD documented bringing to each raid (which we know because we extracted those tables), we wanted to be able to identify and flag incidents where the SWAT team encountered children upon entering a home. This information was encoded in a variety of different places and ways.

This example shows the multiple, inconsistent ways that the presence of children was encoded in different reports. Detailed logic goes into each field of an analysis-ready dataset, even after parsing. This logic is driven by analytical needs, and will vary from project to project, even when relying on the same set of documents.

Some reports included a column for age in the table documenting people encountered, so we were able to create an indicator for whether any of the people was under 18. Other documents did not include such a column, or instead recorded date of birth, which was redacted. Additionally, in the planning section documenting the decisions leading up to the raid, some of the reports included an indicator of whether the team expected to encounter children during the raid (in the section “Potential Hazards”). That section was inconsistent, and when it was used it was almost always “Unknown” (a remarkable observation itself!). Still, we were able to identify some additional records that way. Finally, all reports included a written narrative, which sometimes mentioned children encountered (even when that wasn’t indicated anywhere else). We used some regular expression searches to identify these narratives.

Altogether, we had to look in three different places and use some fairly detailed logic in order to create a single true/false indicator column in our standardized output. We went through similar processes to create indicators for the presence of various pieces of military equipment, the use of stun grenades and chemical agents, warrant indicators (e.g. whether a raid was for service of a drug warrant), and the work continues.
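The combined logic looked something like the following sketch; the table name, field names, list-column structure, and regular expressions here are illustrative rather than the actual schema:

pacman::p_load(dplyr, purrr, stringr, tidyr)

swat <- swat %>%
    mutate(
        # 1. a minor listed in the table of people encountered
        minor_listed  = map_lgl(people, ~any(.x$age < 18, na.rm = TRUE)),
        # 2. children anticipated in the "Potential Hazards" planning section
        minor_planned = str_detect(replace_na(potential_hazards, ""),
                                   regex("child", ignore_case = TRUE)),
        # 3. children mentioned in the written narrative
        minor_story   = str_detect(replace_na(narrative, ""),
                                   regex("child|juvenile|year[- ]old",
                                         ignore_case = TRUE)),
        children_present = minor_listed | minor_planned | minor_story)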

This step in data processing once again lends itself to sampling, reviewing, and labeling. For Beneath the Surface, Trina Reynolds Tyler of the Invisible Institute has assembled a team of community organizers, human rights experts, and other community members to read and code extracted allegation descriptions from police misconduct documents. We’re coding the narratives with the following set of categories: “sexual violation,” “home invasion,” “stalking/domestic violence,” “policing parents & children,” “neglect,” “LGBTQIA,” “disabled,” and “use of force.” Then we use topic modeling to create features from the unstructured allegation text. These features are used as inputs to a classifier trained on the labeled documents, which in turn can aid human rights investigators in sifting through the hundreds of thousands of pages of Chicago police misconduct data. This process is similar to the one my colleague Patrick Ball and I described in this earlier blogpost.

Example of unstructured allegation text in a complaint against the Chicago PD. This complaint was officially categorized as “Improper Search of Person.” The Beneath the Surface investigation seeks to raise the visibility of these types of incidents.
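As a sketch of the topic-model feature step for narratives like the one above, assuming an allegations table with a docid and the extracted narrative text (the package choices and number of topics are illustrative):

pacman::p_load(tidytext, topicmodels, dplyr, tidyr)

dtm <- allegations %>%                      # docid + narrative text
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    count(docid, word) %>%
    cast_dtm(docid, word, n)

lda <- LDA(dtm, k = 25, control = list(seed = 19481210))

# per-document topic proportions, used as features for the classifier
topic_feats <- tidy(lda, matrix = "gamma") %>%
    pivot_wider(names_from   = topic,
                values_from  = gamma,
                names_prefix = "topic_")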

Conclusions/next steps

This post describes how to organize and process digitized document collections in order to aid human rights investigations, using various examples of police accountability work in the United States. We’ve seen how sampling and machine learning are invaluable tools in exploring and extracting data from a collection.

In each of our three examples, the document collection is not an exhaustive account of all potential human rights violations. An act of police violence only results in a complaint record if a witness or survivor is able to come forward with the allegation. Harassment of immigrants can take many forms beyond arrests. And SWAT raids are just one of the ways that police might forcefully enter someone’s home. Understanding that our data reflects under-registration and selection bias is key.

Carefully structuring and tagging documents allows investigators to search for relevant information when collections are too large to review comprehensively, and also makes it easier to integrate the documents with data from other complementary sources, giving us a more complete record of police violence.

Acknowledgements

II-CPDP team: Trina Reynolds Tyler, Rajiv Sinclair, Chaclyn Hunt, Ola Ziwa, Gabrielle Bozarth, Sukari Stone, volunteers from the Lucy Parsons Labs

UW-CHR team: Phil Neff, Maria Gargiulo

ACLU-Massachusetts team: Aaron Boxer, Lauren Chambers, Kade Crockford

Thanks also for ongoing discussions with my colleagues Drs. Patrick Ball and Megan Price.

Appendix: Working with layout data

Layout data includes both content (the text) and a bounding box for each element. For instance, Tesseract will output bounding boxes for each word detected on a page in the hOCR format, which also nests words within lines, paragraphs, and blocks.
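Getting from hOCR output to a data frame of word boxes takes only a little XML wrangling. A sketch, assuming hocr holds the hOCR for a single page:

pacman::p_load(xml2, dplyr, stringr)

words <- read_html(hocr) %>%
    xml_find_all("//span[@class='ocrx_word']")

boxes <- tibble(
    id    = xml_attr(words, "id"),
    text  = xml_text(words),
    coord = str_match(xml_attr(words, "title"),
                      "bbox (\\d+) (\\d+) (\\d+) (\\d+)")) %>%
    mutate(x0 = as.integer(coord[, 2]), y0 = as.integer(coord[, 3]),
           x1 = as.integer(coord[, 4]), y1 = as.integer(coord[, 5])) %>%
    select(-coord)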

In situations like this, some information is encoded in each bounding box – the size of the box, the text of the word it contains, the position of the box on the page. But a lot of information is also contained in the spatial relationships among the bounding boxes. It’s helpful to think of the layout as a graph: every bounding box is a node, and we draw an edge between two nodes whenever they are adjacent to one another, as defined by some similarity condition. In the following snippet, boxes2edges takes a data frame of bounding boxes along with a suitable definition of adjacency and returns the edges in the resulting graph. add_groupid builds on that foundation, pulling out connected components from the layout graph.

pacman::p_load(tidyverse, tidygraph)

# compare each box to every box after it, recording an edge whenever the
# pair's similarity exceeds the threshold
boxes2edges <- function(boxes, similarity, threshold) {
    n_box  <- nrow(boxes)
    result <- vector("list", n_box)
    for (index in seq_len(n_box - 1)) {
        ind2         <- index + 1
        box1         <- boxes[index, ]
        boxk         <- boxes[ind2:n_box, ]
        sim          <- similarity(box1, boxk)
        neighbor_ids <- boxk$id[sim > threshold]
        if (length(neighbor_ids) > 0)
            result[[index]] <- tibble(from = box1$id, to = neighbor_ids)
    }
    bind_rows(purrr::compact(result))
}

# label each box with the connected component it belongs to
add_groupid <- function(boxes, ...) {
    edges <- boxes2edges(boxes, ...)
    tbl_graph(edges    = mutate_all(edges, as.character),
              nodes    = boxes,
              directed = FALSE,
              node_key = "id") %>%
    mutate(groupid = group_components()) %>%
    as_tibble("nodes")
}

You can get pretty far by defining similarity in terms of Euclidean distance, but there are other metrics that are especially well-suited to page layouts. Consider this relatively simple example: a region identified by our structure-extraction tools as a table header followed by table rows.

In order to recover the column structure, I can create my bounding box graph using overlap in the x dimension.

overlap <- function(a.start, a.end, b.start, b.end) {
    # overlap between the intervals [a.start, a.end] and [b.start, b.end],
    # scaled by the longer of the two intervals, so the result is in [0, 1]
    unadjusted  <- pmin(a.end, b.end) - pmax(a.start, b.start)
    unadjusted  <- pmax(0, unadjusted)
    max_overlap <- pmax(a.end - a.start, b.end - b.start)
    unadjusted / max_overlap
}

x_overlap <- function(box1, box2) {
    overlap(box1$x0, box1$x1, box2$x0, box2$x1)
}

columns <- boxes %>%
    add_groupid(similarity = x_overlap, threshold = 0) %>%
    mutate(column = letters[groupid]) %>%
    group_by(column) %>%
    summarise(x0 = min(x0), x1 = max(x1), y0 = min(y0), y1 = max(y1),
              .groups = "drop")

The resulting groupings nicely recover the column structure. If rows span multiple lines, we can use similar logic to identify rows.
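For completeness, here is a sketch of the analogous similarity in the y dimension. On its own it groups words into horizontal bands (lines); stitching a multi-line row back together would take one more adjacency condition, such as vertical proximity to the band above:

y_overlap <- function(box1, box2) {
    overlap(box1$y0, box1$y1, box2$y0, box2$y1)
}

bands <- boxes %>%
    add_groupid(similarity = y_overlap, threshold = 0)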