Processing scanned documents for investigations of police violence
Tarak Shah
13 July 2021
Introduction and background
Running examples
Project structure
Sample, review, label
Heuristic and machine learning approaches complement each other
Initial steps
Creating an index
OCR
Page classification
Examples
A basic page sampler
Designing a classifier: labels
The docid
Recovering structure from document layouts
Examples
Segmenting the page into homogeneous regions
Modeling tools for sequence labeling
Turning explorations into usable features
Leave-one-document-out cross-validation
Data extraction from segmented documents
Examples
Conclusions/next steps
Acknowledgements
Appendix: Working with layout data
Introduction and background