New paper in Biometrika, co-authored by HRDAG's Kristian Lum and James Johndrow: Theoretical limits of microclustering in record linkage.
In our work, we merge many databases to figure out how many people have been killed in violent conflict. Merging is a lot harder than you might think.
Many of the database records refer to the same people, so the same victim can appear more than once. We want to identify and link all the records that refer to the same victim so that each victim is counted only once, and so that we can use the pattern of overlap among the databases to do multiple systems estimation.
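To make the two goals concrete, here is a minimal sketch in Python. It is a toy illustration, not the method from the paper or HRDAG's actual pipeline. The records are invented, the matching rule in `link_key` (same case-folded name and same date) is a hypothetical stand-in for real record linkage, and the two-list Lincoln-Petersen estimator is used only as the simplest textbook case of multiple systems estimation.

```python
from collections import defaultdict

# Invented toy records: (source list, name, date of death). Not real data.
records = [
    ("A", "Ana Perez", "1990-03-01"),
    ("B", "ana perez", "1990-03-01"),
    ("A", "Luis Gomez", "1991-07-15"),
    ("B", "Maria Diaz", "1992-01-20"),
    ("A", "Maria Diaz", "1992-01-20"),
    ("A", "Jose Ruiz", "1993-05-09"),
    ("B", "Rosa Mora", "1994-11-30"),
]


def link_key(name, date):
    """Hypothetical matching rule: same case-folded name and same date."""
    return (" ".join(name.lower().split()), date)


def link_records(records):
    """Group records by link key; map each entity to the sources that list it."""
    entities = defaultdict(set)
    for source, name, date in records:
        entities[link_key(name, date)].add(source)
    return dict(entities)


def overlap_counts(entities):
    """Count entities by the exact set of sources they appear in."""
    counts = defaultdict(int)
    for sources in entities.values():
        counts[frozenset(sources)] += 1
    return dict(counts)


def lincoln_petersen(entities, a="A", b="B"):
    """Two-list estimate of the total population: n_a * n_b / n_ab."""
    n_a = sum(1 for s in entities.values() if a in s)
    n_b = sum(1 for s in entities.values() if b in s)
    n_ab = sum(1 for s in entities.values() if a in s and b in s)
    if n_ab == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    return n_a * n_b / n_ab


entities = link_records(records)
print(len(records), "records,", len(entities), "distinct victims")
print("estimated total:", lincoln_petersen(entities))
```

In this toy data, seven records collapse to five distinct people, two of whom appear on both lists. The estimate depends entirely on getting the links right: a missed match inflates the count of distinct people and shrinks the overlap, and both errors push the estimate upward.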
Merging records that refer to the same person is called entity resolution, database deduplication, or record linkage. For definitive overviews of the field, see Scheuren, Herzog, and Winkler, Data Quality ...