In [2]:
from IPython.display import display, Math, Latex
from IPython.display import Image

How do we find duplicates among multiple, giant datasets?

Implementing adaptive blocking for database deduplication, part 1

In our work, we merge many databases to figure out how many people have been killed in violent conflict. Merging is a lot harder than you might think.

Many of the database records refer to the same people--the records are duplicated. We want to identify and link all the records that refer to the same victims so that each victim is counted only once (there are a lot of other reasons we want deduplicated data). This process is called database deduplication or record linkage.

Database deduplication has been a very active field of study in statistics and computer science for decades.[1] It's very hard because records that refer to the same person may have slightly different information. Sometimes names are spelled a little bit differently, sometimes birth or death dates are slightly different, and so forth (see Figure 1 below for an example). The records usually lack social security numbers or some other kind of unique identifying number.

Database deduplication is so hard that in this five-part post, I'm only going to talk about the first step of our framework. At the bottom of this post, I will present an overview of our full approach.

[1]: For definitive overviews of the field, see Scheuren, Herzog, and Winkler, Data Quality and Record Linkage Techniques (2007), and Peter Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection (2012).

Figure 1: a hypothetical database of people, with duplicates

id first name last name birth year sex
1 John Smith 1965 M
2 Jon Smith 1965 M
3 Johnson Ralph 1980 M
4 Jane Smith 1966 F
5 Ray Smith 1985 M
6 Ray Smyth 1986 M
7 Jane Jones 1966 F
8 Ralph Johnson 1980 M
9 J. Smith 1965 M
10 Janie Smith 1966 F
11 Ann Jones 1970 F
12 Amy Jones 1975 F
13 Amelia Jones 1975 F
14 Juan Smith 1965 M
15 Juan Perez 1965 M
16 John Smith 1995 M
17 John Smith 1966 M

This post addresses the first step in database deduplication, called indexing or blocking. The approach here synthesizes several adaptive, learning-based blocking techniques, but I spend a few paragraphs explaining why we're doing it, and how it's done.

This post does not require that you know much about computer science or statistical theory. It will be a detailed exploration, but I'm keeping the math to a minimum. In parts 2, 3, and 4 of the post, there is a good bit of code in python using the pandas and numpy packages. Some ideas can only be expressed in code.

We have done database deduplication in many projects over the years to look at homicides. We've done various combinations of sampling, hand labeling, and automated approaches to study documented killings in Guatemala, Kosovo, Timor-Leste, Colombia, and Syria. In particular, the approach we use now has been directly applied to our recent work in Colombia (see our report) and Syria (for example, in this report). My work on this optimization has been motivated by new data we've received on killings in Syria: there's so much data that we need to improve our methods to manage it.

There are a lot of challenges in deduplication. Documentation often captures slightly different information for each person, so duplicate records are sometimes somewhat different from each other. Figure 1 gives an example of a database (with invented records).

Notice that there are a lot of likely duplicates, for example, records (1, 2), (1, 9), and (2, 9) are probably the same person, and records (4, 10) are also likely the same. Other records are hard to tell, like (1, 14): is Juan the same as John? Or (1, 16) and (1, 17): is there a typo in the year of birth for records 16 or 17? Sometimes there is additional information that allows people to tell whether the records are the same or different, but the additional information may not be usable by a computer.

In very general terms, to find all the duplicates, we have to compare every record in the database to every other record. Unfortunately this can mean a very large number of comparisons. In our Syria project, we have about 360,000 records, including many duplicates. That means that there are around 65 billion possible comparisons between pairs of records.[1] The pairs that might be compared are called the candidate set. Comparing records directly takes time, and a candidate set of over 65 billion pairs is huge, so even if we can use cool techniques from big data and cloud computing, 65 billion comparisons means a lot of computer time and a lot of disk space.

In our current project, using a big computer in the cloud, 100 million comparisons takes about 8 hours and generates a file of about 50 gigabytes. To do all the comparisons would require something like 100 computers running for 50 hours each, and it would create over 32,500 gigabytes of data.

[1] If the database has n records, there are n(n1)2 possible pairwise comparisons.

Make the problem smaller

One way we manage this problem is to reduce the number of comparisons (i.e., the size of the candidate set) by deciding ahead of time to only consider comparisons that might be duplicates. That way we have many fewer comparisons to make. However, we have to choose the pairs we are going to compare without actually comparing them.

We do this by putting the records into subgroups, called blocks, and only comparing pairs of records inside each block. For example, using the data in Figure 1, there are 16 records and 120 possible comparisons. We might only consider pairs of records that share a common first name; this rule is called "blocking on first name." However, looking back at Figure 1, blocking on first name would miss most of the likely duplicates: (1, 2) and (12, 13), for example. We need a better blocking rule.

What if we compare records that share birth year and sex? This creates 9 blocks and only 17 pairs in the candidate set. Most of the likely pairs are included, for example, (1, 2), (1, 9), and (12,13), and it catches a tough one, the reversed first-last name pair (3, 8). Still, it misses the possible birth year typo matches for (1, 15) and (1, 16).

In a bigger dataset, this rule might have two problems. First, it's a very inclusive rule. The way we talk about the efficiency of a blocking rule is with two measures. First, we consider the reduction rate (RR). This is the proportion of possible comparisons that are excluded by the rule. Our birth year and sex rule creates 17 comparisons from the possible 120, so the RR = 1 - (17/120) = 0.858. That might sound good, but if we used a similar reduction rate for the Syria data, we would be left with more than 9 billion comparisons--which is still way, way too many. For data on the size of Syria, we need a RR of 0.999, or more.

Second, this rule by itself doesn't get at enough of the possible comparisons. As we mentioned earlier, there are two pairs with possible typo matches: (1, 15) and (1, 16). We call these missed pairs uncovered, and our goal is to get higher coverage. We need better blocking rules, and finding the best possible blocking rules is what this post is all about.

Later I'll define pairs completeness (PC) which is the second way to measure the quality of a blocking scheme.

How it used to be done

In Ye Olden Tymes (before about 2005), we tried to define blocking rules based on what we knew about the data. We'd say, gosh, I think that most real matches will share birth year and sex or last name. This is a blocking scheme made up of two rules. The pairs we consider will be all the pairs that are generated by either rule.[1] The first rule is made up of two fields that together create a single block ("birth year and sex"); this is called a conjunction. The two rules are joined by an or, which is called a disjunction.[2] When we combine rules into a scheme, we are creating a disjunction of conjunctions. In formal notation, our scheme now looks like this: (birth_yearsex)last_name.

[1] A pair might be generated by more than one blocking rule. Toward the end of this post, I suggest a method for eliminating repeated comparisons of the same pairs.

[2] I get the terms conjunction, disjunction, reduction rate, and pairs completeness from Michelson and Knoblock (2006), although I think they were in use before that article. It's a great article, and if you're interested in blocking, you should read it.

Is it optimal? Or is it even any good at all?

But how do we know this is the best possible blocking scheme? Or even whether it's a very good? I mean, we could get a perfect RR simply by comparing zero pairs, but that would be a terrible scheme because it would miss every single duplicate. We need a way to measure whether a blocking scheme does a good job covering the likely pairs while still getting a good RR.

Let's say a person has looked at a little bit of the data, and she's told us about a few of the duplicates. This is called training data, and it's very important stuff, but it tends to be expensive to get because a person has to concentrate hard to make these decisions. It's tedious, time-consuming, and vital work. In our hypothetical example, our person (the matcher) told us about the set of pairs below. In this listing, the inner brackets show two records, and the red letters tell us whether the human matcher said it was a match ('Y') or not ('N').

In [1]:
training_data = [(1, 2, 'Y'), (1, 9, 'Y'), (12, 13, 'Y'), (1, 16, 'Y'), (3, 8, 'Y'), (14, 15, 'N')]

Note that the matcher told us that one pair, (14, 15), is not a match, as well as five matches. Given this training data, we can evaluate a blocking scheme for pairs completeness (PC). PC is measured as the number of known matched pairs covered by the scheme divided by the total number of known matched pairs. Our scheme (birth_yearsex)last_name covers all of the known matches, so PC=1.0. Yay! A good scheme has both high PC and high RR.

But do we know all the matches? What if the training data isn't complete? This problem may just keep getting harder.

Alas, it is going to get harder before it gets easier

We shouldn't get too comfortable: our scheme misses the likely match (5, 6) because the human matcher hasn't seen it yet. She hasn't told us that there might be matches where the first and last names are transposed. This is a bit distracting from our larger point here, but this observation highlights another of the challenges of blocking and deduplication in general. Our deduplication will only be as good as the training data. In particular, we need lots and lots of diverse approaches to gathering training data.

For the Syria project, we have over 215,000 pairs a human has labeled as matches or non-matches. If a person can label a match in 2 seconds, that's around 1,200 hours of work that our matchers have done!

Our data partners for this project include the Violations Documentation Centre, the Syrian Network for Human Rights, the Syrian Center for Statistics and Research, and the Damascus Center for Human Rights Studies. We admire and are deeply grateful to these groups for their work. We get lists of name of victims from our partners, and our team of matchers creates the training data by labeling pairs of records as matches or non-matches.

We are constantly thinking of new ways to generate pairs for our matcher to label. The sheer volume of training data does not guarantee that it is representative, but we've put a lot of effort into finding diverse pairs to label. At this point, we're reasonably confident that most kinds of matches are represented in the training data.

In the second part of this post, I'll talk about creating useful fields for blocking, and how to setup the training data. In the third part, I get to the good stuff, showing one adaptive blocking approach that uses the training data to create a scheme that maximizes both PC and RR. In the fourth part, I'll show how to generate pairs efficiently. Then in the last part, I'll talk about a few extensions to this approach and some gotchas to watch for.

Keep in mind that blocking is only the first part of database deduplication. Our framework starts with blocking, but it has a lot of other steps, shown below in Figure 2. At some point, I'll write a post about how all the pieces fit together. A description of an earlier iteration of our matching approach is here. Meanwhile, Peter Christen's book can provide a great starting point.

Figure 2: HRDAG's database deduplication framework

In [6]: