How do we find duplicates among multiple, giant datasets?

Implementing adaptive blocking for database deduplication, part 5

In part 1 of this post, I described the problem of database deduplication. We need to compare every record in the database to every other record, and for databases of substantial size, there are simply too many pairs to compare directly. We use blocking techniques to reduce the number of pairs to consider, called the candidate set of pairs. In part 2, I showed how to set up the data so that we can search for rules defining the optimum candidate set. In part 3, I showed how to do the search, and in part 4, I showed how to generate pairs for the candidate set efficiently from the rules.

Here in part 5, I'll talk about the implications of this work, and give some pointers on things to watch out for.

Substantive concerns

When we're putting together a blocking scheme, the overriding requirement is always to be sure that we're including all (or very nearly all) the pairs that have any chance at all of being matches. Therefore it's very important to think about the fields that get used in the rules: do the rules have a mix of fields? Are the rules general enough to cover obvious cases?

The most important consideration in this step is to pay attention to the pairs that don't get covered. That is, remember the last 384 pairs that the optimized disjunction of conjunctions couldn't cover: They're interesting! Why don't they get covered? They are likely to be somehow different from the pairs that do get covered. It's dangerous to think that the existing training pairs are perfectly representative of all the positive matches in the full data.
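Finding the uncovered pairs is itself a simple computation. Here's a minimal sketch, where the field names, rules, and records are invented for illustration (they are not the fields from this series):

```python
# Minimal sketch: find training pairs not covered by a blocking scheme.
# A scheme is a disjunction (list) of conjunctions of single-field rules.
# All field names and example records here are hypothetical.

def same_field(field):
    """A single-field blocking rule: records must agree exactly on `field`."""
    return lambda a, b: a.get(field) == b.get(field)

def conjunction(*rules):
    """A pair is covered by a conjunction only if every rule agrees."""
    return lambda a, b: all(rule(a, b) for rule in rules)

def covered(scheme, a, b):
    """A scheme covers a pair if any one of its conjunctions does."""
    return any(conj(a, b) for conj in scheme)

scheme = [
    conjunction(same_field("year"), same_field("region")),
    conjunction(same_field("surname_soundex")),
]

training_pairs = [
    ({"year": 1994, "region": "N", "surname_soundex": "S530"},
     {"year": 1994, "region": "N", "surname_soundex": "S531"}),
    ({"year": 1994, "region": "S", "surname_soundex": "A120"},
     {"year": 1995, "region": "N", "surname_soundex": "B250"}),
]

# The uncovered pairs are the interesting ones: refer them for human review.
uncovered = [(a, b) for a, b in training_pairs if not covered(scheme, a, b)]
```

Here the second pair disagrees on every blocked field, so it lands in `uncovered`; in real data, pairs like it are the ones worth a closer look.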

It's a great idea to search for more pairs that are like the ones that are not covered, and refer those pairs for additional human labeling. We have found repeatedly that reviewing pairs by hand is essential for developing a good model.

Technical thoughts

One completely different approach for finding an optimal blocking scheme uses genetic programming (GP) to evolve an optimal scheme. Evangelista et al. (2010) present an application of GP to the blocking problem, and they report substantially better results than either the Bilenko et al. (2006) or the Michelson and Knoblock (2006) method.

It seems to me that it would be relatively easy to implement evolutionary blocking using the Python package DEAP by Fortin et al. (2012). In DEAP's language, the single-field blocking rules are primitives, while the and() and or() connectors are operators.

There are two major advantages that occur to me for evolutionary blocking relative to the implementation given in the previous parts. First, this approach would not need to pre-generate the multi-field conjunctions that I spent part 2 defining. Instead it would begin with random conjunctions and evolve better schemes directly.
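To make the idea concrete, here is a rough pure-Python hill-climbing sketch (much simpler than DEAP's GP machinery; the fields, training pairs, and fitness function are all made up for illustration). Each matched training pair is encoded as the set of fields on which its records agree, as in part 2 of this series:

```python
import random

# Toy sketch of evolving a blocking scheme (not DEAP; hypothetical data).
# A conjunction is a frozenset of field indices; a scheme is a list of them.
# Each matched pair is encoded as the set of fields its records agree on.

random.seed(0)
N_FIELDS = 6
matched_pairs = [frozenset(s) for s in ({0, 1, 2}, {1, 2, 3}, {0, 3, 4}, {2, 4, 5})]

def covers(conj, pair):
    # a conjunction covers a pair when the pair agrees on all its fields
    return conj <= pair

def fitness(scheme):
    # reward covered training pairs; penalize loose (short) conjunctions,
    # which stand in for the cost of a large candidate set
    coverage = sum(any(covers(c, p) for c in scheme) for p in matched_pairs)
    looseness = sum(1.0 / len(c) for c in scheme)
    return coverage - 0.5 * looseness

def mutate(scheme):
    # flip one random field in or out of one random conjunction
    new = [set(c) for c in scheme]
    f = random.randrange(N_FIELDS)
    random.choice(new).symmetric_difference_update({f})
    return [frozenset(c) for c in new if c] or [frozenset({f})]

# start from random single-field conjunctions and climb
best = start = [frozenset({random.randrange(N_FIELDS)}) for _ in range(2)]
for _ in range(200):
    candidate = mutate(best)
    if fitness(candidate) >= fitness(best):
        best = candidate
```

A real GP implementation would evolve a population of expression trees with crossover as well as mutation, but even this toy shows the key property: the search starts from random rules and never needs the conjunctions enumerated in advance.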

Second, the evolutionary approach could generate conjunctions of arbitrary complexity. The direct approach is limited to 3-field conjunctions because the number of possible conjunctions grows combinatorially with the number of fields. Given the 35 fields considered here, there are 595 2-field conjunctions, 6545 3-field conjunctions, and 52360 4-field conjunctions; testing every conjunction of 4 or more fields would simply take too long.
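Those counts are just binomial coefficients, the number of ways to choose k of the 35 fields, which the Python standard library can confirm:

```python
from math import comb

# how many k-field conjunctions can be built from 35 single-field rules
n_fields = 35
counts = {k: comb(n_fields, k) for k in (2, 3, 4)}
# counts == {2: 595, 3: 6545, 4: 52360}
```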

Note that all the conjunctions found in part 3 were three-field conjunctions. I suspect that some four-field conjunctions, or even bigger ones, would be helpful, but it's computationally impractical to try them all. GP might be able to find them, and so this strikes me as a promising direction for future research.


All this work addresses only one step: blocking. However, by reducing the number of pairs that get considered, blocking makes the subsequent steps -- comparison, classification, clustering, and post-processing -- a whole lot easier.

I haven't discussed the training loop (TS-import, TS-integrate, TS-compare, TS-train, and TS-draw) at all. I haven't talked about how we train a model, classify pairs, or cluster the pairs into matched groups. These are the more substantively challenging parts of database deduplication. More to come!

I am very grateful to my HRDAG colleagues Dr Megan Price, Dr Anita Gohdes, Dr Jeff Klingner, and Dr Scott Weikart for their thoughtful and meticulous comments on these notes.
