In [1]:
from IPython.display import Image

How do we find duplicates among multiple, giant datasets?

Implementing adaptive blocking for database deduplication, part 3

In part 1 of this post, I described the problem of database deduplication. We need to compare every record in the database to every other record, and for databases of substantial size, there are simply too many records to address directly. We use blocking techniques to reduce the number of pairs to consider, the candidate set of pairs. In part 2, I showed how to set up the data so that we can search for the optimum candidate set. In this part, I'll show how to do the search.

First let's define a helper function, then let's get the data we created in the previous step.

In [2]:
def human_format(num):
    ''' print big numbers in a readable way '''
    num = int(num)
    magnitude = 0
    while num >= 1000:
        magnitude += 1
        num /= 1000.0
    # add more suffixes if you need them
    return '%.2f%s' % (num, ['', 'K', 'M', 'G', 'T', 'P'][magnitude])
In [3]:
import pandas as pd
import pickle

record_pairs = pd.read_pickle("record_pairs.pkl")    # these are the pairs
rule_data = pickle.load(open("rule_data.pkl", 'rb')) # this maps the rule names to databases columns
cols2rule_id = pickle.load(open("c2ri.pkl", "rb" ))  # this maps the database columns to a rule name

# this is what the rules look like: a pair of database ids, 
# then there are a series of columns r1, r2, ..., r6578 that 
# tell us for each i, if rule_i covers this pair. 
record_pairs.ix[0:5, 0:5]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 105248 entries, 0 to 105247
Columns: 7176 entries, hash1 to r7174
dtypes: bool(7174), object(2)
memory usage: 722.5+ MB
hash1 hash2 r1 r2 r3
0 026d2c7778d93df3f48fa45b1cacda444a70617c 04d1ae98b97df6f7caca7d215df7c4fb46df861d False False False
1 03eeb4f97354afec6d0b3ae24ec94127fadf3acf 076190a0722517a93ebe0c9d516b34b71f562053 True True True
2 004ed9c0ad4129fe5caee0bfd4919d6ea9aa4983 08a385a1a8908add2296c91856ffc98b732ea134 True True True
3 0451edf6591ceef32298efd90d1dee6b2355237c 08de574296490085aa4e1983762bc5a319c2ac85 True True True
4 057db6288401d07b1f8c7b837feb0d1bbf28427a 09544ca1257d99907c1dbd0aa07569b92c35aad2 True True True
5 02ee43b6933a9ce6c6f2a9ab723e2b138344997c 0a444a192323dc8183c8c32035b8933d9da5daaf True True False

We have 6578 possible blocking rules. At this point, we don't need to worry about what they mean. We just need to come up with a way to find the optimum disjunction of rules. Here's the algorithm we'll use, in pseudocode:

while we still have pairs
   with all the pairs we have, pick the 'best' rule
   add the best rule to a list of rules to use
   remove all the pairs covered by the best rule

The real challenge here is to define the best rule. This point determines the outcome of the optimization. How do we balance pairs completeness and the reduction rate? I explored a number of different approaches, and it's worth talking about them before diving in.

Defining the "best" rule

I define the best rule as the rule that maximizes the gain that this rule has relative to the current state of the rules. Gain means some measure of how well a rule covers all the positive-match training pairs while at the same time covering as few as possible total pairs or non-matching training pairs. There are several ways to define gain.

One way to define the gain requires that in the training data, we retain both the positive matches (which I do here) and the negative matches, that is, the pairs that a human labeler has defined as not a match (which I'm ignoring in my approach). The key article that defines this approach is Bilenko et al., "Adaptive Blocking: Learning to Scale Up Record Linkage" (2006).

(Two asides: I really enjoyed studying the implementation of this approach in a package called dedupe (2014) written by F. Gregg and D. Eder from Datamade. And Bilenko et al. use a tree-based search rather than a simple while loop, but the logic is similar.)

In this approach, imagine a set of training pairs $P$. Some rule $R$ covers a subset of the pairs in $P$; call the covered subset $p$. In $p$, the pairs that are labeled as a match are $p^+$ and those that are labeled as a non-match are $p^-$. The gain of a rule $R$ is therefore

$gain(R, P) = \frac{p^+}{p^-}$

The idea comes from an assumption that in some statistical sense, the training pairs $P$ represent the universe of all possible pairs in the data--not just the training pairs, but all the pairs. In particular, the non-match pairs $P^-$ represent the massive number of non-matching pairs in the full set of all possible pairs. Therefore the covered non-matching pairs in the covered subset, $p^-$, tell us about how many non-matching (and therefore unneeded) pairs are likely to be included in the candidate set generated by $R$. It took me a while to figure this out.

In the i-th iteration, the algorithm chooses the rule that maximizes gain with the current set of training pairs $P_i$. Then it discards $p_i$ (the pairs covered by $R_i$), $P_{i+1} = P_i - p_i$. Then back to recalculating gain for all the possible rules given $P_{i+1}$. Repeat until some stopping point.

By choosing a rule that maximizes the ratio of the matching to non-matching pairs, we're more or less directly optimizing a balance of pairs completeness and reduction ratio.

I found that this approach produced very good results. However, it required that I retain the non-matching pairs, and that I do calculations with those pairs. That doubled the computation time required by the setup step, which is significant. It also doubled the time in the optimization steps, but they are very fast so that's not important. In other approaches, especially the genetic programming approach described in part 5, using this definition of gain is definitely the right way to do it.

What I actually did

Another way to balance pairs completeness and the reduction ratio is to use them directly, which is the approach proposed by Michelson and Knoblock (2006). Recall that with some set of training data, for some rule, pairs completeness (PC) is the number of matched pairs covered by the rule divided by the total number of matched pairs (which is $P_0$, that is, $P$ before anything has been discarded):

$PC_i = \frac{\|p_i^+\|}{\|P_0^+\|}$

For the same rule and the same set of pairs, the reduction rate (RR) is the complement of the number of pairs generated by the rule divided by the total number of possible pairs (remember that the number of total records we're deduplicating is $n$):

$RR_i = 1 - \frac{\|R_i\|}{n(n-1)/2}$

I slightly modified Michelson and Knoblock's approach by changing the denominator of the RR, and by defining gain as the mean of the two measures:

$gain(R_i, P_i) = \frac{PC_i + RR_i}{2}$

As you may recall from part 1 of this post, there are about 65 billion possible pairs in the Syria data. Instead of considering 65 billion as the denominator, I chose 100 million as the largest number of pairs I would accept. This value is called toobig in the code below (100e6 in notation).

In [4]:
from collections import namedtuple

def add_step_compute_rule(remaining_pairs, rule2data, rule_q): 
    ''' finds best next step 
          -- remaining_pairs = the remaining pairs
          -- rule2data[rule_id] = {paircount: someint}
          -- rule_q = list of previously found rule_ids 
        returns: rule
    toobig = 100e6  # if a rule includes more than toobig pairs, we ignore it.
    # it's nice to store each rule's information in a namedtuple so that you can 
    # watch the rule being sought. good for debugging, and for comparing differnet 
    # ways of defining gain. 
    result = namedtuple('Result', ['gain', 'rule_id', 'pos', 'cols', 'paircount'])
    rule_set = set(rule_q)
    results = list() 
    for rule_id in rule_data:
        # ignore rules that have too many pairs
        if rule_data[rule_id]['paircount'] > toobig:
        # ignore rules we've seen before 
        if rule_id in rule_set:
        # paircount is the total pairs generated by each rule
        # we recorded paircount for every rule in the setup step.
        paircount = float(rule_data[rule_id]['paircount'])
        RR = 1 - paircount/toobig
        # note this is the fastest way to count True in a vector or pandas col.
        positive_pairs = float(np.count_nonzero(remaining_pairs[rule_id]))
        PC = positive_pairs/len(remaining_pairs)
        gain = (RR + PC)/2

        this_rule = result(gain=gain, rule_id=rule_id, pos=positive_pairs, 
    results = sorted(results, key=itemgetter(0), reverse=True)
    return results[0].rule_id

So add_step_compute_rule picks the best rule for the next step. Here's how to use the function above to find the best combination of rules. The answer is printed below.

In [5]:
import time
import numpy as np
from operator import itemgetter
import copy

start = time.time()
remaining_pairs = record_pairs.copy(deep=True)

toobig = 100e6  # 100M; in real code, we wouldn't define this twice! 
total_pairs = 0 
rule_q = list()
uncovered_cnt = None
# results[[step, nRR, nPC]]  
results = [(0, 0.0, 1.0)]  # we'll keep track of these just to graph them
ctr = 0

# here I've decided to limit the search to fewer than 100 rules and 
# no more than `toobig` pairs. 
while len(rule_q) < 100 and total_pairs < toobig:
    ctr += 1 
    print("** next round has {} recs to cover; {} pairs needed so far.".format(
            len(remaining_pairs), human_format(total_pairs)))
    next_rule = add_step_compute_rule(remaining_pairs, rule_data, rule_q)
    # now *drop* all the covered pairs. this means *keep* the records that are 
    # *not* covered by the rule we just found. 
    uncovered_vec = np.logical_not(remaining_pairs[next_rule])
    remaining_pairs = remaining_pairs.loc[uncovered_vec]

    total_pairs += rule_data[next_rule]['paircount']
    nRR = total_pairs/toobig   # nRR is 1 - (reduction rate)

    if uncovered_cnt is len(remaining_pairs):
        print("no progress, breaking." )
    uncovered_cnt = len(remaining_pairs) 
    nPC = float(uncovered_cnt)/len(record_pairs)  # nPC is 1 - (pairs completeness) 
    results.append((ctr, nRR, nPC))
    rule_name = rule_data[next_rule]['cols']

end = time.time()
print("time is = {:2.2f}s".format(end - start)) 
** next round has 105248 recs to cover; 0.00 pairs needed so far.
** next round has 9909 recs to cover; 2.01M pairs needed so far.
** next round has 1969 recs to cover; 17.27M pairs needed so far.
** next round has 968 recs to cover; 24.87M pairs needed so far.
** next round has 670 recs to cover; 31.05M pairs needed so far.
** next round has 560 recs to cover; 36.06M pairs needed so far.
** next round has 524 recs to cover; 36.67M pairs needed so far.
** next round has 484 recs to cover; 39.45M pairs needed so far.
** next round has 461 recs to cover; 40.83M pairs needed so far.
** next round has 442 recs to cover; 42.51M pairs needed so far.
** next round has 432 recs to cover; 42.94M pairs needed so far.
** next round has 424 recs to cover; 43.32M pairs needed so far.
** next round has 408 recs to cover; 46.35M pairs needed so far.
** next round has 404 recs to cover; 46.83M pairs needed so far.
** next round has 399 recs to cover; 47.76M pairs needed so far.
** next round has 397 recs to cover; 48.00M pairs needed so far.
** next round has 395 recs to cover; 48.39M pairs needed so far.
** next round has 394 recs to cover; 48.62M pairs needed so far.
no progress, breaking.
time is = 8.53s
In [6]:
# let's have a look at the progress
%matplotlib inline

import matplotlib.pyplot as plt
r = pd.DataFrame.from_records(results, columns=['step', 'nRR', 'nPC'])

fig = plt.figure()
fig.suptitle("Figure 1: Total pairs (/100M, blue) vs uncovered examples (/105K, green), by rule",
axes = fig.add_subplot(111)
axes.plot(r.step, r.nRR, color='blue')
axes.plot(r.step, r.nPC, color='green')

Figure 1 above shows the relationship, as each new rule is added, between the total number of candidate pairs generated (in blue) and the number of labeled matches that are not yet covered (in green). The number of candidate pairs is expressed as a fraction of 100 million, and the labeled matches not covered are expressed as a fraction of the total number of labeled matches. Note that the vertical axis is in the log scale.

The point of this graph is that these two measures trade off. As we add candidate pairs, we can cover more of the labeled matches. We want the green line to go to zero, and the blue line to stay as low as possible. It's clear that there's not much improvement in the green line after about the fourth or fifth rule.

In [10]:
# this bit of code generates a LaTeX representation of the rules
conjs = []
for rule_id in rule_q:
    conj = "({})".format(' \wedge '.join(rule_data[rule_id]['cols']))
    conj = conj.replace("_", "\_")

scheme = ' \\vee \\\\ '.join(conjs)
del conjs, conj 
# print(scheme)
# print(rule_q)

Done! We found the optimum rules

Using 20 conjunctions, this algorithm was unable to find 394 pairs. That means it managed to cover 104855 out of 105248 positive matches with about 48.6 million candidate pairs. It uses a rich combination of name, date, and location variables to get here. This is a great outcome! The final rules are shown below.

Disjunction of Conjunctions

$(name\_en\_meta\_first \wedge name\_en\_meta\_last \wedge year) \vee \\ (date\_of\_death \wedge governorate \wedge sorted\_name\_en\_lsh) \vee \\ (governorate \wedge name\_en\_first4 \wedge yearmo) \vee \\ (governorate \wedge name\_last5 \wedge year) \vee \\ (governorate \wedge sortedname\_en\_meta\_last \wedge name\_lsh) \vee \\ (sex \wedge date\_of\_death \wedge name\_last5) \vee \\ (governorate \wedge sortedname\_en\_last5 \wedge yearmo) \vee \\ (date\_of\_death \wedge sortedname\_meta\_first \wedge sortedname\_en\_meta\_last) \vee \\ (location \wedge name\_first5 \wedge year) \vee \\ (sex \wedge sortedname\_en \wedge name\_en\_no\_mo\_lsh) \vee \\ (name\_first5 \wedge name\_en\_last4 \wedge yearmo) \vee \\ (governorate \wedge sortedname\_en\_first5 \wedge sortedname\_en\_last5) \vee \\ (date\_of\_death \wedge sortedname\_en\_meta\_first \wedge sortedname\_lsh) \vee \\ (age\_group \wedge name\_en\_meta\_first \wedge yearmo) \vee \\ (location \wedge sortedname\_en\_meta\_first \wedge sortedname\_en\_meta\_last) \vee \\ (location \wedge name\_en\_meta\_first \wedge yearmo) \vee \\ (governorate \wedge sortedname \wedge yearmo) \vee \\ (name \wedge age \wedge age\_group)$

Most of the coverage is in the first few rules. The last few rules don't add much additional pairs completeness, and they add a lot of additional pairs. We might want to cut off the rules after step 9. At that point, we covered 104797/105248 positive pairs with about 41 million pairs.

This is the basic approach. In the next post, I'll explain how to efficiently turn this result into pairs. There are many extremely inefficient ways to do it, and a few somewhat inefficient ways. The Dedoop project has recently published an algorithm for emitting the pairs very efficiently, and I'll show a python implementation of the Dedoop approach.

In the final post, I'll wrap up with a discussion of what this result means, and what kinds of thinking it should trigger in real-world work.