from IPython.display import Image
Image(filename='banner.png')
by Patrick Ball
In part 1 of this post, I described the problem of database deduplication. We need to compare every record in the database to every other record, and for databases of substantial size, there are simply too many records to address directly. We use blocking techniques to reduce the number of pairs to consider, the candidate set of pairs. In part 2, I showed how to set up the data so that we can search for the optimum candidate set. In this part, I'll show how to do the search.
First let's define a helper function, then load the data we created in the previous step.
def human_format(num):
    ''' print big numbers in a readable way '''
    num = int(num)
    magnitude = 0
    while num >= 1000:
        magnitude += 1
        num /= 1000.0
    # add more suffixes if you need them
    return '%.2f%s' % (num, ['', 'K', 'M', 'G', 'T', 'P'][magnitude])
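As a quick sanity check, here's what the helper produces on a couple of illustrative values (these calls aren't part of the analysis, just a demonstration):
print(human_format(48600000))   # 48.60M
print(human_format(105248))     # 105.25K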
import pandas as pd
import pickle
record_pairs = pd.read_pickle("record_pairs.pkl") # these are the pairs
rule_data = pickle.load(open("rule_data.pkl", 'rb'))  # this maps the rule names to database columns
cols2rule_id = pickle.load(open("c2ri.pkl", "rb"))  # this maps the database columns to a rule name
# this is what the pairs data looks like: each row has a pair of database ids,
# then a series of columns r1, r2, ..., r6578 that tell us,
# for each i, whether rule_i covers this pair.
print(record_pairs.info())
record_pairs.iloc[0:5, 0:5]
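To make that structure concrete: each rule column is boolean, so summing one of them counts how many of the labeled pairs that rule covers. Here r1 is the first rule column named in the comment above; the count itself will depend on your data.
print("labeled pairs covered by rule r1:", int(record_pairs['r1'].sum()))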
We have 6578 possible blocking rules. At this point, we don't need to worry about what they mean. We just need to come up with a way to find the optimum disjunction of rules. Here's the algorithm we'll use, in pseudocode:
while we still have pairs:
    with all the pairs we have, pick the 'best' rule
    add the best rule to a list of rules to use
    remove all the pairs covered by the best rule
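Before we worry about how to define "best", here is a minimal, self-contained sketch of that greedy loop on toy data. It uses a deliberately simplified notion of "best" (the rule that covers the most remaining pairs); the real criterion, gain, is worked out below.
import pandas as pd

# toy data: 6 labeled pairs, 3 candidate rules; True = the rule covers that pair
toy = pd.DataFrame({'r1': [True, True, False, False, True, False],
                    'r2': [False, True, True, True, False, False],
                    'r3': [False, False, False, False, False, True]})

chosen, remaining = [], toy.copy()
while len(remaining) > 0:
    # simplified 'best': among unused rules, the one covering the most remaining pairs
    unused = [c for c in remaining.columns if c not in chosen]
    if not unused:            # every rule has been used
        break
    coverage = remaining[unused].sum()
    best = coverage.idxmax()
    if coverage[best] == 0:   # nothing left that any rule can cover
        break
    chosen.append(best)
    remaining = remaining[~remaining[best]]   # keep only the uncovered pairs
print(chosen)   # ['r1', 'r2', 'r3'] for this toy table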
The real challenge here is to define the best rule. This choice determines the outcome of the optimization. How do we balance pairs completeness and the reduction rate? I explored a number of different approaches, and it's worth talking about them before diving in.
I define the best rule as the rule that maximizes the gain that this rule has relative to the current state of the rules. Gain means some measure of how well a rule covers all the positive-match training pairs while at the same time covering as few as possible total pairs or non-matching training pairs. There are several ways to define gain.
One way to define the gain requires that in the training data, we retain both the positive matches (which I do here) and the negative matches, that is, the pairs that a human labeler has defined as not a match (which I'm ignoring in my approach). The key article that defines this approach is Bilenko et al., "Adaptive Blocking: Learning to Scale Up Record Linkage" (2006).
(Two asides: I really enjoyed studying the implementation of this approach in a package called dedupe (2014) written by F. Gregg and D. Eder from Datamade. And Bilenko et al. use a tree-based search rather than a simple while loop, but the logic is similar.)
In this approach, imagine a set of training pairs $T$. Some rule $r$ covers a subset of the pairs in $T$; call the covered subset $T_r$. In $T_r$, the pairs that are labeled as a match are $T_r^+$ and those that are labeled as a non-match are $T_r^-$. The gain of a rule is therefore

$$\mathrm{gain}(r) = \frac{|T_r^+|}{|T_r^-|}$$
The idea comes from an assumption that in some statistical sense, the training pairs represent the universe of all possible pairs in the data--not just the training pairs, but all the pairs. In particular, the non-match pairs represent the massive number of non-matching pairs in the full set of all possible pairs. Therefore the covered non-matching pairs in the covered subset, $T_r^-$, tell us about how many non-matching (and therefore unneeded) pairs are likely to be included in the candidate set generated by $r$. It took me a while to figure this out.
In the $i$-th iteration, the algorithm chooses the rule $r_i$ that maximizes gain with the current set of training pairs $T_i$. Then it discards $T_{r_i}$ (the pairs covered by $r_i$), so that $T_{i+1} = T_i \setminus T_{r_i}$. Then back to recalculating gain for all the possible rules given $T_{i+1}$. Repeat until some stopping point.
By choosing a rule that maximizes the ratio of the matching to non-matching pairs, we're more or less directly optimizing a balance of pairs completeness and reduction ratio.
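To make that concrete, here is a minimal sketch of this kind of gain, assuming we had kept both positive and negative labels. The boolean arrays covered and is_match are hypothetical stand-ins for what the setup step would produce, and the +1 in the denominator is just my guard against dividing by zero, not part of the definition above.
import numpy as np

def labeled_gain(covered, is_match):
    ''' ratio of covered matching pairs to covered non-matching pairs '''
    pos = np.count_nonzero(covered & is_match)
    neg = np.count_nonzero(covered & ~is_match)
    return pos / (neg + 1.0)   # +1 avoids division by zero

# hypothetical toy example: a rule covering 3 matches and 1 non-match
covered  = np.array([True, True, True, True, False])
is_match = np.array([True, True, True, False, True])
print(labeled_gain(covered, is_match))   # 3 / (1 + 1) = 1.5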
I found that this approach produced very good results. However, it required that I retain the non-matching pairs, and that I do calculations with those pairs. That doubled the computation time required by the setup step, which is significant. It also doubled the time in the optimization steps, but they are very fast so that's not important. In other approaches, especially the genetic programming approach described in part 5, using this definition of gain is definitely the right way to do it.
Another way to balance pairs completeness and the reduction ratio is to use them directly, which is the approach proposed by Michelson and Knoblock (2006). Recall that with some set of training data, for some rule $r$, pairs completeness (PC) is the number of matched pairs covered by the rule divided by the total number of matched pairs (which is $|T|$, that is, before anything has been discarded):

$$PC(r) = \frac{|T_r|}{|T|}$$
For the same rule and the same set of pairs, the reduction rate (RR) is the complement of the number of pairs generated by the rule divided by the total number of possible pairs (remember that if the total number of records we're deduplicating is $n$, there are $n(n-1)/2$ possible pairs):

$$RR(r) = 1 - \frac{\text{pairs generated by } r}{n(n-1)/2}$$
I slightly modified Michelson and Knoblock's approach by changing the denominator of the RR, and by defining gain as the mean of the two measures:

$$\mathrm{gain}(r) = \frac{PC(r) + RR(r)}{2}$$
As you may recall from part 1 of this post, there are about 65 billion possible pairs in the Syria data. Instead of considering 65 billion as the denominator, I chose 100 million as the largest number of pairs I would accept. This value is called toobig in the code below (written as 100e6 in Python).
from collections import namedtuple
from operator import itemgetter
import numpy as np

def add_step_compute_rule(remaining_pairs, rule_data, rule_q):
    ''' finds best next step
    params:
      -- remaining_pairs = the labeled pairs not yet covered by any chosen rule
      -- rule_data[rule_id] = dict with 'paircount' and 'cols' for each rule
      -- rule_q = list of previously found rule_ids
    returns: rule_id of the best next rule
    '''
    toobig = 100e6  # if a rule includes more than toobig pairs, we ignore it.
    # it's nice to store each rule's information in a namedtuple so that you can
    # watch the rule being sought. good for debugging, and for comparing different
    # ways of defining gain.
    result = namedtuple('Result', ['gain', 'rule_id', 'pos', 'cols', 'paircount'])
    rule_set = set(rule_q)
    results = list()
    for rule_id in rule_data:
        # ignore rules that have too many pairs
        if rule_data[rule_id]['paircount'] > toobig:
            continue
        # ignore rules we've seen before
        if rule_id in rule_set:
            continue
        # paircount is the total pairs generated by each rule;
        # we recorded paircount for every rule in the setup step.
        paircount = float(rule_data[rule_id]['paircount'])
        RR = 1 - paircount/toobig
        # note this is the fastest way to count True in a vector or pandas col.
        positive_pairs = float(np.count_nonzero(remaining_pairs[rule_id]))
        PC = positive_pairs/len(remaining_pairs)
        gain = (RR + PC)/2
        this_rule = result(gain=gain, rule_id=rule_id, pos=positive_pairs,
                           cols=rule_data[rule_id]['cols'],
                           paircount=human_format(paircount))
        results.append(this_rule)
    results = sorted(results, key=itemgetter(0), reverse=True)
    return results[0].rule_id
So add_step_compute_rule picks the best rule for the next step. Here's how to use the function above to find the best combination of rules. The answer is printed below.
import time
import numpy as np
from operator import itemgetter

start = time.time()
remaining_pairs = record_pairs.copy(deep=True)
toobig = 100e6  # 100M; in real code, we wouldn't define this twice!
total_pairs = 0
rule_q = list()
uncovered_cnt = None
# results[[step, nRR, nPC]]
results = [(0, 0.0, 1.0)]  # we'll keep track of these just to graph them
ctr = 0
# here I've decided to limit the search to fewer than 100 rules and
# no more than `toobig` pairs.
while len(rule_q) < 100 and total_pairs < toobig:
    ctr += 1
    print("** next round has {} recs to cover; {} pairs needed so far.".format(
        len(remaining_pairs), human_format(total_pairs)))
    next_rule = add_step_compute_rule(remaining_pairs, rule_data, rule_q)
    rule_q.append(next_rule)
    # now *drop* all the covered pairs. this means *keep* the records that are
    # *not* covered by the rule we just found.
    uncovered_vec = np.logical_not(remaining_pairs[next_rule])
    remaining_pairs = remaining_pairs.loc[uncovered_vec]
    total_pairs += rule_data[next_rule]['paircount']
    nRR = total_pairs/toobig  # nRR is 1 - (reduction rate)
    if uncovered_cnt == len(remaining_pairs):
        print("no progress, breaking.")
        break
    uncovered_cnt = len(remaining_pairs)
    nPC = float(uncovered_cnt)/len(record_pairs)  # nPC is 1 - (pairs completeness)
    results.append((ctr, nRR, nPC))
    rule_name = rule_data[next_rule]['cols']  # column names of the rule we just chose
end = time.time()
print("time is = {:2.2f}s".format(end - start))
# let's have a look at the progress
#
%matplotlib inline
import matplotlib.pyplot as plt
r = pd.DataFrame.from_records(results, columns=['step', 'nRR', 'nPC'])
fig = plt.figure()
fig.suptitle("Figure 1: Total pairs (/100M, blue) vs uncovered examples (/105K, green), by rule",
             fontsize=16)
axes = fig.add_subplot(111)
plt.xlabel("rule")
axes.set_yscale('log')
axes.plot(r.step, r.nRR, color='blue')
axes.plot(r.step, r.nPC, color='green')
plt.show()
Figure 1 above shows the relationship, as each new rule is added, between the total number of candidate pairs generated (in blue) and the number of labeled matches that are not yet covered (in green). The number of candidate pairs is expressed as a fraction of 100 million, and the labeled matches not covered are expressed as a fraction of the total number of labeled matches. Note that the vertical axis is on a log scale.
The point of this graph is that these two measures trade off. As we add candidate pairs, we can cover more of the labeled matches. We want the green line to go to zero, and the blue line to stay as low as possible. It's clear that there's not much improvement in the green line after about the fourth or fifth rule.
# this bit of code generates a LaTeX representation of the rules
conjs = []
for rule_id in rule_q:
    conj = "({})".format(r' \wedge '.join(rule_data[rule_id]['cols']))
    conj = conj.replace("_", r"\_")
    conjs.append(conj)
scheme = r' \vee \\ '.join(conjs)
del conjs, conj
# print(scheme)
# print(rule_q)
Using 20 conjunctions, this algorithm was unable to find 394 pairs. That means it managed to cover 104,855 out of 105,248 positive matches with about 48.6 million candidate pairs. It uses a rich combination of name, date, and location variables to get here. This is a great outcome! The final rules are shown below.
Most of the coverage is in the first few rules. The last few rules add little additional pairs completeness, while adding a lot of additional candidate pairs. We might want to cut off the rules after step 9. At that point, we had covered 104,797 of the 105,248 positive pairs with about 41 million candidate pairs.
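If we do decide to stop at step 9, trimming the rule list is all it takes. Here is a sketch; rule_q_trimmed is just a new name for the shortened list:
# keep the first 9 rules; the later rules add many candidate pairs for little coverage
rule_q_trimmed = rule_q[:9]
for rule_id in rule_q_trimmed:
    print(rule_id, rule_data[rule_id]['cols'])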
This is the basic approach. In the next post, I'll explain how to efficiently turn this result into pairs. There are many extremely inefficient ways to do it, and a few somewhat inefficient ways. The Dedoop project has recently published an algorithm for emitting the pairs very efficiently, and I'll show a Python implementation of the Dedoop approach.
In the final post, I'll wrap up with a discussion of what this result means, and what kinds of thinking it should trigger in real-world work.