by Patrick Ball
In the previous part of this series, I described the problem of database deduplication. We need to compare every record in the database to every other record, and for databases of substantial size, there are simply too many pairs to compare directly. We use blocking techniques to reduce the number of pairs to consider, which we call the candidate set of pairs.
In this part, I'll describe how to prepare the data and the training data for blocking. I'm not going to share the data because some of it is confidential. In part 1, I pointed to the original data sources, and some of them are public. It's hard to extract a subset of non-confidential records that would still be meaningful for the blocking scheme I'm working on here, so I'm going to focus on the process. I'll show the structure of the data, and in part 4, I'll dig into the content of one record.
Let's look first at the data. Here's the data I'm starting with:
# First, here is the data to be matched
import pandas as pd
records = pd.read_pickle("rp.pkl")
# records.info()
Let's take the fields in groups:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 361649 entries, 0 to 361648
Data columns (total 38 columns):
hashid 361649 non-null object
name 361649 non-null object
name_en 361649 non-null object
sex 357970 non-null object
age 65960 non-null object
age_group 55385 non-null object
date_of_death 361649 non-null object
governorate 361649 non-null object
location 260128 non-null object
The hashid field is our unique identifier for each record. In an earlier stage of the processing, we use R's digest function to calculate a sha1 on the most important fields we use for matching. We do this for each of the input records. When we get updates from the data partners, the hashid is still valid for the records that have not changed in any of the fields we use, but it changes when the data partners update any of the fields we're watching. This is important because our training data depends on the hashid of the original records.
# R code:
hashfields <- c("dataset", "record_id", "name", "province", "date_of_death")
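The production code for this step is in R, as in the snippet above. As a rough Python sketch of the same idea (the separator and the exact concatenation scheme here are assumptions, not our actual digest call):
# rough Python sketch of the hashid idea; the real pipeline uses R's digest
# package, and the "|" separator here is an illustrative assumption
import hashlib

def make_hashid(rec):
    hashfields = ("dataset", "record_id", "name", "province", "date_of_death")
    key = "|".join(str(rec[f]) for f in hashfields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()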
The name field is the originally recorded name, in Arabic, and name_en is the same field run through Google Translate. The sex, date_of_death, and age fields are what they seem, and age_group is one of child, adult, or elderly. The governorate is the Syrian province in which the killing occurred, and the location is descriptive text that explains a bit more about where it happened.
Here's the next block:
name_en_no_mo 361649 non-null object
sortedname 361649 non-null object
sortedname_en 361649 non-null object
The first field, name_en_no_mo, contains the name with variants of the name "Mohammed" removed. This name is very common among Syrian men, so it tends to create too many false links. The sortedname field takes all the name pieces (first name, family name, other names) and sorts them lexicographically. This addresses the sometimes-jumbled combinations of names.
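Neither derivation is shown in this notebook; a minimal sketch of the idea might look like the following (the list of "Mohammed" spellings is a placeholder, not the real list of variants):
import re

# placeholder list of spellings; the real set of "Mohammed" variants is not shown here
MO_RE = re.compile(r'\b(mohammed|mohammad|muhammad|mohamed)\b', re.IGNORECASE)

def make_name_en_no_mo(name_en):
    '''drop "Mohammed" variants, then squeeze whitespace'''
    return " ".join(MO_RE.sub(" ", name_en).split())

def make_sortedname(name):
    '''sort the name pieces lexicographically to undo jumbled orderings'''
    return " ".join(sorted(name.split()))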
The next block looks at subsets of names, prefixes and suffixes:
name_first5 361649 non-null object
name_last5 361649 non-null object
sortedname_first5 361649 non-null object
sortedname_last5 361649 non-null object
name_en_first4 361649 non-null object
name_en_last4 361649 non-null object
sortedname_en_first5 361649 non-null object
sortedname_en_last5 361649 non-null object
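These prefix and suffix fields are plain string slices; in pandas the derivation is roughly this (a sketch, not the original code):
# sketch: prefix and suffix fields are just string slices of the name fields
records["name_first5"] = records["name"].str[:5]
records["name_last5"] = records["name"].str[-5:]
records["name_en_first4"] = records["name_en"].str[:4]
records["name_en_last4"] = records["name_en"].str[-4:]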
Metaphones convert text fields into a sequence of phonemes that capture how the names sound when pronounced out loud. This is useful for finding matches between names that are spelled differently, but in ways that preserve the pronunciation, e.g. Jane / Jayne. I use this implementation. We're not sure if metaphones work properly with Arabic (see this warning), but whatever they're doing, sometimes the blocks work. In blocking, it's not at all important whether the blocks have any substantive meaning. All that matters is whether the resulting blocks group the greatest possible number of trained pairs together. Here are the names of variables that contain metaphone calculations.
name_meta_first 361649 non-null object
name_meta_last 361649 non-null object
sortedname_meta_first 361649 non-null object
sortedname_meta_last 361649 non-null object
name_en_meta_first 361649 non-null object
name_en_meta_last 361649 non-null object
sortedname_en_meta_first 361649 non-null object
sortedname_en_meta_last 361649 non-null object
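As a generic illustration of what a metaphone encoding does (using the jellyfish library here purely as an example; it is not necessarily the implementation referenced above):
# illustration only: jellyfish is an example library, not necessarily the
# implementation referenced above
import jellyfish

print(jellyfish.metaphone("Jane"), jellyfish.metaphone("Jayne"))  # both map to the same code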
One of the biggest challenges in information retrieval generally is grouping strings that are somehow similar to each other without comparing the strings directly. This is the problem we're managing here, but it's a general problem. I like a technique called locality sensitive hashing (LSH). Explaining LSH is well beyond the scope of this post, but see Rajaraman and Ullman, Mining of Massive Datasets (2011), especially ch.3.
For an application of LSH to blocking, see Steorts et al., Blocking Methods Applied to Casualty Records from the Syrian Conflict (2015). Side note: although I very happily collaborate with Steorts et al., I developed the implementation of LSH used here independently; that is, these are not the same fields that Steorts et al. have reported elsewhere.
In the section below, I have pre-computed LSH blocks for the fields for the name in Arabic, sorted name in Arabic, name in English, sorted name in English, and the name in English after deleting the string "Mohammed."
name_lsh 361649 non-null int64
sortedname_lsh 361649 non-null int64
name_en_lsh 361649 non-null int64
sorted_name_en_lsh 361649 non-null int64
name_en_no_mo_lsh 361649 non-null int64
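The details of the LSH construction aren't shown in this post, but for readers who want a flavor of the general idea, here is a toy sketch of minhashing character shingles and hashing one band of the signature down to a single integer block id. To be clear, this is illustrative only: the shingle size, the number of hash functions, and the collapse to a single integer are assumptions, not the construction used for the *_lsh fields listed above.
# toy LSH sketch: minhash over character shingles, then hash one band to an int
import hashlib

def shingles(s, k=3):
    '''k-character shingles of a whitespace-stripped, lowercased string'''
    s = "".join(s.split()).lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def minhash_signature(s, num_hashes=20):
    '''toy minhash: one salted sha1 per "hash function", keep the minimum'''
    shs = shingles(s)
    return [min(int(hashlib.sha1((str(i) + sh).encode('utf-8')).hexdigest(), 16)
                for sh in shs)
            for i in range(num_hashes)]

def lsh_block(s, band_size=5):
    '''toy block id: hash the first band of the signature to an integer'''
    band = repr(minhash_signature(s)[:band_size]).encode('utf-8')
    return int(hashlib.sha1(band).hexdigest(), 16) % (2 ** 31)
Strings that share many shingles tend to share minimum hash values, so they land in the same band and get the same block id; that is all we need from a blocking field.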
Finally, we include fields that are subsections of other fields: year and yearmo are pieces of the date; region is a purely geographic grouping. We do not mean to imply anything about connections by ethnic group, politics, history, armed group control, or anything else substantively interesting about space.
year 361649 non-null object
yearmo 361649 non-null object
region 361649 non-null object
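These are also simple derivations; roughly (a sketch, assuming date_of_death is an ISO-style YYYY-MM-DD string, and with a placeholder for the governorate-to-region lookup):
# sketch: year and yearmo are slices of an ISO-style date string
records["year"] = records["date_of_death"].str[:4]
records["yearmo"] = records["date_of_death"].str[:7]

# region is a lookup from governorate to a purely geographic grouping;
# the actual mapping is not shown, so this dict is a placeholder
governorate_to_region = {}  # e.g. {"Aleppo": "north", ...} -- hypothetical
records["region"] = records["governorate"].map(governorate_to_region)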
With the blocking fields in place, let's get the training data: the pairs of records that a human being has reviewed and labeled as a match or a non-match.
# Next, here is the training data
pairs = pd.read_csv("../input/training-pairs.csv", sep='|')
pairs = pairs[pairs.match == 'Y'] # only use the matches, the labeled non-matches are ignored
pairs = pairs[['hash1', 'hash2']] # we only need the id's, we can discard the metadata
Our first task is to make a table which includes all the data for all the pairs we're reviewing. Each field in the records table above will have two copies, field_1 and field_2, which contain the values of that field for the two records in the pair (the records pointed to by hash1 and hash2).
record_pairs = pd.merge(records, pairs, left_on='hashid', right_on='hash1')
record_pairs = pd.merge(records, record_pairs, left_on='hashid', right_on='hash2',
suffixes=('_1', '_2'))
record_pairs.drop(['hashid_1', 'hashid_2'], axis=1, inplace=True)
# record_pairs.info()
The result of this join is a dataframe like this:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 105248 entries, 0 to 105247
Data columns (total 76 columns):
name_1 105248 non-null object
name_en_1 105248 non-null object
sex_1 104972 non-null object
<snip> ...
name_en_lsh_2 105248 non-null int64
sorted_name_en_lsh_2 105248 non-null int64
name_en_no_mo_lsh_2 105248 non-null int64
hash1 105248 non-null object
hash2 105248 non-null object
dtypes: int64(10), object(66)
memory usage: 61.8+ MB
To check whether two records fall into the same block defined by, say, the LSH value calculated on the English name with the string "Mohammed" removed, we compare name_en_no_mo_lsh_1 to name_en_no_mo_lsh_2. Now just a tiny bit more setup.
# here's a list of the columns we'll use in the blocking assessment.
cols_to_use = ['name', 'name_en', 'sex', 'age', 'age_group', 'date_of_death',
'governorate', 'location',
'sortedname', 'sortedname_en', 'name_en_no_mo',
'name_first5', 'name_last5', 'sortedname_first5', 'sortedname_last5',
'name_en_first4', 'name_en_last4', 'sortedname_en_first5', 'sortedname_en_last5',
'name_meta_first', 'name_meta_last', 'sortedname_meta_first',
'sortedname_meta_last', 'name_en_meta_first', 'name_en_meta_last',
'sortedname_en_meta_first', 'sortedname_en_meta_last',
'name_lsh', 'sortedname_lsh', 'name_en_lsh',
'sorted_name_en_lsh', 'name_en_no_mo_lsh',
'year', 'yearmo', 'region']
assert set(records.columns) > set(cols_to_use)
print(len(cols_to_use))
def human_format(num):
    ''' helper function to see big numbers more easily '''
    num = int(num)
    magnitude = 0
    while num >= 1000:
        magnitude += 1
        num /= 1000.0
    return '%.2f%s' % (num, ['', 'K', 'M', 'G', 'T', 'P'][magnitude])
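For example, human_format(105248) returns '105.25K', which makes large pair counts much easier to scan.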
Now we turn to the training data to check it against the possible rules. The idea is to generate a lot of candidate blocking rules; each of the 35 columns above can be a rule by itself (though probably not a very good one). We start by asking, for each column and for each pair in the training data, whether the two records in the pair have the same value in that column. If they do, this pair is included in the candidate pairs generated by this rule.
import time
start = time.time()

rule_data = dict()     # rule_data[rule_id]: {'paircount': i, 'cols': []}
cols2rule_id = dict()  # cols2rule_id[frozenset of cols]: rule_id
rule_ctr = 0           # keeps track of which rule we're defining

for col in cols_to_use:
    rule_ctr += 1
    rule_id = "r{}".format(rule_ctr)  # r1, r2, ..., r_j, j=len(cols_to_use)
    # first just set up the new rule. We'll count the number of pairs in the raw
    # data that share a value in this col. pandas' groupby function creates the blocks.
    gx = records.groupby(col)
    # remember the number of pairs created by the blocks from this column;
    # btw, gx.size() gives the size of each block, and it's *fast*
    paircount = int(sum([(s * (s - 1) / 2) for s in gx.size()]))
    rule_data[rule_id] = {'paircount': paircount, 'cols': [col]}
    rule_id_key = frozenset([col])  # rule_id_key is a frozenset of cols
    cols2rule_id[rule_id_key] = rule_id
    # given this rule, for each training pair,
    # figure out if both records in the pair are in the same block.
    col_1 = "{}_1".format(col)
    col_2 = "{}_2".format(col)
    # each pair is in the same block if the two values are equal
    record_pairs[rule_id] = record_pairs[col_1] == record_pairs[col_2]
    # we don't need the actual field values after this.
    record_pairs.drop([col_1, col_2], axis=1, inplace=True)
end = time.time()
print("time = {:2.0f}s".format(end - start)) # time = 6s
record_pairs[0:3][['hash1', 'hash2', 'r1', 'r2', 'r4']]
Now we have a column for each field that records, for each pair, whether the two records share a value in that field (i.e., fall in the same block).
<class 'pandas.core.frame.DataFrame'>
Int64Index: 105248 entries, 0 to 105247
Data columns (total 37 columns):
hash1 105248 non-null object
hash2 105248 non-null object
r1 105248 non-null bool
r2 105248 non-null bool
<snip>
r35 105248 non-null bool
dtypes: bool(35), object(2)
memory usage: 5.9+ MB
The trick now is to use the logical information in these columns to define blocks that are conjunctions, that is, combinations of columns.
Each step in the next loop is pretty fast, but it runs over 7140 combinations of fields (see below), so altogether it takes about 30-45 minutes with my data. To figure out which training pairs each conjunction covers, we don't have to go back to the data: we just combine the one-column rules we generated in the previous step. (We do still group the full records table by each combination of columns to count the pairs that conjunction would generate.)
import itertools as it
len(list(it.combinations(cols_to_use, 2))) + len(list(it.combinations(cols_to_use, 3)))  # 595 + 6545 = 7140
import numpy as np
start = time.time()

# pick up the rule counter where the single-column rules left off,
# so the first combination rule gets a fresh id (r36, r37, ...)
rule_ctr = len(cols_to_use)

for num_cols in [2, 3]:
    for cols in it.combinations(cols_to_use, num_cols):
        rule_ctr += 1
        rule_id = "r{}".format(rule_ctr)
        # this is the same logic to count the pairs this block will generate
        # as in the previous loop
        gx = records.groupby(cols)
        paircount = int(sum([(s * (s - 1) / 2) for s in gx.size()]))
        rule_data[rule_id] = {'paircount': paircount, 'cols': cols}
        rule_id_key = frozenset(cols)
        cols2rule_id[rule_id_key] = rule_id
        # here we generate a rule_vec, which will contain True for each
        # pair that is covered by this rule, and False otherwise.
        #
        # numpy is 3x faster than the more obvious pandas approach
        # pairs[rule_id] = pairs[cols].apply(lambda r: all(r), axis=1)
        #
        # the col-by-col approach below is 35% faster than numpy's alltrue
        # pairs[rule_id] = np.alltrue(pairs[cols], axis=1)
        #
        # start by getting the rule_ids for the first two columns
        rule_id0 = cols2rule_id[frozenset([cols[0]])]
        rule_id1 = cols2rule_id[frozenset([cols[1]])]
        # combine these two columns by _and_
        rule_vec = np.logical_and(record_pairs[rule_id0], record_pairs[rule_id1])
        # then _and_ each successive column to the one we just created
        for col in cols[2:]:
            rule_id_x = cols2rule_id[frozenset([col])]
            rule_vec = np.logical_and(rule_vec, record_pairs[rule_id_x])
        record_pairs[rule_id] = rule_vec
    print("finished num_cols = {}".format(num_cols))
end = time.time()
print("time = {:2.0f}s".format(end - start)) # 2086s
import pickle
record_pairs.to_pickle("record_pairs.pkl")
pickle.dump(rule_data, file=open("rule_data.pkl", 'wb'))
pickle.dump(cols2rule_id, file=open("c2ri.pkl", 'wb'))
record_pairs.info()