In [1]:
from IPython.display import Image

How do we find duplicates among multiple, giant datasets?

Implementing adaptive blocking for database deduplication, part 2

In the previous post in this series, I described the problem of database deduplication. We need to compare every record in the database to every other record, and for databases of substantial size, there are simply too many pairs to compare directly. We use blocking techniques to reduce the number of pairs to consider, called the candidate set of pairs.

In this post, I'll describe how to prepare the data and the training data for blocking. I'm not going to share the data because some of it is confidential. In part 1, I pointed to the original data sources, and some of them are public. It's hard to extract a subset of non-confidential records that is still meaningful for blocking in the scheme I'm working on here, so I'm going to focus on the process. I'll show the structure of the data, and in part 4, I'll dig into the content of one record.

Data setup

Let's look first at the data. Here's the data I'm starting with:

In [2]:
# First, here is the data to be matched
import pandas as pd
records = pd.read_pickle("rp.pkl")

Let's take the fields in groups:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 361649 entries, 0 to 361648
Data columns (total 38 columns):
hashid                      361649 non-null object
name                        361649 non-null object
name_en                     361649 non-null object
sex                         357970 non-null object
age                         65960 non-null object
age_group                   55385 non-null object
date_of_death               361649 non-null object
governorate                 361649 non-null object
location                    260128 non-null object    

The hashid field is our unique identifier for each record. In an earlier stage of the processing, we use R's digest function to calculate a SHA1 hash of the most important fields we use for matching; we do this for each of the input records. When we get updates from the data partners, the hashid remains valid for records that have not changed in any of the fields we use, but it changes when the data partners update any of the fields we're watching. This is important because our training data depends on the hashid of the original records.

# R code:
hashfields <- c("dataset", "record_id", "name", "province", "date_of_death")
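
In Python, the same idea can be sketched with hashlib. This is a conceptual sketch only: R's digest() serializes its arguments its own way, so the actual hex digests will not match these, and the separator and field order here are my assumptions.

```python
import hashlib

HASHFIELDS = ("dataset", "record_id", "name", "province", "date_of_death")

def make_hashid(record, fields=HASHFIELDS):
    # join the watched fields with a separator unlikely to occur in the
    # data, then take the SHA1 of the UTF-8 bytes
    key = "|".join(str(record[f]) for f in fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

rec = {"dataset": "ds1", "record_id": "00042", "name": "Jane Doe",
       "province": "Aleppo", "date_of_death": "2013-05-01"}
print(make_hashid(rec))  # a 40-character hex digest
```

The point is stability: the digest changes if and only if one of the watched fields changes, which is what keeps the training labels attached to the right records across data updates.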

The name field is the originally recorded name, in Arabic, and name_en is the same field run through Google Translate. Sex, date_of_death, and age are what they appear to be, and age_group is one of child, adult, or elderly. The governorate is the Syrian province in which the killing occurred, and the location is descriptive text that explains a bit more about where it happened.

Here's the next block:

name_en_no_mo               361649 non-null object
sortedname                  361649 non-null object
sortedname_en               361649 non-null object

The first field, name_en_no_mo, contains the name with variants of the name "Mohammed" removed. This name is very common among Syrian men, so it tends to create too many false links. The sortedname field takes all the name pieces (first name, family name, other names) and sorts them lexicographically. This addresses the sometimes-jumbled combinations of names.
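
A sketch of how these two derived fields can be computed. The list of "Mohammed" spellings here is illustrative only; the real preprocessing may normalize more variants.

```python
import re

# illustrative variant list; the real pipeline may catch more spellings
MO_RE = re.compile(r"\b(mohammed|mohammad|muhammad|mohamed|mohamad)\b",
                   flags=re.IGNORECASE)

def name_en_no_mo(name_en):
    # drop "Mohammed" variants, then collapse the leftover whitespace
    return " ".join(MO_RE.sub("", name_en).split())

def sortedname(name):
    # sort the name pieces lexicographically to undo jumbled orderings
    return " ".join(sorted(name.split()))

print(name_en_no_mo("Mohammed Ali Hassan"))  # Ali Hassan
print(sortedname("Hassan Ali"))              # Ali Hassan
```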

The next block looks at subsets of names, prefixes and suffixes:

name_first5                 361649 non-null object
name_last5                  361649 non-null object
sortedname_first5           361649 non-null object
sortedname_last5            361649 non-null object
name_en_first4              361649 non-null object
name_en_last4               361649 non-null object
sortedname_en_first5        361649 non-null object
sortedname_en_last5         361649 non-null object
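
These fields are simple string slices, which pandas can compute in a vectorized way. Whether the pipeline strips whitespace first or handles very short names specially is a guess on my part.

```python
import pandas as pd

df = pd.DataFrame({"name_en": ["Jane Alexandra Doe", "Al Smith"]})
# .str slicing is vectorized and NaN-safe; names shorter than the
# window simply yield shorter slices
df["name_en_first4"] = df["name_en"].str[:4]
df["name_en_last4"] = df["name_en"].str[-4:]
print(df["name_en_first4"].tolist())  # ['Jane', 'Al S']
print(df["name_en_last4"].tolist())   # [' Doe', 'mith']
```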

Metaphones convert text fields into a sequence of phonemes that capture how the names sound when pronounced out loud. This is useful for finding matches between names that are spelled differently, but in ways that preserve the pronunciation, e.g. Jane / Jayne. I use this implementation. We're not sure if metaphones work properly with Arabic (see this warning), but whatever they're doing, sometimes the blocks work. In blocking, it's not at all important whether the blocks have any substantive meaning. All that matters is whether the resulting blocks group the greatest possible number of trained pairs together. Here are the names of variables that contain metaphone calculations.

name_meta_first             361649 non-null object
name_meta_last              361649 non-null object
sortedname_meta_first       361649 non-null object
sortedname_meta_last        361649 non-null object
name_en_meta_first          361649 non-null object
name_en_meta_last           361649 non-null object
sortedname_en_meta_first    361649 non-null object
sortedname_en_meta_last     361649 non-null object

One of the biggest challenges in information retrieval generally is grouping strings that are somehow similar to each other without comparing the strings directly. This is the problem we're managing here, but it's a general problem. I like a technique called locality sensitive hashing (LSH). Explaining LSH is well beyond the scope of this post, but see Rajaraman and Ullman, Mining of Massive Datasets (2011), especially ch.3.

For an application of LSH to blocking, see Steorts et al., Blocking Methods Applied to Casualty Records from the Syrian Conflict (2015). Side note: although I very happily collaborate with Steorts et al., I developed the implementation of LSH used here independently; that is, these are not the same fields that Steorts et al. have reported elsewhere.

In the section below, I have pre-computed LSH blocks for the fields for the name in Arabic, sorted name in Arabic, name in English, sorted name in English, and the name in English after deleting the string "Mohammed."

name_lsh                    361649 non-null int64
sortedname_lsh              361649 non-null int64
name_en_lsh                 361649 non-null int64
sorted_name_en_lsh          361649 non-null int64
name_en_no_mo_lsh           361649 non-null int64
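
My LSH implementation isn't shown here, but the general shape of a minhash-based version (following Rajaraman and Ullman) can be sketched as follows. The shingle size, the number of hash functions, and the reduction to a single integer block id per record are all assumptions on my part, not the production code.

```python
import hashlib

def shingles(s, k=2):
    """Character k-grams of the name, spaces removed."""
    s = s.replace(" ", "")
    return {s[i:i + k] for i in range(len(s) - k + 1)} or {s}

def minhash_signature(shings, num_hashes=20):
    """For each of num_hashes seeded hash functions, keep the
    minimum hash value over the shingle set."""
    return [min(int(hashlib.sha1(f"{seed}:{sh}".encode()).hexdigest(), 16)
                for sh in shings)
            for seed in range(num_hashes)]

def lsh_block(name, num_hashes=20, band_rows=4):
    """Reduce one band of the signature to a single integer block id:
    names whose signatures agree on that band share a block."""
    sig = minhash_signature(shingles(name), num_hashes)
    return hash(tuple(sig[:band_rows])) % (2 ** 31)
```

Similar names share many shingles, so their per-seed minimums often coincide and they land in the same block; a full LSH scheme would use several bands and assign several block ids per record.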

Finally, we include fields that are subsections of other fields: year and yearmo are pieces of the date; region is a purely geographic grouping. We do not mean to imply anything about connections by ethnic group, politics, history, armed group control, or anything else substantively interesting about space.

year                        361649 non-null object
yearmo                      361649 non-null object
region                      361649 non-null object

Now let's get the training data: the pairs of records that a human being has reviewed and labeled as a match or non-match.

In [3]:
# Next, here is the training data
pairs = pd.read_csv("../input/training-pairs.csv", sep='|')
pairs = pairs[pairs.match == 'Y']   # only use the matches, the labeled non-matches are ignored
pairs = pairs[['hash1', 'hash2']]   # we only need the id's, we can discard the metadata

Digging into the work

Our first task is to make a table that includes all the data for all the pairs we're reviewing. Each field in the records table above will appear in two copies, field_1 and field_2, holding the values of that field from the two records in the pair.

In [4]:
record_pairs = pd.merge(records, pairs, left_on='hashid', right_on='hash1')
record_pairs = pd.merge(records, record_pairs, left_on='hashid', right_on='hash2', 
                        suffixes=('_1', '_2'))
record_pairs.drop(['hashid_1', 'hashid_2'], axis=1, inplace=True)

The result of this join is a dataframe like this:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105248 entries, 0 to 105247
Data columns (total 76 columns):
name_1                        105248 non-null object
name_en_1                     105248 non-null object
sex_1                         104972 non-null object
<snip> ...
name_en_lsh_2                 105248 non-null int64
sorted_name_en_lsh_2          105248 non-null int64
name_en_no_mo_lsh_2           105248 non-null int64
hash1                         105248 non-null object
hash2                         105248 non-null object
dtypes: int64(10), object(66)
memory usage: 61.8+ MB

To know if a record belongs to a block defined by the LSH value calculated to group records on the name in English with the string "Mohammed" removed, we would compare name_en_no_mo_lsh_1 to name_en_no_mo_lsh_2. Now just a tiny bit more setup.

In [5]:
# here's a list of the columns we'll use in the blocking assessment. 
cols_to_use = ['name', 'name_en', 'sex', 'age', 'age_group', 'date_of_death', 
               'governorate', 'location',
               'sortedname', 'sortedname_en', 'name_en_no_mo',
               'name_first5', 'name_last5', 'sortedname_first5', 'sortedname_last5',
               'name_en_first4', 'name_en_last4', 'sortedname_en_first5', 'sortedname_en_last5', 
               'name_meta_first', 'name_meta_last', 'sortedname_meta_first',
               'sortedname_meta_last', 'name_en_meta_first', 'name_en_meta_last', 
               'sortedname_en_meta_first', 'sortedname_en_meta_last',               
               'name_lsh', 'sortedname_lsh', 'name_en_lsh', 
               'sorted_name_en_lsh', 'name_en_no_mo_lsh',
               'year', 'yearmo', 'region']

assert set(records.columns) > set(cols_to_use)

def human_format(num):
    ''' helper function to see big numbers more easily ''' 
    num = int(num)
    magnitude = 0
    while num >= 1000:
        magnitude += 1
        num /= 1000.0
    return '%.2f%s' % (num, ['', 'K', 'M', 'G', 'T', 'P'][magnitude])

Mapping training pairs to blocks built from a single column

Now we turn to the training data to check it against the possible rules. The idea is to generate a lot of blocks, which we've done above with 35 columns. Each column by itself can be a rule (though probably not a very good one). We start by asking, for each column and for each pair in the training data: do the two records' values in that column match? If they do, the pair is included in the candidate set generated by this rule.
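
The counting in the loop below rests on one piece of arithmetic: a block of size s generates s(s-1)/2 candidate pairs. A toy example with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({"yearmo": ["2013-05", "2013-05", "2013-05", "2014-01"]})
sizes = toy.groupby("yearmo").size()          # block sizes: 3 and 1
paircount = int(sum(s * (s - 1) / 2 for s in sizes))
print(paircount)  # 3: the block of three records yields 3 pairs, the singleton 0
```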

In [6]:
import time 
start = time.time()

rule_data = dict()        # rule_data[rule_id]: {'paircount': i, 'cols': []}
cols2rule_id = dict()     # cols2rule_id[col]: rule_id
rule_ctr = 0              # keeps track of what rule we're defining.

for col in cols_to_use: 
    rule_ctr += 1 
    rule_id = "r{}".format(rule_ctr)  # r1, r2, ..., r_j, j=len(cols_to_use)
    # first just set up the new rule. We'll count the number of pairs in the
    # raw data that share a value in this col. pandas' groupby creates blocks.
    gx = records.groupby(col)
    # remember the number of pairs created by the blocks from this column.
    # gx.size() gives the size of each block; a block of size s contributes
    # s*(s-1)/2 candidate pairs. it's *fast*
    paircount = int(sum([(s * (s - 1) / 2) for s in gx.size()]))
    rule_data[rule_id] = {'paircount': paircount, 'cols': [col]}
    rule_id_key = frozenset([col])  # rule_id_key is a frozenset of cols
    cols2rule_id[rule_id_key] = rule_id
    # given this rule, for each training pair, 
    #    figure out if both records in the pair are in the same block. 
    col_1 = "{}_1".format(col)
    col_2 = "{}_2".format(col)
    # each pair is in the same block if the two columns are the same
    record_pairs[rule_id] = record_pairs[col_1] == record_pairs[col_2]
    # we don't need the actual field values after this.
    record_pairs.drop([col_1, col_2], axis=1, inplace=True)

end = time.time()
print("time = {:2.0f}s".format(end - start))  # time = 6s

record_pairs[0:3][['hash1', 'hash2', 'r1', 'r2', 'r4']]
time =  6s
                                      hash1                                     hash2     r1     r2     r4
0  026d2c7778d93df3f48fa45b1cacda444a70617c  04d1ae98b97df6f7caca7d215df7c4fb46df861d  False  False  False
1  03eeb4f97354afec6d0b3ae24ec94127fadf3acf  076190a0722517a93ebe0c9d516b34b71f562053   True   True  False
2  004ed9c0ad4129fe5caee0bfd4919d6ea9aa4983  08a385a1a8908add2296c91856ffc98b732ea134   True   True  False

Now we have a column for each field that defines, for each pair, whether that pair shares a value (i.e., is in the same block).

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105248 entries, 0 to 105247
Data columns (total 37 columns):
hash1    105248 non-null object
hash2    105248 non-null object
r1       105248 non-null bool
r2       105248 non-null bool
<snip> ...
r35      105248 non-null bool
dtypes: bool(35), object(2)
memory usage: 5.9+ MB

The trick is now to use the logical information in these columns to define blocks that are conjunctions, that is, combinations of columns.

Mapping training pairs to blocks built from more than one column

Each step here is pretty fast, but the loop runs over 7140 combinations of fields (see below), so it takes about 30-45 minutes with my data. Note that we don't have to go back to the record fields to decide which training pairs each conjunction covers: at this point, we're just combining the one-column rules we generated in the previous step (the groupby is still needed to count each rule's candidate pairs).
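
The combination step amounts to nothing more than an elementwise AND over the boolean rule columns. In miniature, with made-up values:

```python
import numpy as np
import pandas as pd

rp = pd.DataFrame({"r1": [True, True, False],
                   "r2": [True, False, True]})
# a two-column conjunction covers a pair only when both
# single-column rules cover it
rp["r1_and_r2"] = np.logical_and(rp["r1"], rp["r2"])
print(rp["r1_and_r2"].tolist())  # [True, False, False]
```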

In [7]:
import itertools as it
len(list(it.combinations(cols_to_use, 2))) + len(list(it.combinations(cols_to_use, 3)))
In [8]:
import numpy as np 

start = time.time()
rule_ctr = len(cols_to_use)  # continue numbering after the single-column rules r1..r35

for num_cols in [2,3]:
    for cols in it.combinations(cols_to_use, num_cols): 
        rule_ctr += 1 
        rule_id = "r{}".format(rule_ctr)
        # this is the same logic to count the pairs this block will generate
        # as in the previous loop
        gx = records.groupby(cols)
        paircount = int(sum([(s * (s - 1) / 2) for s in gx.size()]))
        rule_data[rule_id] = {'paircount': paircount, 'cols': cols}
        rule_id_key = frozenset(cols)
        cols2rule_id[rule_id_key] = rule_id
        # here we generate a rule_vec, which will contain True for each 
        # pair that is covered by this rule, and False otherwise. 
        # numpy is 3x faster than the more obvious pandas approach of
        # applying all() across the single-column rule vectors, and the
        # col-by-col logical_and below is 35% faster than numpy's alltrue.
        # start by getting the rule_ids for the first two columns
        rule_id0 = cols2rule_id[frozenset([cols[0]])]
        rule_id1 = cols2rule_id[frozenset([cols[1]])]
        # combine these two columns by _and_ 
        rule_vec = np.logical_and(record_pairs[rule_id0], record_pairs[rule_id1])
        # then _and_ each successive column to the one we just created
        for col in cols[2:]:
            rule_id_x = cols2rule_id[frozenset([col])]
            rule_vec = np.logical_and(rule_vec, record_pairs[rule_id_x])
        record_pairs[rule_id] = rule_vec 
    print("finished num_cols = {}".format(num_cols))
end = time.time()
print("time = {:2.0f}s".format(end - start))  # 2086s
finished num_cols = 2
finished num_cols = 3
time = 2130s
In [9]:
import pickle
pickle.dump(rule_data, file=open("rule_data.pkl", 'wb'))
pickle.dump(cols2rule_id, file=open("c2ri.pkl", 'wb'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 105248 entries, 0 to 105247
Columns: 7177 entries, hash1 to r7175
dtypes: bool(7175), object(2)
memory usage: 722.5+ MB

Done! Well, done with the setup, anyway

With this, we have a framework for blocking rules with conjunctions of one, two, or three fields. In the next post, we'll use the data generated here to find an optimum blocking rule with these conjunctions.