Data ‘hashing’ improves estimate of the number of victims in databases
But while HRDAG’s estimate relied on human workers painstakingly weeding out potential duplicate records, hashing with statistical estimation proved to be faster, easier and less expensive. The researchers said hashing also had the important advantage of a narrow confidence interval: The margin of error is plus or minus 1,772, or less than 1 percent of the total number of victims.
“The big win from this method is that we can quickly calculate the probable number of unique elements in a dataset with many duplicates,” said Patrick Ball, HRDAG’s director of research. “We can do a lot with this estimate.”
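To give a rough sense of how hashing can count unique elements in a dataset full of duplicates, here is a minimal Python sketch. It is not the researchers' method, which the article does not spell out. It uses a generic k-minimum-values estimator, and it treats only exact duplicates as the same record, whereas real victim records often differ by spelling or missing fields. The names `hash01` and `estimate_distinct`, the choice of SHA-256, and the parameter `k` are all assumptions made for illustration.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map a string to a pseudo-uniform number in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(items, k=256):
    """Estimate the number of distinct items with a k-minimum-values sketch.

    Only the k smallest distinct hash values are kept, so memory use does
    not grow with the size of the dataset. Exact duplicates hash to the
    same value and are therefore counted once.
    """
    heap = []        # negated hashes, so heap[0] is the largest value kept
    members = set()  # hash values currently in the heap
    for item in items:
        h = hash01(item)
        if h in members:
            continue  # duplicate of a record already in the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items: exact count
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest
```

The idea is that if the hash values are spread evenly between 0 and 1, the k-th smallest one lands near k divided by the number of distinct items, so its position reveals roughly how many distinct items there are. A larger `k` tightens the estimate at the cost of more memory; the relative error shrinks roughly in proportion to one over the square root of `k`.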