Hyperloglog: A Simple Estimation of The Number of Unique Elements in a Large Data Set

We are increasingly moving towards large datasets, where we often need to build hash tables from the data elements that we have. But how do we count the number of unique data elements? Well, first we have to parse our input data into data elements. As a simple example in Python, we can take a string and split it into words:

data1 = str1.split(" ")

Once we have these words, we can then create a data set from this, and then count the number of elements (cardinality) in the data set:

s1 = set(data1)
print("Actual cardinality is", len(s1))
print("Data set: ", s1)

The creation of the data set can be a computationally intensive operation, so a method known as HyperLogLog [2] can be used to quickly estimate the number of unique elements.

In this case, we will use the HyperLogLog implementation from the datasketch library, and compare it with the set-based count in Python:

import sys
from datasketch import HyperLogLog

str1 = "One two one two one two one two one two four"
if len(sys.argv) > 1:
    str1 = str(sys.argv[1])
data1 = str1.split(" ")
print("String:\t", str1)

m1 = HyperLogLog()
for d in data1:
    m1.update(d.encode('utf8'))
print("Estimated cardinality is", m1.count())

s1 = set(data1)
print("Actual cardinality is", len(s1))
print("Data set: ", s1)

For example, with the default string of “One two one two one two one two one two four” we have a cardinality of 4 (“One”, “one”, “two” and “four”):

String:  One two one two one two one two one two four
Estimated cardinality is 4.031579383843613
Actual cardinality is 4

LogLog

So how can we quickly estimate the number of elements that we have? This problem was solved by Flajolet and Martin [1].

Suppose each sample is a six-digit sequence, and we find the longest run of ‘0’ digits in each one. Then “123456” has a longest run of zero, while “001234” has a longest run of two. For 10 samples, the longest run of zeros seen is likely to be 1; for 100 samples, it is likely to be 2. Overall, if the longest run of zeros observed is L, we can estimate that the population size is around 10ᴸ. In this way, we significantly reduce the complexity of the problem by a logarithmic factor.
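This intuition can be sketched in a few lines of Python (the helper names here are illustrative, not from the paper): draw random six-digit strings and estimate the population size from the longest run of zeros seen across them.

```python
import random

def longest_zero_run(digits: str) -> int:
    """Length of the longest consecutive run of '0' characters."""
    best = run = 0
    for ch in digits:
        run = run + 1 if ch == "0" else 0
        best = max(best, run)
    return best

def estimate_population(n: int, rng: random.Random) -> int:
    """Draw n random six-digit strings; if L is the longest zero run
    seen across them, estimate the population size as 10**L."""
    longest = max(longest_zero_run(f"{rng.randrange(10**6):06d}")
                  for _ in range(n))
    return 10 ** longest

rng = random.Random(1)
print(estimate_population(10, rng), estimate_population(100, rng))
```

Running this repeatedly shows the estimate tracking the order of magnitude of the sample count, though any single run can be off by a factor of ten, which is why the methods below average several estimates.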

The paper takes input values and converts them into a hash form, so that we now have binary sequences. If the longest run of zeros across the hashed values is L, the estimate of the cardinality is then 2ᴸ. One refinement is to use a range of hashing methods, each independent of the others, and average the resulting estimates.
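A minimal sketch of this averaging step, using salted SHA-256 digests to stand in for independent hash functions (the salts and the 32-bit truncation are illustrative assumptions, not part of the original papers):

```python
import hashlib

def longest_zero_run(bits: str) -> int:
    """Length of the longest consecutive run of '0' bits."""
    best = run = 0
    for b in bits:
        run = run + 1 if b == "0" else 0
        best = max(best, run)
    return best

def hash_bits(value: str, salt: str) -> str:
    """Salted SHA-256, truncated to 32 bits, as a binary string."""
    digest = hashlib.sha256((salt + value).encode()).digest()
    return format(int.from_bytes(digest[:4], "big"), "032b")

def averaged_estimate(values, salts=("s1", "s2", "s3", "s4")):
    # One estimate 2**L per "hash function", then take the average.
    runs = [max(longest_zero_run(hash_bits(v, s)) for v in values)
            for s in salts]
    return sum(2 ** L for L in runs) / len(runs)

words = "One two one two one two one two one two four".split()
print(averaged_estimate(words))
```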

For example, we may get the longest runs of zeros from the hashing methods as L₁, L₂, …, Lₘ; the estimate is then based on the average of these values. The HyperLogLog paper [2] refined this method by using a single hashing method and splitting the values into different buckets. Each bucket is indexed by the first few bits of the hash value, and we then count the longest run of zeros among the remaining bits of the values in that bucket. For example, we might create eight buckets, where the first three bits identify the bucket. In the following, the first two values fall into the first bucket, and the next three values into the second bucket:

000 010010
000 100100
001 110010
001 011010
001 100010

In the first bucket, the longest run of zeros is 2, and in the second, it is 3, so the average is 2.5.
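The bucket calculation above can be reproduced directly. This sketch counts the longest run of zeros anywhere in the bits after the three index bits, which matches the numbers in the worked example (the function names are illustrative):

```python
def longest_zero_run(bits: str) -> int:
    """Length of the longest consecutive run of '0' bits."""
    best = run = 0
    for b in bits:
        run = run + 1 if b == "0" else 0
        best = max(best, run)
    return best

def bucket_average(hashes, index_bits=3):
    """Index each hash by its first `index_bits` bits, record the longest
    zero run of the remaining bits per bucket, and average the results."""
    buckets = {}
    for h in hashes:
        key, rest = h[:index_bits], h[index_bits:]
        buckets[key] = max(buckets.get(key, 0), longest_zero_run(rest))
    return sum(buckets.values()) / len(buckets)

hashes = ["000010010", "000100100",
          "001110010", "001011010", "001100010"]
print(bucket_average(hashes))  # 2.5, as in the worked example
```

The full HyperLogLog algorithm then converts this averaged run length into a cardinality estimate, using a harmonic mean and a correction factor rather than the plain average shown here.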

References

[1] Flajolet, P., & Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of computer and system sciences, 31(2), 182–209.

[2] Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007, June). Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science (pp. 137–156). Discrete Mathematics and Theoretical Computer Science.