HyperLogLog [1] provides a way to estimating the number of distinct values (cardinality) in a data set. In this case we will use a HyperLogLog method, and compare with the creation of a dataset in Python. For example, with a string of "One one two one two" we have a cardinality of 3 ("One", "one" and "two").
HyperLogLog - estimating the number of unique data elements |
Outline
The following uses the Hyperloglog method to estimate the number of data elements (in this case the data elements are words within a string):
import sys from datasketch import HyperLogLog str1 = "One two one two one two one two one two four" if (len(sys.argv)>1): str1=str(sys.argv[1]) data1 =str1.split(" ") print ("String:\t",str1) m1 = HyperLogLog() for d in data1: m1.update(d.encode('utf8')) print("Estimated cardinality is", m1.count()) s1 = set(data1) print("Actual cardinality is", len(s1))
Coding
References
[1] Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007, June). Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science (pp. 137-156). Discrete Mathematics and Theoretical Computer Science.