Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory
This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month.
At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is challenge when the cardinality of the set is large.
To better understand the challenge of determining the cardinality of large sets let's imagine that you have a 16 character ID and you'd like to count the number of distinct IDs that you've seen in your logs. Here is an example:
4f67bfc603106cb2
These 16 characters represent 128 bits. 65K IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs require 384,000,000,000 bits or 45 gigabytes of storage. And that is just the space that the ID field requires! To get the cardinality of IDs in our daily events we could take a simplistic approach. The most straightforward idea is to use an in memory hash set that contains the unique list of IDs seen in the input files. Even if we assume that only 1 in 3 records are unique the hash set would still take 119 gigs of RAM, not including the overhead Java requires to store objects in memory. You would need a machine with several hundred gigs of memory to count distinct elements this way and that is only to count a single day's worth of unique IDs. The problem only gets more difficult if we want to count weeks or months of data. We certainly don't have a single machine with several hundred gigs of free memory sitting around so we needed a better solution.
One common approach to this problem is the use of bitmaps. Bitmaps can be used to quickly and accurately get the cardinality of a given input. The basic idea with a bitmap is mapping the input dataset to a bit field using a hash function where each input element uniquely maps to one of the bits in the field. This produces zero collisions, and reduces the space required to count each unique element to 1 bit. While bitmaps drastically reduce the space requirements from the naive set implementation described above they are still problematic when the cardinality is very high and/or you have a very large number of different sets to count. For example, if we want to count to one billion using a bitmap you will need one billion bits, or roughly 120 megabytes for each counter. Sparse bitmaps can be compressed in order to gain space efficiency, but that is not always helpful.
Luckily, cardinality estimation is a popular area of research. We've leveraged this research to provide a open source implementation of cardinality estimators, set membership detection, and top-k algorithms.
Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare's works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:
Counter |
Bytes Used |
Count |
Error |
HashSet |
10447016 |
67801 |
0% |
Linear |
3384 |
67080 |
1% |
HyperLogLog |
512 |
70002 |
3% |
The table shows that we can count the words with a 3% error rate using only 512 bytes of space. Compare that to a perfect count using a HashMap that requires nearly 10 megabytes of space and you can easily see why cardinality estimators are useful. In applications where accuracy is not paramount, which is true for most web scale and network counting scenarios, using a probabilistic counter can result in tremendous space savings.
Linear Probabilistic Counter
The Linear Probabilistic Counter is space efficient and allows the implementer to specify the desired level of accuracy. This algorithm is useful when space efficiency is important but you need to be able to control the error in your results. This algorithm works in a two-step process. The first step assigns a bitmap in memory initialized to all zeros. A hash function is then applied to the each entry in the input data. The result of the hash function maps the entry to a bit in the bitmap, and that bit is set to 1. The second step the algorithm counts the number of empty bits and uses that number as input to the following equation to get the estimate.
n=-m ln Vn
In the equation m is the size of the bitmap and Vn is the ratio of empty bits over the size of the map. The important thing to note is that the size of the original bitmap can be much smaller than the expected max cardinality. How much smaller depends on how much error you can tolerate in the result. Because the size of the bitmap, m, is smaller than the total number of distinct elements, there will be collisions. These collisions are required to be space-efficient but also result in the error found in the estimation. So by controlling the size of the original map we can estimate the number of collisions and therefore the amount of error we will see in the end result.
Hyper LogLog
The Hyper LogLog Counter's name is self-descriptive. The name comes from the fact that you can estimate the cardinality of a set with cardinality Nmax using just loglog(Nmax) + O(1) bits. Like the Linear Counter the Hyper LogLog counter allows the designer to specify the desired accuracy tolerances. In Hyper LogLog's case this is done by defining the desired relative standard deviation and the max cardinality you expect to count. Most counters work by taking an input data stream, M, and applying a hash function to that set, h(M). This yields an observable result of S = h(M) of {0,1}^∞ strings. Hyper LogLog extends this concept by splitting the hashed input stream into m substrings and then maintains m observables for each of the substreams. Taking the average of the additional observables yields a counter whose accuracy improves as m grows in size but only requires a constant number of operations to be performed on each element of the input set. The result is that, according to the authors of this paper, this counter can count one billion distinct items with an accuracy of 2% using only 1.5 kilobytes of space. Compare that to the 120 megabytes required by the HashSet implementation and the efficiency of this algorithm becomes obvious.
Merging Distributed Counters
We've shown that using the counters described above we can estimate the cardinality of large sets. However, what can you do if your raw input dataset does not fit on single machine? This is exactly the problem we face at Clearspring. Our data is spread out over hundreds of servers and each server contains only a partial subset of the the total dataset. This is where the fact that we can merge the contents of a set of distributed counters is crucial. The idea is a little mind-bending but if you take a moment to think about it the concept is not that much different than basic cardinality estimation. Because the counters represent the cardinality as set of bits in a map we can take two compatible counters and merge their bits into a single map. The algorithms already handle collisions so we can still get a cardinality estimation with the desired precision even though we never brought all of the input data to a single machine. This is terribly useful and saves us a lot of time and effort moving data around our network.
Next Steps
Hopefully this post has helped you better understand the concept and application of probabilistic counters. If estimating the cardinality of large sets is a problem and you happen to use a JVM based language then you should check out the stream-lib project — it provides implementations of the algorithms described above as well as several other stream-processing utilities.
Related Articles
Reader Comments (39)
Thanks for your marvelous posting! I actually enjoyed reading it, you will be a great author.I will ensure that I bookmark your blog and will come back in the foreseeable future. I want to encourage that you continue your great job, have a nice weekend!think you’ve made some truly interesting points.
640-802//
Still exploring the algorithms. A couple of quick observations:
1) In scenarios where BF is practical, e.g. when we approximately know the size of the input multiset and allocate a comfortably sized BF (and especially when we only care about "is the cardinality larger than X? for a fixed "X") BF is much more accurate and reliable.
2) Also, BF only errs on one side (it can only underestimate the cardinality - so it's always provides a lower bound, whereas probabilistic counting can err in both sides.
Guava's BloomFilter#put was made to return a boolean ("was the BF updated?") particularly to enable approximate cardinality counting.
Most of the algorithms I saw (4 or 5) are very elegant, but each seems to have weird weaknesses and limitations (some are bad with small cardinalities, some require cumbersome tuning), so, not sure what of these could be offered as part of a general purpose library (I'm still examining the potential).
To be honest I was hoping that 512 counters could lead to better estimates than 3% off. Did you try any of the tuning/corrections recommended by Flajolet to see if that succeeds in bringing the error down? Seems like a bit of black art there.
Dimitris -
Yes, certainly if you know the size of the multiset and you can fit the required bits in memory than that provides a more accurate solution than any of the probabilistic options. However if you need to count many dimensions than storing n BFs in memory where n is the number of dimensions you want to count can cause problems. In our use cases we often end up with many tens of thousands of counters. Storing any individual counter in memory using a BF would not be an issue but storing them all in memory becomes problematic.
One thing we've done in stream-lib is create a CountThenEstimate class, https://github.com/clearspring/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/CountThenEstimate.java, that 'tips' after the cardinality exceeds a predefined threshold. So below a certain value we use a simple set to get a perfect answer but when we tip over to counter converts to one of our probabilistic counters.
The HyperLogLog algorithm provides a small range correction function that makes the implementation reasonably accurate for small cardinalities. I've added an example implementation to the stream-lib project. https://github.com/clearspring/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java.
Re using a BloomFilter to estimate cardinality. That is a reasonable solution if you only have a single dimension and your data is only present only on a single node. Because a BloomFilter#put returns the true/false flag if the set was modified by your addition you can only keep a raw count outside of the BloomFilter. Because of this there is no way to merge the results from multiple dimensions or servers right? You can't merge N bloomfilters and than asked the merged data structure what the total count is.
HyperLogLog gets around 3% standard deviation error with a 512 counter. That is a counter that count more than a billion elements at that 3% error which is pretty impressive. If your multiset is much smaller than a billion than you can use other counter implementations that will get higher accuracy with less space requirements. For example a linear counter can give you 1% error for sets up to around 3.5K using 730. And of course if you need a more accurate solution and can afford the space HLL can give you a more accurate solution than 3% if that is what is desired. I'm not sure which tuning recommendations you are referring to, can you point me to the relevant section of his papers?
Feel free to contact me directly if you'd like to discuss this further.
Matt
@abramsm
Grimaldi - We haven't considered implementing approximate histograms in stream-lib. Thanks for the pointer. We do use the approximate histogram implementation from this project, https://github.com/codahale/metrics/blob/master/metrics-core/src/main/java/com/yammer/metrics/core/Histogram.java
Matt
Hi Matt,
Sorry for the vague reference, there are many papers and I was hoping that you would magically know what I'm talking about :) I think I mean the "Loglog Counting of Large Cardinalities" paper, section 5, "Algorithmic engineering", truncation rule, restriction rule. Not sure how well these translate to HyperLogLog though.
In my case, I was aiming for small counts, around 3000, and got errors around 3% on average, with HyperLogLog and 512 counters, but the variance was very high (sometimes I saw estimates as low as 2600 and as high as 3340, or so). A BF of just 4x the space, would always give an estimate in 2905 to 2950 - a simple range correction and I would get nearly the correct answer every time. Of course, BF has its drawbacks too, as you note.
I somehow missed the linear counting approach (it's even in your post, duh), thanks for reminding me of that, will have to try. Will write back if I have something to add, thanks again
Dimitris
For the distributed counter case, would the algebra behind Join Sampling be useful? The concept is you drop some samples in each leg, then the join knows how many samples were dropped and subsamples the remaining join.
http://research.microsoft.com/pubs/76565/sig99sam.pdf
http://theory.stanford.edu/~matias/papers/join-sigmod96.pdf
I haven't thought of using Join Sampling. Thanks for the links to the papers. If you are interested in contributing something like this please join our mailing list for the stream-lib (https://github.com/clearspring/stream-lib) project so we can discuss with the group.
Matt
Sorry, too many other crazy projects. Also, just because I read the papers doesn't mean I understand them :)
After spending a few years with HyperLogLog as a key component of our infrastructure, we wrote up a blog post about it and included a D3-based simulation to help you understand how it works. I hope that it brings additional insight into this awesome algorithm.
For the statement, "Even if we assume that only 1 in 3 records are unique the hash set would still take 119 gigs of RAM."
What's the math that figures it will take 119 gigs of RAM?
I'm certainly not understanding this question correctly and hoping for clarification. I know I'm a noob, but want to understand more. It seems to be that 4 bytes = 32 bits and since 2^32 = about 4 billion I don't understand why 4 bytes can't count 1 billion objects. Perhaps the reason is that 1 in 4 numbers would then be a valid object and this would not be secure. Explanation is appreciated!!!
"Compare that to the 120 megabytes required by the HashSet ......" The HashSet here should be bitmap ?
Wonderful Post. Really amazing and informative.
In the sentence "Even if we assume that only 1 in 3 records are unique the hash set would still take 119 gigs of RAM," it is mentioned a few sentences earlier that the total storage for the IDs alone should run to 45 GB. That being the case, and 1/3 are unique, how can the size of the hash come out to greater than 45 GB ? I think it should be (2/3) * 45 = 30GB