
To Compress or Not to Compress, that was Uber's Question

Uber faced a challenge. They store a lot of trip data. A trip is represented as a 20K blob of JSON. It doesn't sound like much, but at Uber's growth rate saving several KB per trip across hundreds of millions of trips per year would save a lot of space. Even Uber cares about being efficient with disk space, as long as performance doesn't suffer. 

This highlights a key difference between linear growth and hypergrowth. Growing linearly, the storage needs would remain manageable. At hypergrowth, Uber calculated that with raw JSON, 32 TB of storage would last less than 3 years at 1 million trips per day, less than 1 year at 3 million trips per day, and less than 4 months at 10 million trips per day.
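The storage math is simple, if sobering. Here's a back-of-the-envelope sketch (my own arithmetic from the 20K-per-trip figure; Uber's lower estimates presumably also account for replication and index overhead):

```python
TB = 10**12  # decimal terabyte
TRIP_BYTES = 20 * 1024  # ~20 KB of raw JSON per trip

def days_of_storage(capacity_bytes: int, trips_per_day: int) -> float:
    """How long a disk lasts at a given daily trip volume."""
    return capacity_bytes / (TRIP_BYTES * trips_per_day)

print(days_of_storage(32 * TB, 1_000_000))   # 1562.5 days of raw JSON
print(days_of_storage(32 * TB, 10_000_000))  # 156.25 days
```

The point is that the lifetime of a fixed disk budget shrinks linearly with trip volume, so at hypergrowth the runway evaporates fast.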

Uber went about solving their problem in a very measured and methodical fashion: they tested the hell out of it. The goal of all their benchmarking was to find a solution that both yielded a small size and a short time to encode and decode.
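A minimal version of that benchmarking loop, measuring encoded size plus encode and decode time, can be sketched with the standard library alone (JSON + zlib stands in here for any of the candidate encoding/compression pairs; the sample trip record is made up for illustration):

```python
import json
import time
import zlib

def benchmark(obj, rounds=1000):
    """Return (encoded size, encode ms, decode ms) for JSON + zlib."""
    start = time.perf_counter()
    for _ in range(rounds):
        blob = zlib.compress(json.dumps(obj).encode("utf-8"))
    encode_ms = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    for _ in range(rounds):
        decoded = json.loads(zlib.decompress(blob).decode("utf-8"))
    decode_ms = (time.perf_counter() - start) * 1000

    assert decoded == obj  # any candidate must round-trip losslessly
    return len(blob), encode_ms, decode_ms

trip = {"trip_id": 42, "fare": 11.25, "path": [[37.77, -122.42]] * 50}
size, enc, dec = benchmark(trip)
```

Running the same harness over every protocol/compressor pair gives the full matrix of size and speed trade-offs.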

The whole experience is described in loving detail in the article: How Uber Engineering Evaluated JSON Encoding and Compression Algorithms to Put the Squeeze on Trip Data. They came up with a matrix of 10 encoding protocols (Thrift, Protocol Buffers, Avro, MessagePack, etc) and 3 compression libraries (Snappy, zlib, Bzip2). The target environment was Python. Uber wanted an IDL approach to define and validate their JSON schema, so they ended up only considering IDL-capable solutions.

The conclusion: MessagePack with zlib. Encoding time: 4231 ms. Decoding: 715 ms. There was a 78% reduction in size relative to the JSON/zlib combination.
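The winning pipeline is just serialize-then-compress. A sketch of the shape, using the standard library's json module as a stand-in (with the third-party msgpack package, `msgpack.packb` and `msgpack.unpackb` slot into the same two spots):

```python
import json
import zlib

def encode_trip(trip: dict) -> bytes:
    # serialize to bytes first, then compress the serialized bytes
    return zlib.compress(json.dumps(trip).encode("utf-8"))

def decode_trip(blob: bytes) -> dict:
    # reverse order: decompress first, then deserialize
    return json.loads(zlib.decompress(blob).decode("utf-8"))

trip = {"trip_id": 42, "fare": 11.25}
assert decode_trip(encode_trip(trip)) == trip
```

Compressing after serializing matters: a compact binary encoding like MessagePack gives zlib denser input to work with, which is where the extra 78% comes from.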

The result: a 1 TB disk will now last almost a year (347 days), compared to a month (30 days) without compression. Uber now has enough space to last over 30 years compared to just under 1 year before. That's a huge win for a relatively simple change. Hopefully there's a common library handling all the messaging so the change could be completely transparent to developers. Uber also noted that smaller payloads mean less data transiting through the system, which means less processing time, which means less hardware is needed. Another big win.
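One sketch of how such a common library could make the format change transparent: prefix every blob with a format byte, so legacy uncompressed records and new compressed ones decode through the same entry point. (The codec layout here is hypothetical, not from the article.)

```python
import json
import zlib

FORMAT_RAW_JSON = b"\x00"
FORMAT_JSON_ZLIB = b"\x01"

def encode(trip: dict) -> bytes:
    # all new writes use the compressed format
    return FORMAT_JSON_ZLIB + zlib.compress(json.dumps(trip).encode("utf-8"))

def decode(blob: bytes) -> dict:
    # dispatch on the leading format byte
    fmt, payload = blob[:1], blob[1:]
    if fmt == FORMAT_JSON_ZLIB:
        payload = zlib.decompress(payload)
    return json.loads(payload.decode("utf-8"))

# a legacy uncompressed record still decodes through the same call
legacy = FORMAT_RAW_JSON + json.dumps({"trip_id": 7}).encode("utf-8")
assert decode(legacy) == {"trip_id": 7}
```

With this in place, callers never see the format, and the next encoding change is just a new byte value.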

Something to consider: don't use JSON for messaging. Even compressed, the encode/decode times are dog slow. If you are going to use an IDL, which every grown-up project eventually moves to for reliability and security reasons, go for a binary protocol from the start. The performance savings can be dramatic. A lot of the performance drain comes from serialization/deserialization churning through memory, and that can be reduced greatly by avoiding text-based protocols like JSON. JSON is convenient, but it's also hugely wasteful at scale.

Reader Comments (6)

Typo: 32 TB of storage would last [LESS] than 3 years

March 22, 2016 | Unregistered CommenterAlex

Seems like a lot of circle jerk re-inventing the wheel. Storing geo data (and maybe position deltas) as binary numbers in a plain old SQL database (on a compressed file system if you're really cheap) would be just as storage efficient and much simpler. Explain to me again how it's better to do analytics on terabytes of schema-less data?

March 23, 2016 | Unregistered CommenterBill Gatez

MessagePack is a non-IDL encoding, yet that's what Uber chose. The article mentions Uber went for an IDL-based solution. What am I missing?

March 29, 2016 | Unregistered CommenterUkr

Yah, that confused me too. So I looked it up and there's https://github.com/msgpack-rpc/msgpack-rpc/tree/master/idl.

March 29, 2016 | Registered CommenterTodd Hoff

So by adding compression on all the data blobs, how much CPU overhead is that vs. not compressing? It's generally assumed that storage is cheaper than CPU... unless you're buying high-end CPUs or thousands of CPUs to keep up with the workload.

Would another option be to keep two tiers of data: immediate need and long term storage, where long term is compressed?

April 7, 2016 | Unregistered CommenterWeb Scaler

To follow up: when doing serverless, compute is often much cheaper than storing data, e.g. in DynamoDB. This is the same approach I used for a project; it cut data storage costs by 75% ($2000 per month) while increasing compute by only 10% ($50 per month).

April 4, 2019 | Unregistered Commenterjk
