Strategy

Strategy: Avoid Lots of Little Files

High Scalability

04 Nov 2015 — 2 min read

I've been bitten by this one. It happens when you quite naturally use the file system as a quick and dirty database. A directory is a lot like a table and a file name looks a lot like a key. You can store many-to-one relationships via subdirectories. And the path to a file makes a handy quick lookup key.

The problem is a file system isn't a database. That realization doesn't hit until you reach a threshold where there are actually lots of files. Everything works perfectly until then.

When the threshold is hit iterating a directory becomes very slow because most file system directory data structures are not optimized for the lots of small files case. And even opening a file becomes slow.

According to Steve Gibson on Security Now (@16:10) LastPass ran into this problem. LastPass stored every item in their vault in an individual file. This allowed standard file syncing technology to be used to update only the changed files. Updating a password changes just one file so only that file is synced.

Steve thinks this is a design mistake, but this approach makes perfect sense. It's simple and robust, which is good design given, what I assume, is the original reasonable expectation of relatively small vaults.

The problem is the file approach doesn't scale to larger vaults with thousands of files for thousands of web sites. Interestingly, decrypting files was not the bottleneck, the overhead of opening files became the problem. The slowdown was on the elaborate security checks the OS makes to validate if a process has the rights to open a file.

The new version of 1Password uses a UUID to shard items into one of 16 files based on the first digit of the UUID. Given good random number generation the files should grow more or less equally as items are added. Problem solved. Would this be your first solution when first building a product? Probably not.

Apologies to 1Password if this is not a correct characterization of their situation, but even if wrong, the lesson still remains.

Kafka 101

This is a guest article by Stanislav Kozlovski, an Apache Kafka Committer. If you would like to connect with Stanislav, you can do so on Twitter and LinkedIn. Originally developed in LinkedIn during 2011, Apache Kafka is one of the most popular open-source Apache projects out there. So far it

Capturing A Billion Emo(j)i-ons

This blog post was written by Dedeepya Bonthu. This is a repost from her Medium article, approved by the author. In stadiums, sports fans love to express themselves by cheering for their favorite teams, holding up placards and team logos. Emoji’s allow fans at home to rapidly express themselves,

Brief History of Scaling Uber

This blog post was written by Josh Clemm, Senior Director of Engineering at Uber Eats. This is a repost from his LinkedIn article, approved by the author. On a cold evening in Paris in 2008, Travis Kalanick and Garrett Camp couldn't get a cab. That's when

Behind AWS S3’s Massive Scale

This is a guest article by Stanislav Kozlovski, an Apache Kafka Committer. If you would like to connect with Stanislav, you can do so on Twitter and LinkedIn. AWS S3 is a service every engineer is familiar with. It’s the service that popularized the notion of cold-storage to the

Related Articles

Read more

Kafka 101

Capturing A Billion Emo(j)i-ons

Brief History of Scaling Uber

Behind AWS S3’s Massive Scale