Strategy: Avoid Lots of Little Files
I've been bitten by this one. It happens when you quite naturally use the file system as a quick and dirty database. A directory is a lot like a table and a file name looks a lot like a key. You can store many-to-one relationships via subdirectories. And the path to a file makes a handy quick lookup key.
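To make the pattern concrete, here is a minimal sketch of the file-system-as-database idea described above. The names `put`, `get`, and `keys` are hypothetical helpers invented for illustration, not taken from any particular product.

```python
from pathlib import Path
from typing import List


def put(root: Path, table: str, key: str, value: bytes) -> None:
    """Store a value: the directory acts as the table, the file name as the key."""
    table_dir = root / table
    table_dir.mkdir(parents=True, exist_ok=True)
    (table_dir / key).write_bytes(value)


def get(root: Path, table: str, key: str) -> bytes:
    """Look up a value: the path to the file is the lookup key."""
    return (root / table / key).read_bytes()


def keys(root: Path, table: str) -> List[str]:
    """List every key in a table by iterating the directory.

    This is the operation that tends to slow down once a directory
    holds a very large number of files.
    """
    return sorted(p.name for p in (root / table).iterdir())
```

Each record costs one file, so a table with a million rows is a directory with a million entries.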
The problem is a file system isn't a database. That realization doesn't hit until you reach a threshold where there are actually lots of files. Everything works perfectly until then.
Once the threshold is hit, iterating a directory becomes very slow because most file system directory data structures are not optimized for the case of lots of small files. Even opening a file becomes slow.
According to Steve Gibson on Security Now (@16:10), 1Password ran into this problem. 1Password stored every item in its vault in an individual file. This allowed standard file syncing technology to be used to update only the changed files. Updating a password changes just one file, so only that file is synced.
Steve thinks this is a design mistake, but the approach makes perfect sense. It's simple and robust, which is good design given what I assume was the original, reasonable expectation of relatively small vaults.
The problem is that the file-per-item approach doesn't scale to larger vaults with thousands of files for thousands of web sites. Interestingly, decrypting files was not the bottleneck; the overhead of opening files was. The slowdown came from the elaborate security checks the OS makes to validate whether a process has the rights to open a file.
The new version of 1Password uses a UUID to shard items into one of 16 files based on the first hex digit of the UUID. Given good random number generation, the files should grow more or less equally as items are added. Problem solved. Would this be your first solution when first building a product? Probably not.
Apologies to 1Password if this is not a correct characterization of their situation, but even if it is wrong, the lesson remains.
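The sharding idea can be sketched in a few lines. This is an illustration under stated assumptions, not 1Password's actual format: the JSON encoding, the `shard_<digit>.json` file names, and the helpers `shard_path`, `put_item`, and `get_item` are all made up for the example.

```python
import json
import uuid
from pathlib import Path
from typing import Optional


def shard_path(root: Path, item_id: uuid.UUID) -> Path:
    # The first hex digit of the UUID (0-9, a-f) picks one of 16 shard files.
    # File naming is an assumption made for this sketch.
    return root / f"shard_{item_id.hex[0]}.json"


def put_item(root: Path, item_id: uuid.UUID, item: dict) -> None:
    """Add or update an item inside the shard file its UUID maps to."""
    root.mkdir(parents=True, exist_ok=True)
    path = shard_path(root, item_id)
    shard = json.loads(path.read_text()) if path.exists() else {}
    shard[str(item_id)] = item
    path.write_text(json.dumps(shard))


def get_item(root: Path, item_id: uuid.UUID) -> Optional[dict]:
    """Return the stored item, or None if it is not present."""
    path = shard_path(root, item_id)
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(str(item_id))
```

With random (version 4) UUIDs the first hex digit is effectively uniform, so the vault never needs more than 16 file opens no matter how many items it holds. The trade-off is that updating one item rewrites, and therefore syncs, a whole shard rather than a single small file.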
Related Articles
- Efficient Access to Many Small Files in a Filesystem for Grid Computing
- The Small Files Problem
- Can a large number of (small) files degrade the performance of a filesystem?
- filesystem for millions of small files
- NTFS performance and large volumes of files and directories
- Pomegranate - Storing Billions And Billions Of Tiny Little Files
- What does opening a file actually do?