Pomegranate - Storing Billions and Billions of Tiny Little Files
Pomegranate is a novel distributed file system built over distributed tabular storage that acts an awful lot like a NoSQL system. It's targeted at increasing the performance of tiny object access in order to support applications like online photo and micro-blog services, which require high concurrency, high throughput, and low latency. Their tests seem to indicate it works:
We have demonstrated that a file system over tabular storage performs well for highly concurrent access. In our test cluster, we observed linear scaling up to more than 100,000 aggregate read and write requests served per second (RPS).
Rather than sitting atop the file system like almost every other K-V store, Pomegranate is baked into the file system. The idea is that the file system API is common to every platform, so it wouldn't require a separate API to use. Every application could use it out of the box.
The features of Pomegranate are:
- It handles billions of small files efficiently, even in one directory;
- It provides a separate and scalable caching layer, which is snapshot-able;
- The storage layer uses a log-structured store to absorb small file writes and utilize disk bandwidth;
- It builds a global namespace for both small and large files;
- It uses columnar storage to exploit temporal and spatial locality;
- It uses a distributed extendible hash to index metadata;
- Its snapshot-able and reconfigurable caching increases parallelism and tolerates failures;
- Pomegranate should be the first file system built over tabular storage, and the experience of building it should be valuable for the file system community.
Can Ma, who leads the research on Pomegranate, was kind enough to agree to a short interview.
Can you please give an overview of the architecture and what you are doing that's cool and different?
Basically, there is no distributed or parallel file system that can handle billions of small files efficiently. However, we can foresee that web applications (such as email, photo, and even video) and bio-computing (gene sequencing) need massive small file accesses. Meanwhile, the file system API is general enough and well understood by most programmers.
Thus, we wanted to build a file system to manage billions of small files and provide high throughput for concurrent accesses. Although Pomegranate is designed for access to small files, it supports large files as well. It is built on top of other distributed file systems, such as Lustre, and only manages the namespace and the small files. We just want to stand on "the shoulders of giants". See the figure below:
Pomegranate has many Metadata Servers and Metadata Storage Servers to serve metadata requests and small file read/write requests. The MDSs are just a caching layer, which loads metadata from storage and commits memory snapshots back to storage. The core of Pomegranate is a distributed tabular storage system called xTable. It supports key-indexed multi-column lookups. We use a distributed extendible hash to locate the server for a key, because extendible hashing adapts better to scaling up and down.
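To make that lookup path concrete, here is a minimal Python sketch of how a distributed extendible hash directory can map a key to a server. Pomegranate itself is not written in Python, and the names below (ExtendibleHashDirectory, locate_server, split) are illustrative assumptions, not its real API.

```python
import hashlib

class ExtendibleHashDirectory:
    """Directory of 2**global_depth slots; each slot names the server that owns that bucket."""

    def __init__(self, servers):
        self.global_depth = 1
        # Start with two slots spread over the initial servers.
        self.slots = [servers[i % len(servers)] for i in range(2 ** self.global_depth)]

    def _index(self, key: str) -> int:
        # Use the low-order global_depth bits of the key's hash as the slot index.
        h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "little")
        return h & ((1 << self.global_depth) - 1)

    def locate_server(self, key: str) -> str:
        return self.slots[self._index(key)]

    def split(self, slot_index: int, new_server: str):
        # Doubling the directory: slot i in the old directory maps to slots i and
        # i + 2**(old depth) in the new one; only the split slot changes owner,
        # which is why extendible hashing adapts well to scaling up and down.
        old_len = len(self.slots)
        self.slots = self.slots * 2
        self.global_depth += 1
        self.slots[slot_index + old_len] = new_server

directory = ExtendibleHashDirectory(["mds0", "mds1"])
print(directory.locate_server("/photos/cat.jpg"))   # e.g. mds1
directory.split(0, "mds2")                          # hand half of slot 0's keys to mds2
print(directory.locate_server("/photos/cat.jpg"))   # most keys keep their old owner
```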
In file systems, the directory table and the inode table are usually kept separate to support two different types of lookup. Lookups by pathname are handled by the directory table, while lookups by inode number are handled by the inode table. It is nontrivial to consistently update these two indexes, especially in a distributed file system. Meanwhile, using two indexes increases lookup latency, which is unacceptable for accessing tiny files. Typically, there are in-memory caches for dentries and inodes, but these caches cannot easily be extended. Modifying metadata has to update multiple locations, and to keep them consistent an operation log is introduced. However, the operation log is always a serialization point for request flows.
Pomegranate uses a table-like directory structure to merge the directory table and the inode table. The two different types of lookup are unified into lookups by key. For the file system, the key is the hash value of the dentry name, and hash conflicts are resolved by a globally unique id for each file. For each update, we just need to search and update one table. To eliminate the operation log, we designed memory snapshots to get a consistent image; the dirty regions of each snapshot can be written to storage safely without considering concurrent modifications. (Concurrent updates are copy-on-write.)
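Here is a toy sketch of that merged table idea, under the assumption that a row keyed by the hash of (parent, name) carries both the dentry and inode columns. XTableSlice and its methods are hypothetical names; the real xTable is a distributed multi-column store, not a Python dict.

```python
import hashlib, itertools, time

_uuid_counter = itertools.count(1)

def dentry_hash(parent_ino: int, name: str) -> int:
    """Primary key: hash of the (parent directory, entry name) pair."""
    return int.from_bytes(hashlib.md5(f"{parent_ino}/{name}".encode()).digest()[:8], "little")

class XTableSlice:
    """One slice of the merged table: dentry and inode fields live in a single row."""

    def __init__(self):
        self.rows = {}   # hash -> list of rows, to tolerate hash conflicts

    def create(self, parent_ino, name, mode):
        key = dentry_hash(parent_ino, name)
        row = {
            "uuid": next(_uuid_counter),   # globally unique id disambiguates hash conflicts
            "name": name,
            "mode": mode,
            "size": 0,
            "mtime": time.time(),
        }
        self.rows.setdefault(key, []).append(row)
        return row

    def lookup(self, parent_ino, name):
        # One key lookup returns both the "dentry" and the "inode" columns,
        # so no second index has to be consulted or kept consistent.
        key = dentry_hash(parent_ino, name)
        for row in self.rows.get(key, []):
            if row["name"] == name:
                return row
        return None

slice_ = XTableSlice()
slice_.create(parent_ino=1, name="cat.jpg", mode=0o644)
print(slice_.lookup(1, "cat.jpg"))
```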
However, some complex file system operations such as mkdir, rmdir, hard link, and rename have to be considered; these operations must update at least two tables. We implemented a reliable multisite update service to propagate deltas from one table to another. For example, on mkdir we propagate the delta ("nlink +1") to the parent table.
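A minimal sketch of how such delta propagation could work, assuming a queue that retries each delta until the remote table accepts it. DeltaQueue and the toy mkdir below are illustrative, not Pomegranate's actual service.

```python
from collections import deque

class DeltaQueue:
    """Queues cross-table deltas and retries them until the remote table accepts them."""

    def __init__(self, send):
        self.pending = deque()   # persisted in the real system so deltas survive crashes
        self.send = send         # callable(target_table, key, delta) -> bool

    def enqueue(self, target_table, key, delta):
        self.pending.append((target_table, key, delta))

    def flush(self):
        while self.pending:
            target, key, delta = self.pending[0]
            if self.send(target, key, delta):
                self.pending.popleft()
            else:
                break            # remote site unreachable; retry on the next flush

def mkdir(local_table, parent_table_id, parent_key, name, deltas):
    local_table[name] = {"type": "dir", "nlink": 2}              # 1. create the directory row locally
    deltas.enqueue(parent_table_id, parent_key, ("nlink", +1))   # 2. propagate "nlink +1" to the parent

# Toy usage: the parent's table lives "elsewhere" and is updated by the delta.
parent_table = {"root": {"nlink": 2}}

def send(target, key, delta):
    field, amount = delta
    parent_table[key][field] += amount
    return True

dq = DeltaQueue(send)
mkdir({}, "parent-table", "root", "photos", dq)
dq.flush()
print(parent_table["root"]["nlink"])   # -> 3
```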
Are there any single points of failure?
There is no SPOF in the design. We use a cluster of MDSs to serve metadata requests. If one MDS crashes, its requests are redirected to other MDSs (consistent hashing and heartbeats are used). Metadata and small files are also replicated to multiple nodes; however, this replication is triggered by external sync tools and is asynchronous to the writes.
Small files have usually been the death of filesystems because of directory structure maintenance. How do you get around that?
Yep, small file access is deadly slow in traditional file systems. We replace the traditional directory table (B+ tree or hash tree) with a distributed extendible hash table. The dentry name and inode metadata are treated as columns of the table. Lookups from clients are sent (or routed if needed) to the correct MDS. Thus, to access a small file, we just need to access one table row to find the file's location. We keep each small file stored sequentially in the native file system, so one I/O access can serve a small file read.
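For illustration only, here is a sketch of that read path, assuming the metadata row records where the file's bytes sit in the storage server's log; the field names below are hypothetical.

```python
def read_small_file(metadata_row, read_at):
    """metadata_row: the single table row for this file; read_at: one positioned read."""
    # Because the file's bytes were stored contiguously, a single positioned read
    # of `length` bytes at `log_offset` returns the whole file.
    return read_at(metadata_row["server"], metadata_row["log_offset"], metadata_row["length"])

# Toy usage with an in-memory "log" standing in for the storage server's native file.
log = {"mdsl0": b"....hello, tiny file!...."}
row = {"server": "mdsl0", "log_offset": 4, "length": 17}
print(read_small_file(row, lambda srv, off, n: log[srv][off:off + n]))  # b'hello, tiny file!'
```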
What POSIX APIs are supported? Can files be locked, mapped, symlinked, etc?
At present, POSIX support is in progress. We do support symlinks and mmap access, but flock is not supported yet.
Why do a kernel level file system rather than a K-V store on top?
Our initial objective was to implement a file system to support more existing applications. That said, we do support a K/V interface on top of xTable now. In the architecture figure, the AMC client is the key/value client for Pomegranate. We support simple predicates on key or value; for example, "select * from table where key < 10 and 'xyz' in value" returns the k/v pairs whose value contains "xyz" and whose key is less than 10.
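As a rough illustration of that kind of predicate (not the real AMC client API), a filter over k/v pairs might look like this:

```python
def scan(table, key_pred=None, value_pred=None):
    """Yield (key, value) pairs matching the optional key and value predicates."""
    for key, value in table.items():
        if key_pred is not None and not key_pred(key):
            continue
        if value_pred is not None and not value_pred(value):
            continue
        yield key, value

table = {1: "abc", 5: "xyzzy", 12: "xyz"}
# Equivalent of: select * from table where key < 10 and 'xyz' in value
print(list(scan(table, key_pred=lambda k: k < 10, value_pred=lambda v: "xyz" in v)))
# -> [(5, 'xyzzy')]
```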
How does it compare to other distributed filesystems?
We want to compare small file performance with other file systems, but we have not tested it yet. We will do that in the next month, although we believe most distributed file systems cannot handle massive small file accesses efficiently.
Are indexes and any sort of queries supported?
For now, this support has not been properly considered yet. We plan to consider range queries next.
Does it work across datacenters, that is, how does it deal with latency?
Pomegranate only works in a datacenter. WAN support has not been considered yet.
It looks like you use an in-memory architecture for speed. Can you talk about that?
We use a dedicated memory cache layer for speed. Table rows are grouped into table slices. In memory, the table slices are hashed into a local extendible hash table for both performance and space consumption. This is shown in the figure below:
Clients issue a request by hashing the file name and looking it up in the bitmap, then use the consistent hash ring to locate the cache server (MDS) or storage server (MDSL). Each update first gets the *open* transaction group and simply applies to the in-memory table row. Each transaction group change is atomic. After all the pending updates are finished, the transaction group can be committed to storage safely. This approach is similar to Sun's ZFS.
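A simplified, single-process sketch of that transaction-group flow, with hypothetical names (TxnGroup, CacheLayer) and none of the real system's concurrency or copy-on-write machinery:

```python
class TxnGroup:
    def __init__(self, txg_id):
        self.txg_id = txg_id
        self.dirty = {}          # key -> copy of the modified in-memory row

class CacheLayer:
    def __init__(self, storage_write):
        self.open_txg = TxnGroup(1)      # updates always join the currently open group
        self.storage_write = storage_write

    def update(self, key, row):
        # Apply the change to the in-memory row; the copy keeps the snapshot
        # consistent even if the row is modified again after the group closes.
        self.open_txg.dirty[key] = dict(row)

    def commit(self):
        # Atomically switch to a new open group, then flush the closed group's
        # dirty regions to the storage layer (MDSL).
        closed, self.open_txg = self.open_txg, TxnGroup(self.open_txg.txg_id + 1)
        for key, row in closed.dirty.items():
            self.storage_write(key, row)
        return closed.txg_id

written = {}
cache = CacheLayer(storage_write=written.__setitem__)
cache.update("hash(cat.jpg)", {"size": 4096, "mtime": 1280000000})
print(cache.commit(), written)   # group 1 is durable; new updates go to group 2
```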
How is high availability handled?
Well, the central server for consistent hash ring management and failure coordination should be replicated with the Paxos algorithm. We plan to use ZooKeeper for the highly available central service.
Other components are designed to be fault tolerant. Crashes of an MDS or MDSL can be configured to recover immediately by routing requests to new servers (by selecting the next point in the consistent hash ring).
Operationally, how does it work? How are nodes added into the system?
Adding nodes to the caching layer is simple. The central server (R2) adds the new node to the consistent hash ring. All the cache servers act on this change and invalidate their cached table slices if those slices will now be managed by the new node. Requests from clients are routed to the new server, and a CH ring change notification piggybacks to the client so it can pull the new ring from the central server.
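A small sketch of that reconfiguration step, assuming a plain consistent hash ring and per-server sets of cached slices; R2's actual protocol and the names below are not from the Pomegranate code.

```python
import bisect, hashlib

def _h(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "little")

class Ring:
    """Consistent hash ring: a key belongs to the first node at or after its hash."""

    def __init__(self, nodes):
        self.points = sorted((_h(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, _h(key)) % len(self.points)   # wrap around the ring
        return self.points[i][1]

    def add(self, node: str):
        bisect.insort(self.points, (_h(node), node))

ring = Ring(["mds0", "mds1", "mds2"])
cached_slices = {"mds0": {"slice-a", "slice-b"}, "mds1": {"slice-c"}, "mds2": set()}

ring.add("mds3")   # R2 adds the new node to the ring and notifies the cache servers

# Each cache server invalidates only the slices that now belong to another node;
# the new owner will load them from storage on demand.
for node, slices in cached_slices.items():
    moved = {s for s in slices if ring.owner(s) != node}
    slices -= moved
```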
How do you handle large files? Is it suitable for streaming video?
As described earlier, large files are relayed to other distributed file systems. Our caching layer will not be polluted by the streaming video data.
Anything else you would like to add?
Here is another figure showing the interaction of Pomegranate's components.