
Kosmos File System (KFS) is a New High End Google File System Option

There's a new clustered file system on the spindle: Kosmos File System (KFS). Thanks to Rich Skrenta for turning me on to KFS; I think his blog post says it all. KFS is an open source project written in C++ by search startup Kosmix. The team members have a good pedigree, so there's a better-than-average chance this software will be worth considering.

After you stop trying to turn KFS into "Kentucky Fried File System" in your mind, take a look at KFS' intriguing feature set:

  • Incremental scalability: New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
  • Availability: Replication is used to provide availability in the face of chunkserver failures. Typically, files are replicated 3-way.
  • Per file degree of replication: The degree of replication is configurable on a per-file basis, with a maximum of 64.
  • Re-replication: Whenever the degree of replication for a file drops below the configured amount (for example, due to an extended chunkserver outage), the metaserver forces the affected block to be re-replicated on the remaining chunkservers. Re-replication is done in the background without overwhelming the system.
  • Re-balancing: Periodically, the metaserver may rebalance the chunks amongst chunkservers. This is done to help balance disk space utilization amongst nodes.
  • Data integrity: To detect disk corruption of data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.
  • File writes: The system follows the standard model. When an application creates a file, the filename becomes part of the filesystem namespace. For performance, writes are cached at the KFS client library. Periodically, the cache is flushed and data is pushed out to the chunkservers. Also, applications can force data to be flushed to the chunkservers. In either case, once data is flushed to the server, it is available for reading.
  • Leases: The KFS client library uses caching to improve performance. Leases are used to support cache consistency.
  • Chunk versioning: Versioning is used to detect stale chunks.
  • Client side fail-over: The client library is resilient to chunkserver failures. During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail over to another chunkserver and continue the read. This fail-over is transparent to the application.
  • Language support: The KFS client library can be accessed from C++, Java, and Python.
  • FUSE support on Linux: Mounting KFS via FUSE allows existing Linux utilities (such as ls) to interface with KFS.
  • Tools: A shell binary is included in the set of tools. This allows users to navigate the filesystem tree using utilities such as cp, ls, mkdir, rmdir, rm, and mv. Tools to monitor the chunkservers and metaserver are also provided.
  • Deploy scripts: To simplify launching KFS servers, a set of scripts is provided to (1) install KFS binaries on a set of nodes and (2) start/stop KFS servers on a set of nodes.
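
To make the data integrity and client-side fail-over bullets a little more concrete, here is a minimal Python sketch of a read that verifies a checksum and moves on to the next replica when a chunkserver is unreachable or returns corrupted data. It is a toy model, not KFS code: the `Replica` class, the `read_block` function, and the use of `zlib.adler32` as the checksum are all assumptions made for illustration, and in a real system the verification may well happen on the chunkserver rather than in the client.

```python
import zlib


class ChecksumMismatch(Exception):
    """Raised when a block's stored checksum does not match its data."""


class Replica:
    """Toy in-memory stand-in for one chunkserver's copy of a block (hypothetical)."""

    def __init__(self, name, data, reachable=True):
        self.name = name
        self.data = data
        self.checksum = zlib.adler32(data)  # checksum recorded at write time
        self.reachable = reachable

    def read(self):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.data, self.checksum


def read_block(replicas):
    """Return (data, bad_replicas) from the first replica that passes verification."""
    bad = []
    for replica in replicas:
        try:
            data, stored = replica.read()
            if zlib.adler32(data) != stored:  # verify on every read
                raise ChecksumMismatch(replica.name)
            return data, bad
        except (ConnectionError, ChecksumMismatch):
            bad.append(replica.name)  # candidates for re-replication
    raise IOError("no healthy replica available")


replicas = [
    Replica("cs1", b"hello", reachable=False),  # simulated outage
    Replica("cs2", b"hello"),
    Replica("cs3", b"hello"),
]
replicas[1].data = b"hellp"  # simulate on-disk corruption after the write

data, bad = read_block(replicas)
print(data, bad)  # b'hello' ['cs1', 'cs2']
```

The caller only ever sees good data; the list of bad replicas is what a metaserver-like component would use to decide which copies need to be re-created in the background.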

    This seems to compare very favorably to GFS and is targeted at:
  • Primarily write-once/read-many workloads
  • A few million large files, where each file is on the order of a few tens of MB to a few tens of GB in size
  • Mostly sequential access

    As Rich says, everyone needs to solve the "storage problem," and this looks like an exciting option to add to your bag of tricks. What we are still missing, though, is a Bigtable-like database on top of the file system for scaling structured data.

    If anyone is using KFS please consider sharing your experiences.

    Related Articles

  • Hadoop
  • Google Architecture
  • You Can Now Store All Your Stuff on Your Own Google Like File System.

    Reader Comments (9)

    One of the interesting aspects to Google's MapReduce is that they can schedule work to be done on the same rack as a data chunk. Does KFS provide this type of capability? Say I was using Java/JMS, could I somehow direct work to the appropriate racks?

    November 29, 1990 | Unregistered Commenter Anonymous

    I didn't find any mention of that capability, though I think the rise of 10 Gbps networks will make this a largely unneeded feature. Your compute and storage nodes won't need to be entwined if the networks have low enough latency and high enough bandwidth.

    November 29, 1990 | Unregistered Commenter Todd Hoff

    The ability to do job placement for Map/Reduce with KFS is available: The KFS client library exports an API to determine the location of a byte range of a file.

    In more detail, KFS is currently integrated with Hadoop using Hadoop's Filesystem API. When the Hadoop Map/Reduce layer schedules jobs, it relies on the underlying filesystem to determine the location of data. That is, the Hadoop Filesystem interface has an API "getFileCacheHints()". My Hadoop-KFS filesystem implementation supports this API. Thus, when KFS is used as an alternate backing store with Hadoop, job placement can be done.

    November 29, 1990 | Unregistered Commenter Sriram Rao
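
For readers wondering what "job placement" based on data location might look like, here is a minimal Python sketch. It assumes the filesystem can report which hosts hold each block of a file (the role the comment above attributes to the location API) and then prefers a worker running on one of those hosts. The function name `place_tasks`, the block IDs, and the host names are hypothetical; this is not how Hadoop's scheduler or the KFS client library is actually implemented.

```python
from collections import defaultdict


def place_tasks(block_locations, workers):
    """Assign each block to a worker, preferring one on a host that stores the block.

    block_locations: {block_id: [hosts holding a replica]}
    workers: list of worker hostnames
    Returns {block_id: (worker, is_local)}.
    """
    load = defaultdict(int)
    worker_set = set(workers)
    assignment = {}
    for block_id, hosts in block_locations.items():
        local = [h for h in hosts if h in worker_set]
        candidates = local or workers  # fall back to any worker if none is local
        chosen = min(candidates, key=lambda w: load[w])  # least-loaded candidate
        load[chosen] += 1
        assignment[block_id] = (chosen, bool(local))
    return assignment


# Hypothetical block-to-host map, as a location API might report it.
locations = {
    "file.dat:0": ["node1", "node2", "node3"],
    "file.dat:1": ["node2", "node3", "node4"],
    "file.dat:2": ["node9"],  # no worker runs on node9
}
plan = place_tasks(locations, ["node1", "node2", "node3"])
for block, (worker, is_local) in plan.items():
    print(block, "->", worker, "(local)" if is_local else "(remote)")
```

Blocks with a replica on a worker host are processed where the data lives; only the block stored on a non-worker host has to travel over the network.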

    Sriram -
    That's great! I find Hadoop over-engineered and began building my own workflow engine + distributed processing framework before I learned about it. At the moment it scales to more than 100,000 workers per task, but without a distributed FS equivalent to GFS the real bottleneck is that I have to keep the payloads in memory or send them with the messages for management. If I can adapt KFS to be the back-end store, then there won't be anything that Hadoop can do that I can't do better. ;-)

    November 29, 1990 | Unregistered Commenter Anonymous

    Glad to hear it! Please feel free to get in touch with me. I'd be happy to help you with KFS internals.


    November 29, 1990 | Unregistered Commenter Sriram Rao

    From http://kosmosfs.sourceforge.net/
    > Meta-data server : a single meta-data server that provides a global namespace

    Is there some sort of fault-tolerance or fail-over option for the "single meta-data server"? I'm presuming so, but didn't notice anything about it in the article or home page.

    November 29, 1990 | Unregistered Commenter Tom


    In the current release, there is no fail-over option for the meta-data server. We are working on this. This is something that will be fixed in the next release.

    If the concern is losing the filesystem due to a crash of the meta-data server node, you can work around it by backing up the meta-data server logs/checkpoints on a remote node (rather than relying only on the local disk of the meta-data server node); if NFS access is available, the logs/checkpoints could be stored on the NFS server.


    November 29, 1990 | Unregistered Commenter Sriram Rao
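
As a rough illustration of the workaround described in the comment above, here is a minimal Python sketch that copies new or changed files from log and checkpoint directories to a backup location, such as an NFS mount. The directory layout and the file names (`log.1`, `chkpt.1`) are hypothetical; check your own metaserver configuration for where it actually writes its logs and checkpoints. The demo uses temporary directories in place of real paths.

```python
import shutil
import tempfile
from pathlib import Path


def backup_metadata(src_dirs, backup_root):
    """Copy new or changed files from each source directory into backup_root."""
    copied = []
    for src in map(Path, src_dirs):
        dest_dir = Path(backup_root) / src.name
        dest_dir.mkdir(parents=True, exist_ok=True)
        for f in sorted(src.iterdir()):
            if not f.is_file():
                continue
            dest = dest_dir / f.name
            # Skip files whose backup copy has the same size and is at least as new.
            if dest.exists():
                s, d = f.stat(), dest.stat()
                if d.st_size == s.st_size and d.st_mtime >= s.st_mtime:
                    continue
            shutil.copy2(f, dest)  # copy2 preserves modification times
            copied.append(dest.relative_to(backup_root).as_posix())
    return copied


# Demo: temporary directories stand in for the real (site-specific) paths.
with tempfile.TemporaryDirectory() as tmp:
    logs = Path(tmp, "meta", "logs")
    checkpoints = Path(tmp, "meta", "checkpoints")
    backup = Path(tmp, "nfs_backup")  # imagine this is an NFS mount
    for d in (logs, checkpoints):
        d.mkdir(parents=True)
    (logs / "log.1").write_text("mkdir /a\n")
    (checkpoints / "chkpt.1").write_text("snapshot\n")

    first = backup_metadata([logs, checkpoints], backup)
    second = backup_metadata([logs, checkpoints], backup)  # nothing changed
    print(first, second)
```

Run from cron or a similar scheduler, something like this narrows the window of metadata you could lose, but it is not a substitute for real fail-over: a log file that is being written while it is copied may be captured only partially.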

    Is there any architecture/design doc for KFS?

    November 29, 1990 | Unregistered Commenter Einav Itamar


    I wanted to understand how MapReduce works when KFS is used as the file system in Hadoop. The KFS wiki mentions:

    >> ./bin/start-mapred.sh
    >>If the map/reduce job/task trackers are up, all I/O will be done to KFS.

    If I give only the metaserver IP and port in the Hadoop conf files, how are the TaskTracker daemons started on the different nodes that contain the data?

    After issuing a MapReduce command, would my Hadoop client fetch all the data from the different Kosmos servers to my local machine and then run MapReduce, or would it start the TaskTracker daemons on the machine(s) where the input file(s) are located and run MapReduce there? Please correct me if I am wrong, but I suppose that the location of the input files for MapReduce is returned by the function getFileBlockLocations (FileStatus, long, long).

    Thank you very much for your time and for helping me out.


    March 1, 2013 | Unregistered Commenter Nikhil
