Monday, November 24, 2008

Product: Scribe - Facebook's Scalable Logging System


In Log Everything All the Time I advocate applications shouldn't bother logging at all. Why waste all that time and code? No, wait, that's not right. I preach logging everything all the time. Doh. Facebook obviously feels similarly, which is why they open sourced Scribe, their internal logging system, capable of logging tens of billions of messages per day. These messages include access logs, performance statistics, actions that went to News Feed, and many others.

Imagine hundreds of thousands of machines across many geographically dispersed datacenters just aching to send their precious log payload to the central repository of all knowledge. Because really, when you combine all the metadata with all the events you pretty much have a complete picture of your operations. Once in the central repository, logs can be scanned, indexed, summarized, aggregated, refactored, diced, data cubed, and mined for every scrap of potentially useful information.

Just imagine the log stream from all of Facebook's Apache servers alone. Brutal. My guess is these are not real-time feeds so there are no streaming query issues, but the task is still daunting. Let's say they log 10 billion messages a day. That's about 115,000 messages per second on average. At the upper end of "tens of billions", near 100 billion a day, it would be over 1 million messages per second!
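
The rate is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the daily volumes are illustrative figures, not measured values, and the calculation assumes messages arrive evenly across the day. It shows that 10 billion a day averages roughly 115,000 per second, and that it takes about 100 billion a day to pass 1 million per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def messages_per_second(messages_per_day: int) -> float:
    """Average rate, assuming messages are spread evenly across the day."""
    return messages_per_day / SECONDS_PER_DAY


# Illustrative daily volumes: 10 billion and 100 billion.
for per_day in (10_000_000_000, 100_000_000_000):
    rate = messages_per_second(per_day)
    print(f"{per_day:,} per day is about {rate:,.0f} per second")
```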

When no off-the-shelf products worked for them they built their own. Scribe can be downloaded from SourceForge. But the real action is on their wiki. It's here you'll find some decent documentation and their support forums. There's not much activity on the site, so you haven't missed your chance to be a charter member of the Scribe guild.

A logging system has three broad components:

  • Client Code Interface - How does your code interact with the log system? Scribe doesn't do much for you here. There's a simple Thrift interface for logging from a large set of languages, but the bulk of the work is still up to you.
  • Distribution System - This is where Scribe fits. It reliably (mostly) moves large numbers of messages around. A few error cases lead to data loss or duplication: 1) If a client can't connect to either the local or central scribe server the message will be lost; 2) If a scribe server crashes it could lose a small amount of data that's in memory but not on disk; 3) Some multiple component failure cases, such as a resender that can't connect to any central server while its local disk fills up; 4) Some rare timeout conditions can lead to duplicate messages.
  • Do Something Usefullizer - How do you do anything useful with over 1 million messages per second? Good question. Scribe doesn't help here. But Scribe will get your data there.

    I browsed around the source and it's a well-crafted, straightforward socket server that forwards messages to other servers and can write messages to disk. Nothing fancy, which is probably why it works for them. Its basic function is:

    Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn't available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination, typically on an nfs filer or a distributed file system, or send them to another layer of scribe servers.
    In some ways it could be fancier. For example, there's no throttle on incoming connections so a server can chew up memory. And there is a max_msg_per_second throttle on message processing, but this is really too simple. Throttling needs to be adaptive based on local conditions and the conditions of downstream servers. Under load you want to push flow control back to the client so the data stays there until resources become available. Simple configuration file settings rarely work when the world starts getting weird.
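
One way to picture pushing flow control back to the client is the sketch below. It is not Scribe's implementation: `BoundedServer`, `Client`, and the capacity value are hypothetical names made up for illustration, and the only detail borrowed from Scribe is the pair of result codes, OK and TRY_LATER, from its Thrift interface. The server refuses a batch when its queue is full, and the client holds on to the data until a later attempt succeeds.

```python
from collections import deque

OK, TRY_LATER = "OK", "TRY_LATER"  # mirrors the two ResultCode values


class BoundedServer:
    """Hypothetical server that refuses work when its queue is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def log(self, messages):
        # Reject the whole batch rather than let memory grow without bound.
        if len(self.queue) + len(messages) > self.capacity:
            return TRY_LATER
        self.queue.extend(messages)
        return OK

    def drain(self, n):
        """Simulate downstream progress by processing up to n messages."""
        for _ in range(min(n, len(self.queue))):
            self.queue.popleft()


class Client:
    """Hypothetical client that keeps data locally until the server accepts it."""

    def __init__(self, server):
        self.server = server
        self.pending = deque()

    def log(self, message):
        self.pending.append(message)
        self.flush()

    def flush(self):
        # Send everything that is pending; keep it all if the server says TRY_LATER.
        if self.pending and self.server.log(list(self.pending)) == OK:
            self.pending.clear()
```

A real system would also need a retry schedule and a bound on the client's own buffer; both are left out to keep the sketch short.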

    Client Code Interface

    Here's what the Thrift interface looks like:

    enum ResultCode
    {
      OK,
      TRY_LATER
    }

    struct LogEntry
    {
      1: string category,
      2: string message
    }

    service scribe extends fb303.FacebookService
    {
      ResultCode Log(1: list<LogEntry> messages);
    }

    I know, I thought the same thing. Thank God there's another IDL syntax. We simply did not have enough of them. Thrift translates this IDL into the glue code necessary for making cross-language calls (marshalling arguments and responses over the wire). The Thrift library also has templates for servers and clients.

    Here's what a call looks like in PHP:

    $messages = array();
    $entry = new LogEntry;
    $entry->category = "buckettest";
    $entry->message = "something very interesting happened";
    $messages []= $entry;
    $result = $conn->Log($messages);


    Pretty simple. Usually in C++, for example, there's an elaborate set of macros for logging that provide sophisticated control of log generation. It might look something like:

    MSG(msg) - a simple message. It only prints out msg. None of the other information is printed out.
    NOTE(const char* name, const char* reason, const char* what, Module* module, msg) - something to take note of.
    WARN(const char* name, const char* reason, const char* what, Module* module, msg) - a warning.
    ERR(const char* name, const char* reason, const char* what, Module* module, msg) - an error occurred.
    CRIT(const char* name, const char* reason, const char* what, Module* module, msg) - a critical error occurred.
    EMERG(const char* name, const char* reason, const char* what, Module* module, msg) - an emergency occurred.


    There's lots more to handle streams and behind-the-scenes things like time stamps, thread ids, function names, and line numbers. Scribe has wisely not done any of that. It has an RPC-like interface to send a list of messages and that's it. It's up to you to write the wrappers.
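
A wrapper of the kind left to you might look like the sketch below. Everything in it is an assumption made for illustration: `LogWrapper`, its method names, and the message layout are invented here, and the `send` callable stands in for whatever code actually makes the Thrift Log() call. The point is only that the wrapper adds the time stamp, level, and thread id, then hands Scribe a plain category and message.

```python
import threading
import time
from typing import Callable, List, Tuple

Entry = Tuple[str, str]  # (category, message), the only fields Scribe carries


class LogWrapper:
    """Hypothetical wrapper that decorates messages before sending them."""

    def __init__(self, category: str, send: Callable[[List[Entry]], None]):
        self.category = category
        self.send = send  # stand-in for the real Thrift Log() call

    def _format(self, level: str, text: str) -> str:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime())
        return f"{stamp} {level} thread={threading.get_ident()} {text}"

    def _log(self, level: str, text: str) -> None:
        self.send([(self.category, self._format(level, text))])

    def note(self, text: str) -> None:
        self._log("NOTE", text)

    def warn(self, text: str) -> None:
        self._log("WARN", text)

    def err(self, text: str) -> None:
        self._log("ERR", text)


# Usage: collect entries in a list instead of sending them over the network.
sent: List[Entry] = []
log = LogWrapper("buckettest", sent.extend)
log.warn("something very interesting happened")
print(sent[0])
```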

    You'll no doubt have noticed Scribe only logs a category and message, both strings:

    Scribe is unique in that clients log entries consisting of two strings, a category and a message. The category is a high level description of the intended destination of the message and can have a specific configuration in the scribe server, which allows data stores to be moved by changing the scribe configuration instead of client code. The server also allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path. Flexibility and extensibility is provided through the "store" abstraction. Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server. Stores are implemented as a class hierarchy, and stores can contain other stores. This allows a user to chain features together in different orders and combinations by changing only the configuration.
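
The category lookup described above can be sketched roughly as follows. This is a guess at the behavior, not Scribe's code: the function name, the trailing `*` convention for prefixes, the `{category}` placeholder, the paths, and the precedence order (exact match, then longest prefix, then default) are all assumptions made for the example.

```python
from typing import Dict, Optional


def resolve_store(category: str, stores: Dict[str, str]) -> Optional[str]:
    """Pick a destination for a category: exact match, then prefix, then default."""
    template = stores.get(category)
    if template is None:
        # Prefix entries end in "*"; prefer the longest one that matches.
        prefixes = [p for p in stores
                    if p.endswith("*") and category.startswith(p[:-1])]
        if prefixes:
            template = stores[max(prefixes, key=len)]
        else:
            template = stores.get("default")
    if template is None:
        return None
    # The default entry can insert the category name into the file path.
    return template.replace("{category}", category)


# Hypothetical configuration, expressed as a dictionary for brevity.
stores = {
    "buckettest": "/logs/buckettest",
    "web*": "/logs/web",
    "default": "/logs/{category}",
}
for name in ("buckettest", "web_access", "payments"):
    print(name, "->", resolve_store(name, stores))
```

Because the mapping lives in configuration, moving a category to a different destination means editing the dictionary, not the clients that log to it.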

    Distribution System

    The payload has whatever structure you give it. Scribe is policy neutral and doesn't push a logging model on you.

    The configuration file looks something like this:

    # BUCKETIZER TEST
    <store>
    category=buckettest
    type=buffer
    target_write_size=20480
    max_write_interval=1
    buffer_send_rate=2
    retry_interval=30
    retry_interval_range=10
    <primary>
    type=bucket
    num_buckets=6
    bucket_subdir=bucket
    bucket_type=key_hash
    delimiter=1
    <bucket>
    type=file
    fs_type=std
    file_path=/tmp/scribetest
    base_filename=buckettest
    max_size=1000000
    rotate_period=hourly
    rotate_hour=0
    rotate_minute=30
    write_meta=yes
    </bucket>
    </primary>
    <secondary>
    type=file
    fs_type=std
    file_path=/tmp
    base_filename=buckettest
    max_size=30000
    </secondary>
    </store>
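
To give a feel for what a key_hash bucket store with a delimiter does, here is a rough sketch. It is an assumption-heavy illustration, not Scribe's algorithm: the function name is invented, `zlib.crc32` is swapped in as a convenient stable hash, and the conventions that the key is the part of the message before the delimiter and that bucket 0 catches messages with no delimiter are my reading of the configuration, labeled here as assumptions.

```python
import zlib


def pick_bucket(message: bytes, num_buckets: int = 6,
                delimiter: bytes = b"\x01") -> int:
    """Choose a bucket from the key that precedes the delimiter.

    Assumptions: buckets are numbered 1..num_buckets, and bucket 0 is used
    for messages that contain no delimiter and so have no key to hash.
    """
    key, found, _payload = message.partition(delimiter)
    if not found:
        return 0
    # crc32 is a stand-in hash; the same key always maps to the same bucket.
    return zlib.crc32(key) % num_buckets + 1


print(pick_bucket(b"user42\x01clicked something"))
print(pick_bucket(b"no delimiter here"))
```

Messages that share a key land in the same bucket, which is what lets each bucket's file be processed independently.
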
    The types of stores currently available are:
  • file - writes to a file, either local or nfs.
  • network - sends messages to another scribe server.
  • buffer - contains a primary and a secondary store. Messages are sent to the primary store if possible, and otherwise the secondary. When the primary store becomes available the messages are read from the secondary store and sent to the primary.
  • bucket - contains a large number of other stores, and decides which messages to send to which stores based on a hash.
  • null - discards all messages.
  • thriftfile - similar to a file store but writes messages into a Thrift TFileTransport file.
  • multi - a store that forwards messages to multiple stores.
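
The buffer store's primary and secondary behavior can be sketched as below. The classes are toys written for this example, not Scribe's store hierarchy: `ListStore` keeps messages in memory and has an `available` switch to simulate an outage, and `BufferStore` shows only the failover and replay logic described in the list above.

```python
class ListStore:
    """Toy store that keeps messages in memory and can be switched off."""

    def __init__(self):
        self.messages = []
        self.available = True

    def write(self, message):
        if not self.available:
            return False
        self.messages.append(message)
        return True


class BufferStore:
    """Toy buffer store: prefer the primary, fall back to the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def flush(self):
        # Replay the backlog, oldest first, for as long as the primary accepts it.
        while self.secondary.messages:
            if not self.primary.write(self.secondary.messages[0]):
                return
            self.secondary.messages.pop(0)

    def write(self, message):
        self.flush()
        # If a backlog remains, queue behind it so arrival order is preserved.
        if self.secondary.messages or not self.primary.write(message):
            return self.secondary.write(message)
        return True
```

In this sketch ordering is preserved by never writing to the primary while a backlog exists; whether Scribe makes the same choice is not something the text above states.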

    Certainly a flexible and useful set of logging capabilities. You can build a hierarchy of log servers to do pretty much anything you want. You could imagine having a log server on each machine, with a file store to handle upstream server failures. This log server forwards messages on to a centralized server for a datacenter. And all the datacenter servers forward their logs on to the centralized data warehouse. To scale, adjust fan-in and fan-out as necessary.

    Do Something Usefullizer

    You may not have over 1 million log messages a second to process, but you are likely to have your own tanker truck full of log messages. How do you do something useful with them?
  • Log messages stored in log files are next to useless. Grep'ing on a terabyte of logs to answer simple questions about your data just doesn't work.
  • You may have a sharded data warehouse you can pump log messages into and do a reasonably effective job of querying.
  • Or you can set up a Hadoop/HDFS-style system. The idea here is you need a distributed file system to handle the continual stream of log messages. And once you have all the data stored safely away you'll need to use map-reduce to do anything with such a large amount of data.

    If you want to ask, for example, how many of your users are from Asia, log files won't work. It's likely your data warehouse can't handle it either. Hadoop/HDFS is a practical option.

    If that's the direction you are going, what does it imply about your log system? I would say it makes even the simple category-payload system of Scribe overkill. The goal with a scalable backend is to move log payloads from applications to the centralized store as quickly as possible. By definition the central store can handle the load, so there's no reason to use intermediate servers to scale. Write directly from the application to the central store, even from multiple datacenters. The payload structure is unimportant until it hits the central store. If the application can't reach the central store then it queues into the file system until it can. Ideally log messages never hit the file system until HDFS is writing them to their final destination. This makes for low-latency, high-throughput logging and is even simpler than Scribe.

    If you don't have a scalable central store then Scribe is a good option. It gives you all the flexibility you need to compose your logging system in a way that is mostly reliable and scalable.
    Reader Comments (12)

    "Log messages stored in log files are next to useless. Grep'ing on a terabyte of logs to answer simple questions about your data just doesn't work."

    One nice thing about identifying log messages with a category is that they can be logged to separate files based on that category. This way it's easy to grep out of a relatively small log file when you want to do an ad-hoc query.

    November 29, 1990 | Unregistered Commenter Adam Hupp

    Actually I didn't remove anything. Something weird happened to the comments. Many are just gone and the time shifted a week on others. Investigating. As long as it's not spam I let everything through.

    November 29, 1990 | Unregistered Commenter Todd Hoff

    When you have terabytes of logging data you use Facebook's Hive

    > Hive was developed iteratively by a 2 or 3 person team (I think Jeff Hammerbacher was also involved) making it easy for business analysts to ask ad hoc questions of terabytes worth of logfile data by abstracting MapReduce into a SQL like dialect.

    http://www.brokenbuild.com/blog/2008/03/27/hadoop-summit-facebook-creates-business-intelligence-tool-called-hive/

    November 29, 1990 | Unregistered Commenter Wes Maldonado

    Fair enough. But since there's no longer any comment pointing out the error (including the comment you replied to):

    10 Billion a day is *NOT* 1 million per second. The article should be corrected.

    November 29, 1990 | Unregistered Commenter Anonymous

    1 Billion per day = (10000000000/(24*60*60))=115 740 per second
    *NOT* 1 million per second!

    November 29, 1990 | Unregistered Commenter Math-Doctor

    Well, he said 10's of billions, so let's assume 10 billion, multiply by 10 and you have roughly 1M (1 157 400 since you're so precise) per second, *NOT* 115 740 per second.

    P.S. have a nice day.

    November 29, 1990 | Unregistered Commenter squiggles

    The parent comment actually calculated with 10 billion.

    PLEASE do the VERY simple math yourself before correcting anyone.

    The fact that this article has STILL not been corrected, and that there was just a whole batch of "oh my god I love Sun" articles today I'm considering unsubscribing the RSS.

    10*10^9/(60*60*24) = 115740

    November 29, 1990 | Unregistered Commenter Anonymous

    Why another logging software? Why not an already existing solution, like
    rsyslog, or syslog-ng? Or an appliance, like SSB? Why should we always reinvent the wheel?

    November 29, 1990 | Unregistered Commenter talien

    Nice post, thanks. This bit kills me:
    "Thank God there's another IDL syntax. We simply did not have enough of them."
    Ha, ha, ha, my thoughts exactly... :-D

    But I disagree with your assessment of scribe's usability. We are investigating using scribe in our system. You can only write big files to HDFS. To us each of the messages from multiple nodes is very important so we need to aggregate them somewhere safely (distributed) and upload them to HDFS in batches. From what I understand Scribe is ideal for this.

    Sure, you can make your own system that is distributed, memory-only, scalable, uploads in batches, does backups and everything, but - do you really think you can do better than scribe?

    November 29, 1990 | Unregistered Commenter Kit

    Application for which i am using scribe logs around 20-30K messages in a
    minute. Can you propose configuration setting for these many messages.
    1 message is of around 1000-1500 bytes size.

    April 13, 2010 | Unregistered Commenter shaklendra

    Very badly written article. Does not even present anything concretely and just all over the place. I was asked to look at Scribe so I stumbled here. Please do not waste time on it.

    June 21, 2012 | Unregistered Commenter Ashutosh Singh

    So log files are great and everything, especially for our privacy, how long does Facebook store / save these log files for though? I mean it's impossible to find out anything about actual time data with them. Any info on this?

    August 8, 2012 | Unregistered Commenter CuriousCat
