Some Facebook Secrets to Better Operations

Kim Nash in an interview with Jonathan Heiliger, Facebook VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently as it is usually an ontogeny recapitulates phylogeny situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?

  • Frequent Releases. A major release once a week and a minor releases every few days.
  • Create a Cyber Liability Group. At one time operations was distributed amongst several groups. A permanent operations group was created to isolate problems and revert problem software components back to previously known good states. The ability of a separate team to handle rollbacks speaks to a great deal of standardization and advanced tool building.
  • Distribute Team Across Time Zones. Split the operations team across different time zones so no one has to work the graveyard shift. Facebook has 20 people in their team located in Palo Alto, California and London, England.
  • Be Innovative, Not Safe. Fear of failure often shuts down the organizational brain and makes it hide behind excessive rules and regulations. A technology company should have a bias towards action and innovation. Release software. Don't stifle genius. Rely on your tools and processes to recover from problems.
  • Expect Problems. Software pushed to production will have problems. Expect problems, but don't let that stop you from innovating.
  • Roll Backward. When a problem is detected in a release the changes can either be rolled forward or backward. Rolling back is going to a previously good release. Rolling forward is fixing problems in the new release rather than rolling back. Bugs in production are fixed in production. Roll forward ends up being covered in the press, so prefer roll backs over roll forwards.
  • Roll Out Massive Changes Slowly. Turn on features gradually, for a few percent of users at a time. Use the slow rollout to fix problems that can only be found under real user conditions. This approach give operations and development a lot of confidence in changes.
  • Encourage Openness and Information Sharing. Design reviews, PR strategy, which servers to buy, etc are often open for informal debate among employees. Facebook has created an Ideas system where employees can create an Idea by category. There's a discussion tool for discussing the idea and a rating system for rating the idea. Tools are built on-top of Facebook platform so they are available to everyone.
  • Live-blog Key Events. Large company meetings, monthly presentations and weekly Q&As with the management team are transcribed live.

    It sounds like a relatively fun environment for pushing software live. Getting software moved into production is often harder than the original coding and testing. Now I know what you are thinking. You somehow managed to procure the ssh login. So just login remotely and do the install yourself! Nobody will know. Oh so tempting. But it's not really good corporate citizenship. And you just might screw up, then there will be some esplaining to do.

    Emphasing frequent releases and gutsy release policies makes it actually seem like someone is supporting developers instead of treating them like their software carries the plague. Data centers are often treated like quarantine stations and developers are treated like asymptomatic carriers of some unknown virulent disease. To be safe nothing should ever change, but that's not an attitude that makes things better. Nice to see that recognized.

    To setup or not to setup a separate operations group? Facebook says "to be" and creates a seperate group. Amazon says "not to be" and has developers support their own software. Secretly I think Amazon gets better results by requiring developers to support their own software. Knowing it may be you getting the "It's Down!" call gives one proper perspective. But I like not being on call and I think most developers agree. Plus the idea "following the sun" to get 24 hour support is a smart idea.

    Related Articles

  • HighScalability Operations
  • On Designing and Deploying Internet-Scale Services
  • Amazon Architecture
  • Monday

    A Scalability checklist?

    Hi everyone, I'm researching on Scalability for a college paper, and found this site great, but it has too many tips, articles and the like, but I can't see a hierarchical organization of subjects, I would need something like a checklist of things or fields, or technologies to take into account when assesing scalability. So far I've identified these: - Hardware scalability: - scale out - scale up - Cache What types of cache are there? app-level, os-level, network-level, I/O-level? - Load Balancing - DB Clustering Am I missing something important? (I'm sure I am) I don't expect you to give a lecture here, but maybe point some things out, give me some useful links... Thanks!

    Click to read more ...


    Paper: GargantuanComputing—GRIDs and P2P

    I found the discussion of the available bandwidth of tree vs higher dimensional virtual networks topologies quite, to quote Spock, fascinating: A mathematical analysis by Ritter (2002) (one of the original developers of Napster) presented a detailed numerical argument demonstrating that the Gnutella network could not scale to the capacity of its competitor, the Napster network. Essentially, that model showed that the Gnutella network is severely bandwidth-limited long before the P2P population reaches a million peers. In each of these previous studies, the conclusions have overlooked the intrinsic bandwidth limits of the underlying topology in the Gnutella network: a Cayley tree (Rains and Sloane 1999) (see Sect. 9.4 for the definition). Trees are known to have lower aggregate bandwidth than higher dimensional topologies, e.g., hypercubes and hypertori. Studies of interconnection topologies in the literature have tended to focus on hardware implementations (see, e.g., Culler et al. 1996; Buyya 1999), which are generally limited by the cost of the chips and wires to a few thousand nodes. P2P networks, on the other hand, are intended to support from hundreds of thousands to millions of simultaneous peers, and since they are implemented in software, hyper-topologies are relatively unfettered by the economics of hardware. In this chapter, we analyze the scalability of several alternative topologies and compare their throughput up to 2–3 million peers. The virtual hypercube and the virtual hypertorus offer near-linear scalable bandwidth subject to the number of peer TCP/IP connections that can be simultaneously kept open.

    Click to read more ...


    Product: ScaleOut StateServer is Memcached on Steroids

    ScaleOut StateServer is an in-memory distributed cache across a server farm or compute grid. Unlike middleware vendors, StateServer is aims at being a very good data cache, it doesn't try to handle job scheduling as well. StateServer is what you might get when you take Memcached and merge in all the value added distributed caching features you've ever dreamed of. True, Memcached is free and ScaleOut StateServer is very far from free, but for those looking a for a satisfying out-of-the-box experience, StateServer may be just the caching solution you are looking for. Yes, "solution" is one of those "oh my God I'm going to pay through the nose" indicator words, but it really applies here. Memcached is a framework whereas StateServer has already prepackaged most features you would need to add through your own programming efforts. Why use a distributed cache? Because it combines the holly quadrinity of computing: better performance, linear scalability, high availability, and fast application development. Performance is better because data is accessed from memory instead of through a database to a disk. Scalability is linear because as more servers are added data is transparently load balanced across the servers so there is an automated in-memory sharding. Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure. Application development is faster because there's only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer which means all a developer has to do is get and put data. StateServer follows the RAM is the new disk credo. Memory is assumed to be the system of record, not the database. If you want data to be stored in a database and have the two kept in sync, then you'll have to add that layer yourself. All the standard memcached techniques should work as well for StateServer. Consider however that a database layer may not be needed. Reliability is handled by StateServer because it keeps multiple data copies, reroutes on failure, and has an option for geographical distribution for another layer of added safety. Storing to disk wouldn't make you any safer. Via email I asked them a few questions. The key question was how they stacked up against Memcached? As that is surely one of the more popular challenges they would get in any sales cycle, I was very curious about their answer. And they did a great job differentiation themselves. What did they say? First, for an in-depth discussion of their technology take a look ScaleOut Software Technology, but here a few of the highlights:

  • Platforms: .Net, Linux, Solaris
  • Languages: .Net, Java and C/C++
  • Transparent Services: server farm membership, object placement, scaling, recovery, creating and managing replicas, and handling synchronization on object access.
  • Performance: Scales with measured linear throughput gain to farms with 64 servers. StateServer was subjected to maximum access load in tests that ramped from 2 to 64 servers, with more than 2.5 gigabytes of cached data and a sustained throughput of over 92,000 accesses per second using a 20 Mbits/second Infiniband network. StateServer provided linear throughput increases at each stage of the test as servers and load were added.
  • Data cache only. Doesn't try to become middleware layer for executing jobs. Also will not sync to your database.
  • Local Cache View. Objects are cached on the servers where they were most recently accessed. Application developers can view the distributed cache as if it were a local cache which is accessed by the customary add, retrieve, update, and remove operations on cached objects. Object locking for synchronization across threads and servers is built into these operations and occurs automatically.
  • Automatic Sharding and Load Balancing. Automatically partitions all of distributed cache's stored objects across the farm and simultaneously processes access requests on all servers. As servers are added to the farm, StateServer automatically repartitions and rebalances the storage workload to scale throughput. Likewise, if servers are removed, ScaleOut StateServer coalesces stored objects on the surviving servers and rebalances the storage workload as necessary.
  • High Availability. All cached objects are replication on up to two additional servers. If a server goes offline or loses network connectivity, ScaleOut StateServer retrieves its objects from replicas stored on other servers in the farm, and it creates new replicas to maintain redundant storage as part of its "self-healing" process. Uses a quorum-based updating scheme.
  • Flexible Expiration Policies. Optional object expiration after sliding or fixed timeouts, LRU memory reclamation, or object dependency changes. Asynchronous events are also available to signal object expiration.
  • Geographical Scaleout. Has the ability to automatically replicate to a remote cache using the ScaleOut GeoServer option.
  • Parallel Query. Perform fully parallel queries on cached objects. Developers can attach metadata or "tags" to cached objects and query the cache for all matching objects. ScaleOut StateServer performs queries in parallel across all caching servers and employs patent-pending technology to ensure that query operations are both highly available and scalable. This is really cool technology that really leverages the advantage of in-memory databases. Sharding means you have a scalable system then can execute complex queries in parallel without you doing all the work you would normally do in a sharded system. And you don't have to resort to the complicated logics need for SimpleDB and BigTable type systems. Very nice.
  • Pricing: - Development Edition: No Charge - Professional Edition: $1,895 for 2 servers - Data Center Edition: $71,995 for 64 servers - GeoServer Option First two data centers $14,995, Each add'l data center $7,495. - Support: 25% of software license fee Some potential negatives about ScaleOut StateServer:
  • I couldn't find a developer forum. There may be one, but it eluded me. One thing I always look for is a vibrant developer community and I didn't see one. So if you have problems or want to talk about different ways of doing things, you are on your own.
  • The sales group wasn't responsive. I sent them an email with a question and they never responded. That always makes me wonder how I'll be treated once I've put money down.
  • The lack of developer talk made it hard for me to find negatives about the product itself, so I can't evaluate its quality in production. In the next section the headings are my questions and the responses are from ScaleOut Software.

    Why use ScaleOut StateServer instead of Memcached?

    I've [Dan McMillan, VP Sales] included some data points below based on our current understanding of the Memcached product. We don't use and haven't tested Memcached internally, so this comparison is based in part upon our own investigations and in part what we are hearing from our own customers during their evaluation and comparisons. We are aware that Memcached is successfully being used on many large, high volume sites. We believe strong demand for ScaleOut is being driven by companies that need a ready-to-deploy solution that provides advanced features and just works. We also hear that Memcached is often seen as a low cost solution in the beginning, but development and ongoing management costs sometimes far exceed our licensing fees. What sets ScaleOut apart from Memcached (and other competing solutions) is that ScaleOut was architected from the ground up to be a fully integrated and automated caching solution. ScaleOut offers both scalability and high availability, where our competitors typically provide only one or the other. ScaleOut is considered a full-featured, plug-n-play caching solution at a very reasonable price point, whereas we view Memcached as a framework in which to build your own caching solution. Much of the cost in choosing Memcached will be in development and ongoing management. ScaleOut works right out of the box. I asked ScaleOut Software founder and chief architect, Bill Bain for his thoughts on this. He is a long-time distributed caching and parallel computing expert and is the architect of ScaleOut StateServer. He had several interesting points to share about creating a distributed cache by using an open source (i.e. build it yourself) solution versus ScaleOut StateServer. First, he estimates that it would take considerable time and effort for engineers to create a distributed cache that has ScaleOut StateServer's fundamental capabilities. The primary reason is that the open source method only gives you a starting point, but it does not include most capabilities that are needed in a distributed cache. In fact, there is no built-in scalability or availability, the two principal benefits of a distributed cache. Here is some of the functionality that you would have to build:
  • Scalable storage and throughput. You need to create a means of storing objects across the servers in the farm in a way that will scale as servers are added, such as creating and managing partitions. Dynamic load balancing of objects is needed to avoid hot spots, and to our knowledge this is not provided in memcached.
  • High availability. To ensure that objects are available in the case of a server failure, you need to create replicas and have a means of automatically retrieving them in case a server fails. Also, just knowing that a server has failed requires you to develop a scalable heart-beating mechanism that spans all servers and maintains a global membership. Replicas have to be atomically updated to maintain the coherency of the stored data.
  • Global object naming. The storage, load-balancing, and high availability mechanisms need to make use of efficient, global object naming and lookup so that any client can access any object in the distributed cache, even after load-balancing or recovery actions.
  • Distributed locking. You need distributed locking to coordinate accesses by different clients so that there are not conflicts or synchronization issues as objects are read, updated and deleted. Distributed locks have to automatically recover in case of server failures.
  • Object timeouts. You also will need to build the capability for the cache to handle object timeouts (absolute and sliding) and to make these timeouts highly available.
  • Eventing. If you want your application to be able to catch asynchronous events such as timeouts, you will need a mechanism to deliver events to clients, and this mechanism should be both scalable and highly available.
  • Local caching. You need the ability to internally cache deserialized data on the clients to keep response times fast and avoid deserialization overhead on repeated reads. These local caches need to be kept coherent with the distributed cache.
  • Management. You need a means to manage all of the servers in the distributed cache and to collect performance data. There is no built-in management capability in memcached, and this requires a major development effort.
  • Remote client support. ScaleOut currently offers both a standard configuration (installed as a Windows service on each web server) and a remote client configuration (Installed on a dedicated cache farm).
  • ASP.Net/Java interoperability. Our Java/Linux release will offer true ASP.Net/Java interop, allowing you to share objects and manage sessions across platforms. Note: we just posted our "preview" release last week.
  • Indexed query functionality. Our forthcoming ScaleOut 4.0 release will contain this feature, which allows you to query the store to return objects based on metadata.
  • Multiple data center support. With our GeoServer product, you can automatically replicate cached information to up to 8 remote data centers. This provides a powerful solution for disaster recovery, or even "active-active" configurations. GeoServer's replication is both scalable and high available. In addition to the above, we hope that the fact ScaleOut Software provides a commercial solution that is reasonably priced, supported and constantly improved would be viewed as an important plus for our customers. In many cases, in-house and open source solutions are not supported or improved once the original developer is gone or is assigned to other priorities.

    Do you find yourself in competition with the likes of Terracotta, GridGain, GridSpaces, and Coherence type products?

    Our ScaleOut technology has previously been targeted to the ASP.Net space. Now that we are entering the Java/Linux space, we will be competing with companies like the ones you mentioned above, which are mainly Java/Linux focused as well. We initially got our start with distributed caching for ecommerce applications, but grid computing seems to be a strong growth area for us as well. We are now working with some large Wall Street firms on grid computing projects that involve some (very large) grid operations. I would like to reiterate that we are very focused on data caching only. We don't try to do job scheduling or other grid computing tasks, but we do improve performance and availability for those tasks via our distributed data cache.

    What architectures your customers are using with your GeoServer product?

    A. GeoServer is a newer, add-on product that is designed to replicate the contents of two or more geographically separated ScaleOut object stores (caches). Typically a customer might use GeoServer to replicate object data between a primary data center site and a DR site. GeoServer facilitates continuous (async.) replication between sites, so if site A goes offline, the other site B is immediately available to handle the workload. Our ScaleOut technology offers 3 primary benefits: Scalability, performance & high availability. From a single web farm perspective, ScaleOut provides high availability by making either 1 or 2 (this is configurable) replica copies of each master object and storing the replica on an alternate host server in the farm. ScaleOut provides uniform access to the object from any server, and protects the object in the case of a server failure. With GeoServer, these benefits are extended across multiple sites. It is true that distributed caches typically hold temporary, fast-changing data, but that data can still be very critical to ecommerce, or grid computing applications. Loss of this data during a server failure, worker process recycle or even a grid computation process is unacceptable. We improve performance by keeping the data in-memory, while still maintaining high availability.

    Related Articles

  • RAM is the new disk
  • Latency is Everywhere and it Costs You Sales - How to Crush it
  • A Bunch of Great Strategies for Using Memcached and MySQL Better Together
  • Paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
  • Google's Paxos Made Live – An Engineering Perspective
  • Industry Chat with Bill Bain and Marc Jacobs - Joe Rubino interviews William L. Bain, Founder & CEO of ScaleOut Software and Marc Jacobs, Director at Lab49, on distributed caches and their use within Financial Services.

    Click to read more ...

  • Wednesday

    Updating distributed web applications

    Hi, we've got a web application, which runs without the common standalone application servers like tomcat or jboss, rather it runs with an embedded jetty server. Now we are planing to run instances of this application on multiple machines, with a load balancer serving the requests. The big question is: is there a common scenario on how to update these applications? Lets think of 10 instances on 10 machines (one instance per machine), where we want to update each of these applications version. The brute force approach would be, to stop all instances, update and then restart it. This is a lot of manual work ;) Another problem is down-time: so someone must only shutdown one server after another, but then there are multiple application versions around. Can someone please provide us with a hint for this problem? Perhaps papers, tools or something like that? Thanks a lot :)

    Click to read more ...


    A Scalable, Commodity Data Center Network Architecture

    Looks interesting... Abstract: Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Nonuniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today’s higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.

    Click to read more ...


    Forum sort order

    G'day, I noticed the default sort order for the forum is to show the posts with the most replies first. That seems a bit odd for a forum. Would it not make sense to show the posts with the most recently replies first? It is possible to re-sort the forum threads that way by clicking on the "Last post" header (twice). It would seem like a more sensible default. I've checked and I see the same behaviour as both a registered (logged in) and anonymous user. Cheers - Callum.

    Click to read more ...


    Code deployment tools

    G'day, I'm building an application to manage WordPress PHP code on many servers. Our application will push down code updates to each server, as well as performing backups and testing. I'm considering different methods of pushing updated code onto the individual servers. I'm considering something like Capistrano (I've no experience in Ruby though). I've also considered using subversion and then remotely calling svn commands via SSH. Are there any other tools specifically for this purpose? The servers will have persistent data (the WordPress databases) so I don't want to re-image them every update. Plus, they will each have a different set of plugins / themes, so building many images would be too complex. If there are any papers on code deployment, or other recommended reading, please point the links my way. Likewise, if anyone has any suggestions, or would like more details, just let me know. Cheers - Callum.

    Click to read more ...


    Wuala - P2P Online Storage Cloud

    How do you design a reliable distributed file system when the expected availability of the individual nodes are only ~1/5? That is the case for P2P systems. Dominik Grolimund, the founder of a Swiss startup Caleido will show you how! They have launched Wuala, the social online storage service which scales as new nodes join the P2P network. The goal of is to provide distributed online storage that is:

    • large
    • scalable
    • reliable
    • secure
    by harnessing the idle resources of participating computers. This challenge is an old dream of computer science. In fact as Andrew Tanenbaum wrote in 1995: "The design of a world-wide, fully transparent distributed filesystem fot simultaneous use by millions of mobile and frequently disconnected users is left as an exercise for the reader" After three years of research and development at at ETH Zurich, the Swiss Federal Institute of Technology on a distributed storage system, Caleido is ready to unveil the result: Wuala. Wuala is a new way of storing, sharing, and publishing files on the internet. It enables its users to trade parts of their local storage for online storage and it allows us to provide a better service for free. In this Google Tech Talk, Dominik will explain what Wuala is and how it works, and he will also show a demo. The availability problem is solved by redundancy (just like in Google File System). However simple replication techniques would result in too much overhead because of the low availability of the nodes. Instead Wuala employs erasure coding and splits the data into small pieces. Optimal erasure codes produce n/r fragments where any n fragments is sufficient to recover the original message. These pieces are then distributed in the P2P network providing good availability at a reasonable overhead. The P2P network consists of client, storage and routing nodes. The Wuala architecture uses a mix of regular and random graphs to optimize routing. Dominik also explains how Wuala architecture is designed to provide security and fairness. Wuala employs the 128 bit AES algorithm for encryption and the 2048 bit RSA algorithm for authentication. If you're interested in how Wuala manages encryption, have a look at their publication on Cryptree. They have also implemented distributed reputation audit and maintenance functions. Check out the Tech Talk! It is worth the time!

    Click to read more ...


    Many updates against MySQL

    Hello! My first post here, so be patient please. I am developing site where I have lots of static content. But on many pages I have query to update count of views. I would say this is may cause lots of problems and was interested in another solution like storing these counts somewhere else. As my knowledge is bit limited in this way, I am asking you. I can say I understand PHP(OOP ofc) and MySQL. Nowadays I am getting into servers. Other question I have is: I read about making lots of things static.(in Flickr Architecture) and am interested how they do static sites? Lets say they make photo page static? And rebuild when tagg or comment is added? I am bit interested in it as I want to learn Smarty better(newbie) and serving content. Moreover, how about PHP? I have read many books about PHP theoretically but would love to see some RL example of using objects and exceptions(mainly this as I don't completely understand it) to learn some good programming habits. So if you can help me with some example or resource, please do :) I know I've covered huge area of things but these are what makes me mad everyday. So please be patient :) Greetings.

    Click to read more ...