Monday
Dec012008

MySQL Database Scale-out and Replication for High Growth Businesses

It is widely recognized that MySQL is the most popular database software in the world. Since its inception in 1995, there have been 11 million product installations around the world in a wide variety of markets. There are more installations of MySQL in use today than any other database architecture. From startup companies hoping to be the next Web2.0 poster child to large global enterprises, the MySQL database architecture has proven to be flexible, extendable, scalable, and more than capable of filling high-capacity database roles in very different venues.

Click to read more ...

Monday
Dec012008

Sun Fire™ X4540 Server as Backup Server for Zmanda's Amanda Enterprise 2.6 Software

Sun Fire™ X4540 Server as Backup Server for Zmanda's Amanda Enterprise 2.6 Software, by Thomas Hanvey (Sun Microsystems), Dmitri Joukovski, and Ken Crandall (Zmanda), September 2008.

Explosive data growth, combined with demanding requirements for data availability, has placed a tremendous burden on IT operations staff at businesses of all sizes. Yet many organizations do not have the staff or budget to purchase and manage complex and expensive backup and recovery software products. The Sun Fire™ X4540 server can deliver massive storage capacity and remarkable throughput, so it is well suited as a nearline storage platform for backup and restore applications. Combining the power of the Solaris™ 10 Operating System with the data integrity and simplified administration of ZFS, the Sun Fire X4540 server can be an ideal candidate for streamlining and improving backup and restore operations. Amanda Enterprise Edition from Zmanda was designed to address these challenges, providing a backup and recovery solution that combines fast installation, simplified management, enterprise-class functionality, and low-cost subscription fees. As an open source product, Amanda Enterprise Edition uses only standard formats and tools, effectively freeing you from being locked into a vendor to recover your archived data. This guide discusses how to quickly configure the Sun Fire X4540 server as a backup server for Amanda Enterprise Edition software.

Click to read more ...

Monday
Dec012008

An Open Source Web Solution - Lighttpd Web Server and Chip Multithreading Technology  

With more users interacting, working, purchasing, and communicating over the network than ever before, Web 2.0 infrastructure is taking center stage in many organizations. Demand is rising, and companies are looking for ways to tackle the performance and scalability needs placed on Web infrastructure without raising IT operational expenses. Today companies are turning to efficient, high-performance, open source solutions as a way to decrease acquisition, licensing, and other ongoing costs and stay within budget constraints.

Click to read more ...

Sunday
Nov302008

Creating a high-performing online database 

Hi there, I have an idea for an online database that serves a large number of people. I've been studying it for a while and it seems feasible to me to create it and get people to populate it. It will need time to grow, but eventually it will get there. The model I'm looking at is IMDb; the depth of information is fascinating, and it's fast. It's not the easiest thing to use, but it's pretty usable! What do you think I need to create an online database like IMDb? I know that IMDb's power comes from its information, not the design of the site; this is something I kind of figured out. But what I need to know is the best tools to publish database contents on the web and retrieve them as quickly as IMDb does. I'm sure I will need to create data entry forms for my users to populate the database. What programming languages do you suggest? Development environment? Approaches? Your contribution is highly appreciated. Regards, Jalil

Click to read more ...

Monday
Nov242008

Scalability Perspectives #3: Marc Andreessen – Internet Platforms

Scalability Perspectives is a series of posts that highlights the ideas that will shape the next decade of IT architecture. Each post is dedicated to a thought leader of the information age and his vision of the future. Be warned though – the journey into the minds and perspectives of these people requires an open mind.

Marc Andreessen

Marc Andreessen is an internet pioneer, entrepreneur, investor, startup coach, blogger, and multi-millionaire software engineer, best known as co-author of Mosaic, the first widely used web browser, and founder of Netscape Communications Corporation. He was the chair of Opsware, a software company he founded originally as Loudcloud, when it was acquired by Hewlett-Packard. He is also a co-founder of Ning, a company which provides a platform for social-networking websites. He has recently joined the boards of directors of Facebook and eBay.

Marc is an investor in several startups including Digg, Metaplace, Plazes, Qik, and Twitter. His passion is to create new technologies, to start new companies, and to scale them up.

Internet Platforms Rule the Cloud

From Marc's Blog Post on the Three Kinds of Platforms:

One of the hottest of hot topics these days is the topic of Internet platforms, or platforms on the Internet. However, the concept of "platform" is also the focus of a swirling vortex of confusion -- lots of platform-related concepts, many of them highly technical, bleeding together; lots of people harboring various incompatible mental images of what's about to happen in our industry as a consequence of various platforms. I think this confusion is due in part to the term "platform" being overloaded and being used to mean many different things, and in part because there truly are a lot of moving parts at play that intersect in fascinating but complex ways.

Marc attempts to disentangle and examine the topic of "Internet platform" in detail. He has identified three distinct approaches to providing an Internet platform and shows us where each of the three approaches could go.

Internet Platforms Defined

A "platform" is a system that can be programmed and therefore customized by outside developers -- users -- and in that way, adapted to countless needs and niches that the platform's original developers could not have possibly contemplated, much less had time to accommodate.
...
The key term in the definition of platform is "programmed". If you can program it, then it's a platform. If you can't, then it's not.

The Internet gives rise to three new models of platform that are playing out in the Internet industry today. Marc calls these Internet platform models "levels", because as you go from Level 1 to Level 2 to Level 3, each kind of platform is harder to build, but much better for the developer. Further, each level typically supersets the levels below.

Level 1 Platform - "Access API"

This is the kind of Internet platform that is most common today. This is typically a platform provided in the form of a web services API -- which will typically be accessed using an access protocol such as REST or SOAP.

Architecturally, the key thing to understand about this kind of platform is that the developer's application code lives outside the platform -- the code executes somewhere else, on a server elsewhere on the Internet that is provided by the developer.

Examples: eBay, Paypal, Flickr, Delicious

  • The entire burden of building and running the application itself is left entirely to the developer
  • The easiest kind of Internet platform to create
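
To make the Level 1 model concrete, here is a minimal sketch of a developer's own server-side code calling a hypothetical REST-style access API. The endpoint, parameters, and response fields are invented for illustration; real APIs such as eBay's or Flickr's differ in detail. The architectural point is that this code runs on the developer's server, outside the platform.

    import json
    import urllib.request

    # Hypothetical Level 1 "access API": the platform only exposes data over HTTP.
    API_URL = "https://api.example-platform.com/v1/photos?user=alice&format=json"

    def fetch_photos():
        # The developer's code runs on the developer's own infrastructure and
        # simply pulls data from the platform over the network.
        with urllib.request.urlopen(API_URL) as response:
            return json.loads(response.read().decode("utf-8"))

    if __name__ == "__main__":
        for photo in fetch_photos().get("photos", []):
            print(photo.get("title"))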

Level 2 Platform - "Plug-In API"

This is the kind of platform approach that historically has been used in end-user applications to let developers build new functions that can be injected, or "plug in", to the core system and its user interface.

In the Internet realm, the first Level 2 platform that I'm aware of is the Facebook platform.

When you develop a Facebook app, you are not developing an app that simply draws on data or services from Facebook, as you would with a Level 1 platform. Instead, you are building an app that acts like a "plug-in" into Facebook -- your app literally shows up within the Facebook user experience, often as a box in the middle of a page that Facebook otherwise defines, such as a user profile page.

  • The third-party app itself lives outside the platform
  • The entire burden of building and running a Level 2 platform-based app is left entirely to the developer
  • Unlike a Level 1 platform where the burden of exposing the app to users is also placed on the developer, Level 2 Internet platforms -- as demonstrated by Facebook -- will be able to directly help their developers get users for their apps
  • Level 2 platforms are significantly harder to create than Level 1 platforms

Level 3 Platform - "Runtime Environment"

In a Level 3 platform, the huge difference is that the third-party application code actually runs inside the platform -- developer code is uploaded and runs online, inside the core system. For this reason, in casual conversation I refer to Level 3 platforms as "online platforms".

In addition, it is highly likely that a Level 3 platform will also superset Level 2 and Level 1 -- i.e., a Level 3 platform will typically also have some kind of plug-in API and some kind of access API.

Put in plain English? A Level 3 platform's developers upload their code into the platform itself, which is where that code runs. As a developer on a Level 3 platform, you don't need your own servers, your own storage, your own database, your own bandwidth, nothing... in fact, often, all you will really need is a browser. The platform itself handles everything required to run your application on your behalf.

Obviously this is a huge difference from Level 2. And this difference -- and what makes it possible -- is why I think Level 3 platforms are the future.

  • Level 3 platforms are much harder to build than Level 2 platforms.
  • The level of technical expertise required of someone to develop on your platform drops by at least 90%, and the level of money they need drops to $0
  • The Level 3 Internet platform approach is much more like the computer industry's typical platform (PC) model than Levels 2 or 1.

Who is building Level 3 Internet platforms?

Marc has built one - Ning has been built from the start to be a Level 3 platform.

Other Level 3 platforms include:

How will we see the new platforms of the future?

We are used to seeing platforms ship as products -- you buy and install a PC or a server and you build an app that runs on it, or equivalently you download and install an open source platform such as Perl or Ruby and you build an app that runs on it.

The platforms of the future won't be like that. The platforms of the future will be online services that you will tap into over the Internet, perhaps with nothing more running locally than a browser. They won't have anything you download, or even an SDK. They will look more like services than software. To paraphrase the Book of Matthew, "you will know them by their URLs".

Read Marc's full blog post for more details! Can you add more Level 3 Platforms?

Information Sources

Monday
Nov242008

Product: Scribe - Facebook's Scalable Logging System


In Log Everything All the Time I advocate applications shouldn't bother logging at all. Why waste all that time and code? No, wait, that's not right. I preach logging everything all the time. Doh. Facebook obviously feels similarly, which is why they open sourced Scribe, their internal logging system, capable of logging tens of billions of messages per day. These messages include access logs, performance statistics, actions that went to News Feed, and many others.

Imagine hundreds of thousands of machines across many geographically dispersed datacenters just aching to send their precious log payload to the central repository of all knowledge. Because really, when you combine all the metadata with all the events you pretty much have a complete picture of your operations. Once in the central repository, logs can be scanned, indexed, summarized, aggregated, refactored, diced, data cubed, and mined for every scrap of potentially useful information.

Just imagine the log stream from all of Facebook's Apache servers alone. Brutal. My guess is these are not real-time feeds so there are no streaming query issues, but the task is still daunting. Let's say they log 10 billion messages a day. That works out to more than 100,000 messages per second on average!

When no off-the-shelf product worked for them, they built their own. Scribe can be downloaded from SourceForge. But the real action is on their wiki. It's there you'll find some decent documentation and their support forums. Not much activity on the site, so you haven't missed your chance to be a charter member of the Scribe guild.

A logging system has three broad components:

  • Client Code Interface - How does your code interact with the log system? Scribe doesn't do much for you here. There's a simple Thrift interface for logging from a large set of languages, but the bulk of the work is still up to you.
  • Distribution System - This is where Scribe fits. It reliably (mostly) moves large numbers of messages around. A few error cases lead to data loss: 1) If a client can't connect to either the local or central scribe server the message will be lost; 2) If a scribe server crashes it could lose a small amount of data that's in memory but not on disk; 3) Some multiple-component failure cases, such as a resender that can't connect to any central server while its local disk fills up; 4) Some rare timeout conditions can lead to duplicate messages.
  • Do Something Usefullizer - How do you do anything useful with hundreds of thousands of messages per second? Good question. Scribe doesn't help here. But Scribe will get your data there.

    I browsed around the source and it's a well crafted, straightforward socket server that forwards messages to other servers and can write messages to disk. Nothing fancy, which is why it probably works for them. Its basic function is:

    Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn't available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination, typically on an nfs filer or a distributed file system, or send them to another layer of scribe servers.
    In some ways it could be fancier. For example, there's no throttle on incoming connections, so a server can chew up memory. And there is a max_msg_per_second throttle on message processing, but this is really too simple. Throttling needs to be adaptive based on local conditions and the conditions of downstream servers. Under load you want to push flow control back to the client so the data stays there until resources become available. Simple configuration file settings rarely work when the world starts getting weird.

    Client Code Interface

    Here's what the Thrift interface looks like:

    enum ResultCode
    {
      OK,
      TRY_LATER
    }

    struct LogEntry
    {
      1: string category,
      2: string message
    }

    service scribe extends fb303.FacebookService
    {
      ResultCode Log(1: list<LogEntry> messages);
    }
    I know, I thought the same thing. Thank God there's another IDL syntax. We simply did not have enough of them. Thrift translates this IDL into the glue code necessary for making cross-language calls (marshalling arguments and responses over the wire). The Thrift library also has templates for servers and clients.

    Here's what a call looks like in PHP:

    $messages = array();
    $entry = new LogEntry;
    $entry->category = "buckettest";
    $entry->message = "something very interesting happened";
    $messages []= $entry;
    $result = $conn->Log($messages);


    Pretty simple. Usually in C++, for example, there's an elaborate set of macros for logging that provide sophisticated control of log generation. It might look something like:

    MSG(msg) - a simple message. It only prints out msg. None of the other information is printed out.
    NOTE(const char* name, const char* reason, const char* what, Module* module, msg) - something to take note of.
    WARN(const char* name, const char* reason, const char* what, Module* module, msg) - a warning.
    ERR(const char* name, const char* reason, const char* what, Module* module, msg) - an error occurred.
    CRIT(const char* name, const char* reason, const char* what, Module* module, msg) - a critical error occurred.
    EMERG(const char* name, const char* reason, const char* what, Module* module, msg) - an emergency occurred.


    There's lots more to handle streams and behind-the-scenes things like time stamps, thread ids, function names, and line numbers. Scribe has wisely not done any of that. It has an RPC-like interface to send a list of messages and that's it. It's up to you to write the wrappers.
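
    For comparison, here is roughly what the same call looks like from Python. This is a hedged sketch that assumes Thrift-generated Python bindings for scribe (and its fb303 dependency) are on your path; the host, port (1463 is the conventional Scribe port), and category are assumptions for illustration.

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from scribe import scribe   # Thrift-generated Scribe bindings (assumed available)

    # Scribe conventionally uses a framed transport and a non-strict binary protocol.
    socket = TSocket.TSocket(host="localhost", port=1463)
    transport = TTransport.TFramedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(transport, strictRead=False, strictWrite=False)
    client = scribe.Client(iprot=protocol, oprot=protocol)

    transport.open()
    entry = scribe.LogEntry(category="buckettest", message="something very interesting happened")
    result = client.Log(messages=[entry])   # returns ResultCode.OK or ResultCode.TRY_LATER
    transport.close()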

    You'll no doubt have noticed Scribe only logs a category and message, both strings:

    Scribe is unique in that clients log entries consisting of two strings, a category and a message. The category is a high level description of the intended destination of the message and can have a specific configuration in the scribe server, which allows data stores to be moved by changing the scribe configuration instead of client code. The server also allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path. Flexibility and extensibility is provided through the "store" abstraction. Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server. Stores are implemented as a class hierarchy, and stores can contain other stores. This allows a user to chain features together in different orders and combinations by changing only the configuration.

    Distribution System

    The payload has whatever structure you give it. Scribe is policy neutral and doesn't push a logging model on you.

    The configuration file looks something like this:

    # BUCKETIZER TEST
    <store>
    category=buckettest
    type=buffer
    target_write_size=20480
    max_write_interval=1
    buffer_send_rate=2
    retry_interval=30
    retry_interval_range=10
    <primary>
    type=bucket
    num_buckets=6
    bucket_subdir=bucket
    bucket_type=key_hash
    delimiter=1
    <bucket>
    type=file
    fs_type=std
    file_path=/tmp/scribetest
    base_filename=buckettest
    max_size=1000000
    rotate_period=hourly
    rotate_hour=0
    rotate_minute=30
    write_meta=yes
    </bucket>
    </primary>
    <secondary>
    type=file
    fs_type=std
    file_path=/tmp
    base_filename=buckettest
    max_size=30000
    </secondary>
    </store>
    The types of stores currently available are:
  • file - writes to a file, either local or nfs.
  • network - sends messages to another scribe server.
  • buffer - contains a primary and a secondary store. Messages are sent to the primary store if possible, and otherwise the secondary. When the primary store becomes available the messages are read from the secondary store and sent to the primary.
  • bucket - contains a large number of other stores, and decides which messages to send to which stores based on a hash.
  • null - discards all messages.
  • thriftfile - similar to a file store but writes messages into a Thrift TFileTransport file.
  • multi - a store that forwards messages to multiple stores.

    Certainly a flexible and useful set of logging capabilities. You can build a hierarchy of log servers to do pretty much anything you want. You could imagine having a log server on each machine, with a file store to buffer messages when the next server in the chain is unavailable. This log server forwards messages on to a centralized server for the datacenter, and all the datacenter servers forward their logs on to the centralized data warehouse. To scale, adjust fan-in and fan-out as necessary. A rough configuration sketch for such an edge server follows.
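
    As an illustration of the edge tier just described, a local scribe server might use a buffer store whose primary is a network store pointing at a datacenter aggregator and whose secondary spills to local disk. The host name, port, and paths below are placeholders, and the exact option names should be checked against the Scribe documentation.

    <store>
    category=default
    type=buffer
    retry_interval=30
    <primary>
    type=network
    remote_host=scribe-aggregator.dc1.example.com
    remote_port=1463
    </primary>
    <secondary>
    type=file
    fs_type=std
    file_path=/tmp/scribe_spill
    base_filename=default
    max_size=1000000
    </secondary>
    </store>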

    Do Something Usefullizer

    You may not have hundreds of thousands of log messages a second to process, but you are likely to have your own tanker truck full of log messages. How do you do something useful with them?
  • Log messages stored in log files are next to useless. Grep'ing on a terabyte of logs to answer simple questions about your data just doesn't work.
  • You may have a sharded data warehouse you can pump log messages into and query reasonably effectively.
  • Or you can set up a Hadoop/HDFS-style system. The idea here is you need a distributed file system to handle the continual stream of log messages. And once you have all the data stored safely away, you'll need to use map-reduce to do anything with such a large amount of data.

    If you want to ask, for example, how many of your users are from Asia, log files won't work. It's likely your data warehouse can't handle it. Hadoop/HDFS is a practical option.

    If that's the direction you are going, what does it imply about your log system? I would say it makes even the simple category-payload system of Scribe overkill. The idea with a scalable backend is to move log payloads from applications to the centralized store as quickly as possible. By definition the central store can handle the load, so there's no reason to use intermediate servers to scale. Applications write directly to the central store, even from multiple datacenters. The payload structure is unimportant until it hits the central store. If the application can't reach the central store, then it queues into the file system until it can. Ideally log messages never hit the file system until HDFS is writing them to their final destination. This makes for low-latency, high-throughput logging and is even simpler than Scribe. A bare-bones sketch of this approach follows.
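
    Here is a bare-bones sketch of that alternative in Python. The send_to_central_store() function is a made-up placeholder for whatever ingest API your central store exposes (an HDFS writer, a message bus, etc.), and the spill path is arbitrary; the point is simply "write straight to the central store, and queue to the local file system only when it is unreachable."

    import json
    import time

    SPILL_FILE = "/var/tmp/log_spill.jsonl"   # placeholder spill path

    def send_to_central_store(record):
        """Placeholder for the central store's ingest API."""
        raise NotImplementedError

    def log(payload):
        record = {"ts": time.time(), "payload": payload}
        try:
            # Normal path: the payload goes straight from the application to the central store.
            send_to_central_store(record)
        except Exception:
            # Central store unreachable: queue locally until it recovers.
            with open(SPILL_FILE, "a") as spill:
                spill.write(json.dumps(record) + "\n")

    def drain_spill():
        # Replay anything queued while the central store was down, then truncate.
        try:
            with open(SPILL_FILE) as spill:
                for line in spill:
                    send_to_central_store(json.loads(line))
            open(SPILL_FILE, "w").close()
        except FileNotFoundError:
            pass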

    If you don't have a scalable central store then Scribe is a good option. It gives you all the flexibility you need to compose your logging system in a way that is mostly reliable and scalable.
    Saturday
    Nov222008

    Google Architecture

    Update 2: Sorting 1 PB with MapReduce. PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte, or 1,000 terabytes, or 1,000,000 gigabytes. It took six hours and two minutes to sort 1 PB (10 trillion 100-byte records) on 4,000 computers, and the results were replicated thrice on 48,000 disks.

    Update: Greg Linden points to a new Google article, MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

    Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet-scale applications at an alarmingly high, competition-crushing rate. Their goal is always to build a higher-performing, higher-scaling infrastructure to support their products. How do they do that?

    Information Sources

  • Video: Building Large Systems at Google
  • Google Lab: The Google File System
  • Google Lab: MapReduce: Simplified Data Processing on Large Clusters
  • Google Lab: BigTable.
  • Video: BigTable: A Distributed Structured Storage System.
  • Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems.
  • How Google Works by David Carr in Baseline Magazine.
  • Google Lab: Interpreting the Data: Parallel Analysis with Sawzall.
  • Dare Obasanjo's Notes on the scalability conference.

    Platform

  • Linux
  • A large diversity of languages: Python, Java, C++

    What's Inside?

    The Stats

  • Estimated 450,000 low-cost commodity servers in 2006
  • In 2005 Google indexed 8 billion web pages. By now, who knows?
  • Currently there are over 200 GFS clusters at Google. A cluster can have 1,000 or even 5,000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
  • Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.
  • BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.

    The Stack

    Google visualizes their infrastructure as a three layer stack:
  • Products: search, advertising, email, maps, video, chat, blogger
  • Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
  • Computing Platforms: a bunch of machines in a bunch of different data centers
  • Make it easy for folks in the company to deploy at a low cost.
  • Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.

    Reliable Storage Mechanism with GFS (Google File System)

  • Reliable scalable storage is a core need of any application. GFS is their core storage platform.
  • Google File System - large distributed log structured file system in which they throw in a lot of data.
  • Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required: - high reliability across data centers - scalability to thousands of network nodes - huge read/write bandwidth requirements - support for large blocks of data which are gigabytes in size. - efficient distribution of operations across nodes to reduce bottlenecks
  • The system has master and chunk servers. - Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that holds the data they need on disk. - Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers. (See the sketch after this list.)
  • A new application coming online can use an existing GFS cluster or create its own. It would be interesting to understand the provisioning process they use across their data centers.
  • Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.
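
    The read path described above can be sketched roughly as follows. This is purely illustrative pseudocode with invented names, not Google's actual API: the client asks a master for the chunk location, then reads the bytes directly from a replica chunk server, so the master only ever handles metadata.

    CHUNK_SIZE = 64 * 1024 * 1024   # GFS stores file data in 64MB chunks

    class Master:
        """Keeps metadata only: which chunk servers hold each chunk of each file."""
        def __init__(self, chunk_locations):
            # {(path, chunk_index): [chunk server addresses]}
            self.chunk_locations = chunk_locations

        def lookup(self, path, offset):
            chunk_index = offset // CHUNK_SIZE
            return chunk_index, self.chunk_locations[(path, chunk_index)]

    def read(master, chunk_servers, path, offset, length):
        # 1. Ask the master which chunk holds this offset and where its replicas live.
        chunk_index, replicas = master.lookup(path, offset)
        # 2. Fetch the bytes directly from one of the (typically three) replicas;
        #    the data itself never flows through the master.
        server = chunk_servers[replicas[0]]
        return server.read_chunk(path, chunk_index, offset % CHUNK_SIZE, length)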

    Do Something With the Data Using MapReduce

  • Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across 1,000 machines. Databases don't scale, or don't scale cost-effectively, to those levels. That's where MapReduce comes in.
  • MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
  • Why use MapReduce? - Nice way to partition tasks across lots of machines. - Handle machine failure. - Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc. - Computation can automatically move closer to the IO source.
  • The MapReduce system has three different types of servers. - The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks. - The Map servers accept user input and perform map operations on it. The results are written to intermediate files. - The Reduce servers accept the intermediate files produced by the map servers and perform reduce operations on them.
  • For example, you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously, and all the coordination, job scheduling, failure handling, and data transport would be done automatically. - The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS. - In MapReduce a map maps one view of data to another, producing a key-value pair, which in our example is word and count. - Shuffling aggregates values by key. - The reduction sums up all the values for each key and produces the final answer. (See the word-count sketch after this list.)
  • The Google indexing pipeline has about 20 different map reductions. Each stage looks at the data as a whole bunch of records and aggregates them by key. A second map-reduce comes along, takes that result, and does something else. And so on.
  • Programs can be very small. As little as 20 to 50 lines of code.
  • One problem is stragglers. A straggler is a computation that is going slower than others, which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple copies of the same computation and, when one finishes, kill all the rest.
  • Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
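
    As a minimal, single-machine illustration of the word-count example above, here is the Map -> Shuffle -> Reduce flow in plain Python. The real system runs these phases on thousands of machines against data in GFS; the function names here are invented for the sketch.

    from collections import defaultdict

    def map_phase(page_text):
        # Map: emit a (word, 1) pair for every word on a page.
        return [(word, 1) for word in page_text.split()]

    def shuffle(pairs):
        # Shuffle: group all emitted values by key.
        grouped = defaultdict(list)
        for word, count in pairs:
            grouped[word].append(count)
        return grouped

    def reduce_phase(grouped):
        # Reduce: sum the counts for each word to produce the final answer.
        return {word: sum(counts) for word, counts in grouped.items()}

    pages = ["the quick brown fox", "the lazy dog jumps over the fox"]
    pairs = [pair for page in pages for pair in map_phase(page)]
    print(reduce_phase(shuffle(pairs)))   # e.g. {'the': 3, 'fox': 2, ...}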

    Storing Structured Data in BigTable

  • BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
  • BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
  • It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
  • Commercial databases simply don't scale to this level, and they don't work across 1000s of machines.
  • By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.
  • Machines can be added and deleted while the system is running and the whole system just works.
  • Each data item is stored in a cell which can be accessed using a row key, column key, and timestamp. (See the sketch after this list.)
  • Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
  • BigTable has three different types of servers: - The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed. - The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers. - The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion.
  • A locality group can be used to physically store related bits of data together for better locality of reference.
  • Tablets are cached in RAM as much as possible.
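
    As a toy illustration of the data model described above (purely illustrative, not Google's API), each cell is addressed by row key, column key, and timestamp, with lookups returning the most recent version by default:

    import time

    class TinyTable:
        """In-memory sketch of the (row key, column key, timestamp) -> value cell model."""
        def __init__(self):
            # {(row_key, column_key): [(timestamp, value), ...] with newest first}
            self.cells = {}

        def put(self, row_key, column_key, value, timestamp=None):
            ts = timestamp if timestamp is not None else time.time()
            self.cells.setdefault((row_key, column_key), []).insert(0, (ts, value))

        def get(self, row_key, column_key, timestamp=None):
            # Return the newest version, or the newest version at or before `timestamp`.
            for ts, value in self.cells.get((row_key, column_key), []):
                if timestamp is None or ts <= timestamp:
                    return value
            return None

    t = TinyTable()
    t.put("com.example.www", "contents:html", "<html>version 1</html>")
    t.put("com.example.www", "contents:html", "<html>version 2</html>")
    print(t.get("com.example.www", "contents:html"))   # newest version wins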

    Hardware

  • When you have a lot of machines how do you build them to be cost efficient and use power efficiently?
  • Use ultra-cheap commodity hardware and build software on top to handle their deaths.
  • A 1,000-fold computer power increase can be had for a 33 times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.
  • Linux, in-house rack design, PC class mother boards, low end storage.
  • Performance per watt isn't getting better. They have huge power and cooling issues.
  • Use a mix of colocation facilities and their own data centers.

    Misc

  • Push changes out quickly rather than wait for QA.
  • Libraries are the predominant way of building programs.
  • Some applications are provided as services, like crawling.
  • An infrastructure handles versioning of applications so they can be released without fear of breaking things.

    Future Directions for Google

  • Support geo-distributed clusters.
  • Create a single global namespace for all data. Currently data is segregated by cluster.
  • More and better automated migration of data and computation.
  • Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).

    Lessons Learned

  • Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at a scale few others can compete with. Many companies take a completely different approach and treat infrastructure as an expense. Each group will use completely different technologies, and there will be little planning and commonality in how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.
  • Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.
  • Take a look at Hadoop (product) if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.
  • An under appreciated advantage of a platform approach is junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel you'll run into difficulty because the people who know how to do this are relatively rare.
  • Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.
  • Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.
  • Create a Darwinian infrastructure. Perform time-consuming operations in parallel and take the winner.
  • Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.
  • Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.

    Click to read more ...

    Wednesday
    Nov192008

    High Definition Video Delivery on the Web?

    How would you architect and implement an SD and HD internet video delivery system such as the BBC iPlayer or Recast Digital's RDV1? What do you need to consider on top of the Lessons Learned section in the YouTube Architecture post? How is it possible to compete with the big players like Google? Can you just use a CDN and scale efficiently? Would Amazon's cloud services be a viable platform for high-definition video streaming?

    Click to read more ...

    Tuesday
    Nov182008

    Scalability Perspectives #2: Van Jacobson – Content-Centric Networking

    Scalability Perspectives is a series of posts that highlights the ideas that will shape the next decade of IT architecture. Each post is dedicated to a thought leader of the information age and his vision of the future. Be warned though – the journey into the minds and perspectives of these people requires an open mind.

    Van Jacobson

    Van Jacobson is a Research Fellow at PARC. Prior to that he was Chief Scientist and co-founder of Packet Design. Prior to that he was Chief Scientist at Cisco. Prior to that he was head of the Network Research group at Lawrence Berkeley National Laboratory. He's been studying networking since 1969. He still hopes that someday something will start to make sense.

    Scaling the Internet – Does the Net need an upgrade?

    As the Internet is being overrun with video traffic, many wonder if it can survive. With challenges being thrown down over the imbalances that have been created and their impact on the viability of monopolistic business models, the Internet is under constant scrutiny. Will it survive? Or will it succumb to the burden of the billion-plus community that is constantly demanding more and more? Does the Net need an upgrade? To answer this question, a distinguished panel of Van Jacobson, Rick Hutley, Norman Lewis, and David S. Isenberg discussed the issue at the Supernova conference. In this compelling debate, available on IT Conversations, the panel addresses the question and provides some differing perspectives. One of those perspectives is content-centric networking, described by Van Jacobson.

    A New Way to look at Networking

    Today's research community congratulates itself for the success of the internet and passionately argues whether circuits or datagrams are the One True Way. Meanwhile the list of unsolved problems grows. Security, mobility, ubiquitous computing, wireless, autonomous sensors, content distribution, digital divide, third-world infrastructure, etc., are all poorly served by what's available from either the research community or the marketplace. In this amazing Google Tech Talk, Van Jacobson uses various strained analogies and contrived examples to argue that network research is moribund because the only thing it knows how to do is fill in the details of a conversation between two applications. Today, as in the 60s, problems go unsolved due to our tunnel vision and not because of their intrinsic difficulty. And now, like then, simply changing our point of view may make many hard things easy.

    Content-centric networking

    The founding principle of content-centric networking is that a communication network should allow a user to focus on the data he or she needs, rather than having to reference a specific, physical location from which that data is to be retrieved. This stems from the fact that the vast majority of current Internet usage (a "high 90% level of traffic") consists of data being disseminated from a source to a number of users. The current architecture of the Internet revolves around a conversation model, created in the 1970s to allow geographically distributed users to use a few big, immobile computers. The content-centric approach seeks to adapt the basic architecture of the network to current usage patterns. The new approach comes with a wide range of benefits, one of which is building security (both authentication and encryption) into the network at the data level. Despite all its advantages, this idea doesn't seem to map very well to some current uses of the Web (like web applications, where data is generated on the fly according to user actions) or to real-time applications like VoIP and instant messaging. But one can envision an Internet where content-centric protocols take care of the diffusion-based uses of the network, creating an overlay network, while genuine conversation-centric protocols stay on the current infrastructure.

    Solutions or workarounds?

    There are many solutions or workarounds for the problems posed by traditional conversation-based networking, such as Content Delivery Networks, caching, distributed filesystems, P2P, and PKI. By taking Van Jacobson's perspective we can investigate new dimensions of these problems. What could be the impact of this perspective on the future of the Internet architecture? What do you think? I recommend the A New Way to Look at Networking video by Van Jacobson. He gives a brief history of networking, from the phone system to the Internet, and lays out his vision of dissemination networking.

    Information Sources

    Click to read more ...

    Friday
    Nov142008

    Private/Public Cloud

    Data centers are reshaping themselves by taking ideas from public cloud providers such as Amazon and Google. The idea is to make the data center more cost-effective by enabling on-demand, utility-based computing rather than dedicated machines. At the same time, it is clear that to make IT operations more effective, it doesn't make sense to run all the applications currently hosted in a company's data center in the private cloud. This calls for integration between private and public clouds. In this post I discuss some of the challenges involved in making that happen: 1. How do we design applications to be cloud-agnostic? 2. How do we enable seamless fail-over to a public cloud? 3. Future-proofing: There are many cases in which we can't make a clear decision as to where our application should be running at the time of writing or developing it. We would like to be able to change the decision as to where our application will run even after it has been completely developed.

    Click to read more ...