Wednesday
Mar 19, 2008

RAD Lab is Creating a Datacenter Operating System

The RAD Lab (Reliable Adaptive Distributed Systems Laboratory) wants to leapfrog the Big Switch and create The Next Big Switch, skipping the cloud/utility evolutionary stage altogether. This hyper-evolutionary niche buster develops technology so advanced the cloud disperses and you can go back to building your own personal datacenters again. Where Google took years to create its datacenters, a prefab Datacenter Operating System might let you create your own over a long holiday weekend. Not St. Patrick's, of course.

Their vision: Enable one person to invent and run the next revolutionary IT service, operationally expressing a new business idea as a multi-million-user service over the course of a long weekend. By doing so we hope to enable an Internet "Fortune 1 million".

How? By wizardry in the form of a "datacenter operating system" created from a pinch of "statistical machine learning (SML)" and a tincture of "recent insights from networking and distributed systems." But like most magic it's not so outlandish once you understand it:

  • Virtual machines provide the OS mechanism.
  • SML enforces the overarching policy.
  • Tools collect sensor data from all the hardware and software components.
  • Actuators shut down, reboot, or migrate services inside the datacenter (a rough sketch of this sensor/actuator loop appears below).
  • Workload generators and application simulators record the behaviors of proprietary systems and then recreate them in a research environment.
  • Ruby on Rails is the likely programming framework.
  • Chubby and MapReduce are the libraries.
  • Storage is via services like BigTable, Google File System, and Amazon’s Simple Storage Service.
  • Crash-only software design.
  • CAP (consistency, availability, partition-tolerance) based design strategies.
  • Improve the efficiency of power delivery and usage.

The only new part would be the SML. All the rest is fairly standard by now, even if it's not yet available in a nice gift box at a discount store. And I am highly skeptical when people draw a big circle around the really tricky complex bits and say we'll solve all that with "statistical machine learning", but the idea is intriguing. The dramatic rise of cloud/utility computing makes the personal datacenter idea less appealing than it otherwise would have been. When datacenters were built from scratch by hardy settlers with nothing but flint knives and bear skins, a Datacenter OS would have been very exciting. But now, isn't leveraging multiple clouds a better strategy? After all, the DC OS really just packages best practices. It won't really innovate for you so you aren't gaining a competitive advantage or even a lower cost structure. And if that's the case, wouldn't I rather have someone else do all of the work? But I have high hopes I'll have my own personal power plant in the near future. Maybe one of the things it will power is my own personal datacenter!
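
To make the sensor/actuator loop from the component list above a bit more concrete, here is a purely hypothetical sketch in Python. Nothing here comes from the RAD Lab: the SML policy is stood in for by a trivial threshold check, and every metric, node name, and function name is made up for illustration.

    # Hypothetical sketch of a sensor -> policy -> actuator control loop.
    # Real SML would replace the threshold check in policy().
    import time

    def collect_metrics(nodes):
        """Sensors: gather per-node utilization (stubbed with fake numbers here)."""
        fake_cpu = {"vm-01": 0.45, "vm-02": 0.62, "vm-03": 0.97}
        return {node: {"cpu": fake_cpu.get(node, 0.5), "error_rate": 0.001} for node in nodes}

    def policy(metrics):
        """Placeholder for the statistical machine learning policy: decide actions."""
        actions = []
        for node, m in metrics.items():
            if m["cpu"] > 0.90 or m["error_rate"] > 0.05:
                actions.append(("migrate", node))   # a real policy would be far subtler
        return actions

    def actuate(action, node):
        """Actuators: shut down, reboot, or migrate services on a node (stubbed)."""
        print("would %s services on %s" % (action, node))

    if __name__ == "__main__":
        nodes = ["vm-01", "vm-02", "vm-03"]
        for _ in range(2):                          # the control loop
            for action, node in policy(collect_metrics(nodes)):
                actuate(action, node)
            time.sleep(1)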

    Related Articles

  • Home Page for RAD Lab - Reliable Adaptive Distributed Systems Laboratory
  • RADLab Technical Vision (2005)
  • CS 294-23, Software as a Service (Patterson/Fox/Sobel)
  • Internet-scale Computing: The Berkeley RADLab Perspective
  • CS 294-14: Architecture of Internet Datacenters. This is a course at Berkeley and many of the classes have lecture notes. Very cool.

PS: Is it "datacenter" or "data center"? Both are used and it drives me crazy.


    Wednesday
    Mar 19, 2008

    Serving JavaScript Fast

    Cal Henderson writes at thinkvitamin.com: "With our so-called 'Web 2.0' applications and their rich content and interaction, we expect our applications to increasingly make use of CSS and JavaScript. To make sure these applications are nice and snappy to use, we need to optimize the size and nature of content required to render the page, making sure we’re delivering the optimum experience. In practice, this means a combination of making our content as small and fast to download as possible, while avoiding unnecessarily refetching unmodified resources." A lot of good comments too.
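
    The "avoid unnecessarily refetching unmodified resources" part mostly comes down to HTTP caching headers. Here is a small sketch (not from the article) of serving one static JavaScript file with an ETag, a far-future Cache-Control header, and a 304 answer to conditional GETs. The file name and port are made up, and it assumes a site.js file exists in the working directory.

        import hashlib
        from http.server import BaseHTTPRequestHandler, HTTPServer

        ASSET_PATH = "site.js"  # hypothetical bundled JavaScript file

        class AssetHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                with open(ASSET_PATH, "rb") as f:
                    body = f.read()
                etag = '"%s"' % hashlib.md5(body).hexdigest()
                if self.headers.get("If-None-Match") == etag:
                    self.send_response(304)          # client's copy is still valid
                    self.end_headers()
                    return
                self.send_response(200)
                self.send_header("Content-Type", "application/javascript")
                self.send_header("ETag", etag)
                self.send_header("Cache-Control", "public, max-age=31536000")  # far-future expiry
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

        if __name__ == "__main__":
            HTTPServer(("", 8000), AssetHandler).serve_forever()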

    Click to read more ...

    Tuesday
    Mar 18, 2008

    Shared filesystem on EC2

    Hi. I'm looking for a way to share files between EC2 nodes. Currently we are using GlusterFS to do this. It has been reliable recently, but in the past it has crashed under high load and we've had trouble starting it up again. We've only been able to restart it by removing the files, restarting the cluster, and filling it up again with our files from backup. This takes ages, and will take even longer the more files we get.

    What worries me is that it seems to make each node a point of failure for the entire system. One node crashes and soon the entire cluster has crashed. The other problem is adding another node. It seems like you have to take down the whole thing, reconfigure it to include the new node, and restart. This kind of defeats the horizontal scaling strategy.

    We are using 2 EC2 instances as web servers, 1 as a DB master, and 1 as a slave. GlusterFS is installed on the web server machines as well as the DB slave machine (we back up files to S3 from this machine). The files are mostly thumbnails, but also some larger images and media files.

    Does anyone have a good solution for sharing files between EC2 nodes? I like the ThruDB [http://trac.thrudb.org/] concept of using the local filesystem as a cache for S3, but I'm not sure if ThruDB is mature enough yet. Or maybe some kind of distributed filesystem built on top of git would work? Any ideas? Thanks! ~rvr
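
    For anyone wondering what the "local filesystem as a cache for S3" pattern looks like in practice, here is a rough sketch using boto3. The bucket name and cache directory are hypothetical, and this only approximates a shared filesystem: S3 is the source of truth and each node keeps its own read cache.

        import os
        import shutil
        import boto3

        CACHE_DIR = "/var/cache/s3files"   # hypothetical local cache directory
        BUCKET = "my-thumbnails-bucket"    # hypothetical bucket name

        s3 = boto3.client("s3")

        def get_file(key):
            """Return a local path for `key`, fetching from S3 on a cache miss."""
            local_path = os.path.join(CACHE_DIR, key)
            if not os.path.exists(local_path):
                os.makedirs(os.path.dirname(local_path), exist_ok=True)
                s3.download_file(BUCKET, key, local_path)   # cache miss: pull from S3
            return local_path

        def put_file(key, local_path):
            """Write-through: upload to S3 (the source of truth) and keep a local copy."""
            s3.upload_file(local_path, BUCKET, key)
            cached = os.path.join(CACHE_DIR, key)
            os.makedirs(os.path.dirname(cached), exist_ok=True)
            if os.path.abspath(cached) != os.path.abspath(local_path):
                shutil.copyfile(local_path, cached)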


    Tuesday
    Mar 18, 2008

    Database War Stories #3: Flickr

    [Tim O'Reilly] Continuing my series of queries about how "Web 2.0" companies used databases, I asked Cal Henderson of Flickr to tell me "how the folksonomy model intersects with the traditional database. How do you manage a tag cloud?"


    Tuesday
    Mar 18, 2008

    Database Design 101

    I am working on the design for my database and can't seem to come up with a firm schema. I am torn between normalizing the data and dealing with the overhead of joins, and denormalizing it for easy sharding. The data is essentially music information per user: UserID, Artist, Album, Song. This lends itself nicely to being normalized, with separate User, Artist, Album and Song tables and a table full of INTs to tie them together.

    This will be a mostly read-based environment, with about 80% of the load being searches of data by artist, album, or song. By the time I begin the query for artist, album, or song I will already have a list of UserIDs to limit the search by. The problem is that the tables can get unmanageably large pretty quickly, and my plan was to shard off users once it got too big.

    Given this simple data relationship, what are the pros and cons of normalizing the data vs. denormalizing it? Should I go with 4 separate, normalized tables or one 4-column table? Perhaps it might be best to write the data in both formats at first and see what query speed is like once the tables fill up. Another potential issue is that inserts will be coming in batches of about 500 - 2000+ per user at a time, which will be pretty intensive for the normalized tables, since each insert needs quite a few selects to check whether the artist, album, or song is already in the database. What do you all think?
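
    To make the two options being weighed above concrete, here is a quick sketch using sqlite3 from Python's standard library. The table and column names are illustrative only, not a recommendation for either design.

        import sqlite3

        conn = sqlite3.connect(":memory:")

        # Option 1: normalized -- lookup tables plus a narrow table of INTs tying them together.
        conn.executescript("""
        CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE album  (album_id  INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE song   (song_id   INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE user_song (
            user_id   INTEGER,
            artist_id INTEGER REFERENCES artist(artist_id),
            album_id  INTEGER REFERENCES album(album_id),
            song_id   INTEGER REFERENCES song(song_id)
        );
        """)

        # Option 2: denormalized -- one wide 4-column table, trivially sharded by user_id.
        conn.executescript("""
        CREATE TABLE user_music (
            user_id INTEGER,
            artist  TEXT,
            album   TEXT,
            song    TEXT
        );
        """)

        # The read path described in the post: search by artist for an already-known set of users.
        user_ids = (1, 2, 3)
        normalized_q = """
            SELECT us.user_id, s.name
            FROM user_song us
            JOIN artist a ON a.artist_id = us.artist_id
            JOIN song   s ON s.song_id   = us.song_id
            WHERE a.name = ? AND us.user_id IN (?, ?, ?)
        """
        denormalized_q = """
            SELECT user_id, song FROM user_music
            WHERE artist = ? AND user_id IN (?, ?, ?)
        """
        print(conn.execute(normalized_q, ("Some Artist",) + user_ids).fetchall())
        print(conn.execute(denormalized_q, ("Some Artist",) + user_ids).fetchall())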


    Monday
    Mar 17, 2008

    Paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

    Consistent hashing is one of those ideas that really puts the science in computer science and reminds us why all those really smart people spend years slaving over algorithms. Consistent hashing is "a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots" and was originally a way of distributing requests among a changing population of web servers. My first reaction to the idea was "wow, that's really smart" and I sadly realized I would never come up with something so elegant. I then immediately saw applications for it everywhere. And consistent hashing is used everywhere: distributed hash tables, overlay networks, P2P, IM, caching, and CDNs.

    Here's the abstract from the original paper; after the abstract are some links to a few very good articles with accessible explanations of consistent hashing and its applications in the real world, and a toy code sketch of the idea follows the list.

    Abstract: We describe a family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network. Our protocols are particularly designed for use with very large networks such as the Internet, where delays caused by hot spots can be severe, and where it is not feasible for every server to have complete information about the current state of the entire network. The protocols are easy to implement using existing network protocols such as TCP/IP, and require very little overhead. The protocols work with local control, make efficient use of existing resources, and scale gracefully as the network grows. Our caching protocols are based on a special kind of hashing that we call consistent hashing. Roughly speaking, a consistent hash function is one which changes minimally as the range of the function changes. Through the development of good consistent hash functions, we are able to develop caching protocols which do not require users to have a current or even consistent view of the network. We believe that consistent hash functions may eventually prove to be useful in other applications such as distributed name servers and/or quorum systems.

    Other excellent resources for learning more about consistent hashing are at:

  • Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
  • Consistent Hashing by Tom White. A good explanation and some actual Java code as an implementation.
  • Programmer’s Toolbox Part 3: Consistent Hashing by Tom Kleinpeter. Another good explanation with an emphasis on useful applications: load distribution on failure, load tuning by capacity, method for bringing servers on line, redundant caching to protect the database in case of failure.
  • Distributed Hash Tables: an infrastructure that can be used to build more complex services, such as distributed file systems, peer-to-peer file sharing and content distribution systems, cooperative web caching, multicast, anycast, domain name services, and instant messaging. Notable distributed networks that use DHTs include BitTorrent (with extensions), eDonkey network, YaCy, and the Coral Content Distribution Network.
  • Chord - a peer-to-peer lookup algorithm. It allows a distributed set of participants to agree on a single node as a rendezvous point for a given key, without any central coordination.
  • Dynamo, Amazon's database, uses consistent hashing.
  • Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution
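
    To make the idea concrete, here is the toy sketch mentioned above: a minimal consistent hash ring in Python with virtual nodes. It illustrates the property the abstract describes (adding or removing a node only remaps a small fraction of keys); it is not the paper's actual protocol, and the node and key names are made up.

        import bisect
        import hashlib

        class HashRing:
            def __init__(self, nodes=(), replicas=100):
                self.replicas = replicas          # virtual nodes per server, evens out load
                self._ring = []                   # sorted list of (point, node)
                for node in nodes:
                    self.add(node)

            def _hash(self, key):
                return int(hashlib.md5(key.encode()).hexdigest(), 16)

            def add(self, node):
                for i in range(self.replicas):
                    bisect.insort(self._ring, (self._hash("%s:%d" % (node, i)), node))

            def remove(self, node):
                self._ring = [(p, n) for p, n in self._ring if n != node]

            def get(self, key):
                """Return the node owning `key`: the first ring point clockwise from its hash."""
                if not self._ring:
                    raise KeyError("empty ring")
                idx = bisect.bisect(self._ring, (self._hash(key), ""))
                if idx == len(self._ring):        # wrap around the ring
                    idx = 0
                return self._ring[idx][1]

        ring = HashRing(["cache1", "cache2", "cache3"])
        print(ring.get("user:42"))                # same key always maps to the same node
        ring.add("cache4")                        # only roughly 1/4 of keys move to the new node
        print(ring.get("user:42"))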


    Monday
    Mar 17, 2008

    Microsoft's New Database Cloud Ready to Rumble with Amazon

    Update: ZDNet says Ozzie signals Microsoft's surrender to the cloud. CD-ROMs are to the internet as the internet is to the cloud, and Microsoft aims to scratch and claw its way into this paradigm shift as well. The gloves are off. The tag line for Microsoft's new SQL Server Data Services is Your Data, Any Place, Any Time. Thems fighten' words. Microsoft is itch'n for a fight! Who will be Amazon's second?

    The service description: SQL Server Data Services (SSDS) are highly scalable, on-demand data storage and query processing utility services. Built on robust SQL Server database and Windows Server technologies, these services provide high availability, security and support standards-based web interfaces for easy programming and quick provisioning. Sounds like a fast uppercut aimed squarely at SimpleDB's jaw. As a developer what do you need to know?

  • Highly available and highly scalable.
  • Targeted at applications that can tolerate high internet latencies. Primarily read-mostly data sets "storing data that is naturally partitionable into disjoint data sets requiring little or no cross-correlations."
  • Pricing isn't yet available.
  • No announced SLA, though you will be notified of downtime.
  • Access is via REST or SOAP.
  • Data is redundantly stored and backed up.
  • Geo-redundant data copies are maintained. Data is stored in large storage clusters in various Microsoft data centers located across North America. Plan to expand to more areas later.
  • The data model is flexible. No schemas required. "Simply add new attributes to your data set when needed, and the system will automatically store, index, and query your data accordingly."
  • String, numeric, datetime, and boolean data types are supported. All attributes are indexed.
  • The data model is: Customer { SSDS account (1..N) { Authority (1..N) { Container (0..N) { Entity (0..N) } } } }
    - Authorities give billing entities a way to organize their usage for accounting, security and co-location purposes. All containers under a single authority are provisioned within the same data center. As such, authorities are the unit of geo-scale and geo-location. For example: Seattle or San Francisco.
    - Containers create contexts and scope for entity storage and query. For example, within its authorities, operations could choose to assign each member their own container, intended to contain a set of personal data for that member. Containers are the unit of consistency in the Microsoft SSDS service. For example: Autos for Sale, Services Offered.
    - Entities are the fundamental unit of storage in the system. Entities are a bag of scalars with no enforced type. For example, an individual member's jobs, educational institutions, contacts, recommendations, etc. could all be modeled as entities.
  • Data manipulation operations include:
    - Creation and deletion of containers. There are no updatable container properties.
    - Creation, replacement, and deletion of entities.
    - Retrieval of a single container in a serialized format.
    - Retrieval of a single entity in a serialized format.
  • A text-based query language is supported that follows the LINQ pattern for C#. Only complete entities are returned. For example, a query addressed to a container to retrieve all entities in that container having a "City" property equal to "Seattle" and a "State" property equal to "WA" would be written as follows: from e in entities where e["City"] == "Seattle" && e["State"] == "WA" select e
  • Resource queries are supported. This URI returns all the entities in that container: http://mydomain.ssds.microsoft.com/ChildrensBooksContainer1 (a rough sketch of issuing such a request appears after this list).
  • Customers will be able to associate entities with blobs which could be accessed as URL addressable resources.
  • SSDS is not just a hosted version of SQL Server. It's SQL Server deconstructed, running on blade servers with SATA drives.
  • A local on-premise version of SSDS will also be available to "allow users to better synchronize between the enterprise and the cloud, especially when handling reporting, analytics and business-intelligence tasks."
  • Not sure if you can do a full text search on a string property.
  • Not sure how large attributes can be. Should I store my 25MB documents as a string field so they can be searched? Or is that too expensive? Or is it even supported?

    Where Microsoft is behind Amazon is that they don't have EC2 and S3. Microsoft is having you traverse the internet for each database access. Now let's say you have chunks of large storage stored off in a storage cloud. That's a lot of overhead. A major benefit of AWS is the free bandwidth and higher performance within the AWS cloud. If your application is completely hosted within the cloud the only slow part is from the server to the user's browser. All-in-all SSDS seems very comparable to and competitive with SimpleDB. It's hard to say without solid pricing information. But it will be interesting to see how Microsoft's cloud strategy evolves and how we can all benefit from the competition.
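
    To make the resource-query bullet above concrete, here is an illustrative sketch of issuing that container-level GET with Python's standard library. The URI pattern is the one given in the post; authentication and the exact response format are glossed over, and the Accept header is just a guess.

        import urllib.request

        # Example container URI from the post (illustrative; it won't resolve as-is).
        container_uri = "http://mydomain.ssds.microsoft.com/ChildrensBooksContainer1"

        request = urllib.request.Request(container_uri, headers={"Accept": "application/xml"})
        with urllib.request.urlopen(request) as response:   # returns all entities in the container
            print(response.read().decode("utf-8"))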

    Related Articles

  • Home Page for SSDS
  • The Current Pros and Cons List for SimpleDB
  • Microsoft SQL Server Data Services
  • SQL Server Data Services FAQ
  • Microsoft launches its alternative to Amazon’s SimpleDB


    Sunday
    Mar 16, 2008

    Do you have any questions for the Elastra CEO?

    It looks like in the near future I'll have a chance to interview the Elastra CEO. Elastra provides standard databases--MySQL, EnterpriseDB and PostgreSQL--on top of EC2 and S3. They are selling aggressive pricing, database resource usage that expands and contracts in response to demand, and a simple management and operations interface to well-known databases deployed in a cloud. Elastra could be an important option for developers looking for a more traditional cloudy database. I was wondering if you guys had any suggestions for questions you would like answered? What would you like to know about their service? What are you looking for in a cloudy database? What would stop you from adopting it, or what would make you decide to adopt it? Any ideas you have would help a lot and will probably be better than anything I have.


    Sunday
    Mar 16, 2008

    Product: GlusterFS

    Adapted from their website: GlusterFS is a clustered file system capable of scaling to several petabytes. It aggregates various storage bricks over InfiniBand RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware, such as x86-64 servers with SATA-II RAID and InfiniBand HBAs. Cluster file systems are still not mature for the enterprise market: they are extremely scalable, cheap, and can be built entirely out of commodity OSes and hardware, but they are too complex to deploy and maintain. GlusterFS hopes to solve this problem.

    GlusterFS achieved 35 GB/s read throughput. The GlusterFS Aggregated I/O Benchmark was performed on a 64-brick clustered storage system over a 10 Gbps InfiniBand interconnect. A cluster of 220 clients pounded the storage system with multiple dd (disk-dump) instances, each reading/writing a 1 GB file with a 1 MB block size. GlusterFS was configured with the unify translator and round-robin scheduler.

    The advantages of GlusterFS are:

  • Designed for O(1) scalability and feature rich.
  • Aggregates on top of existing filesystems. Users can recover files and folders even without GlusterFS.
  • No single point of failure. Completely distributed, with no centralized meta-data server like Lustre.
  • Extensible scheduling interface with modules loaded based on the user's storage I/O access pattern.
  • Modular and extensible through a powerful translator mechanism.
  • Supports InfiniBand RDMA and TCP/IP.
  • Entirely implemented in user space. Easy to port, debug and maintain.
  • Scales on demand.

    Related Articles

  • Technical Presentation on GlusterFS
  • Open Fest 5th Annual Conference
  • Zresearch
  • GlusterFS FAQ


    Saturday
    Mar 15, 2008

    New Website Design Considerations

    I am in the design phase of getting a website up and running that will have scalability as a main concern. I am looking for opinions on architecture and the like for this endeavor. The site has a few unique characteristics that make scalability difficult. Users will all have a pretty large amount of data that other users will be able to search. The site will be entirely based around search. The catch is that users will always be searching with a stipulation of within 'n' miles of them. I imagine that fact will kill the possibility of query caching for most searches.

    I have extensive experience with PHP and MySQL, some experience with ASP.NET/C#, and some experience with Perl, but can learn anything fast. The site will start out on a single server, but I want to be 100% certain that I architect the code and databases such that scaling will be simple.

    What language should I code the site in? What DB would you use: Postgres, MySQL, MSSQL, BerkeleyDB? Should we shard the database by location? By user? Not at all? What does everyone think for possible architectures on this?
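
    On the "'n' miles from me" point, one common way to keep radius searches index-friendly (offered as a general technique, not a prescription for this site) is to compute a latitude/longitude bounding box for the radius, filter on it with an ordinary indexed range query, and only then apply the exact distance check to the survivors. A small self-contained sketch with illustrative coordinates:

        import math

        EARTH_RADIUS_MILES = 3959.0

        def bounding_box(lat, lon, radius_miles):
            """Return (min_lat, max_lat, min_lon, max_lon) roughly covering the radius."""
            dlat = math.degrees(radius_miles / EARTH_RADIUS_MILES)
            dlon = math.degrees(radius_miles / (EARTH_RADIUS_MILES * math.cos(math.radians(lat))))
            return lat - dlat, lat + dlat, lon - dlon, lon + dlon

        def haversine_miles(lat1, lon1, lat2, lon2):
            """Exact great-circle distance, applied only to rows that passed the box filter."""
            p1, p2 = math.radians(lat1), math.radians(lat2)
            dphi = math.radians(lat2 - lat1)
            dlmb = math.radians(lon2 - lon1)
            a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
            return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

        # The box bounds become: WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?
        min_lat, max_lat, min_lon, max_lon = bounding_box(37.77, -122.42, 25)
        print(min_lat, max_lat, min_lon, max_lon)
        print(haversine_miles(37.77, -122.42, 37.80, -122.27))  # roughly 8 miles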
