Stuff the Internet Says on Scalability For December 17th, 2010

  • If you missed it here's a link to my webinar and here's the slidedeck for the talk with a buch of additional slides that I didn't have a chance to talk about. The funky picture of Lincoln is classic.
  • Can MySQL really handle 1,000,000 req/sec? Sure, when you turn it into a NoSQLish database, skip all the SQL processing, and access the backend store directly. Percona is making this possible with their HandlerSocket plugin based on the work of Yoshinori Matsunobu.
  • Quotable Quotes:
    • @labsji: If SQL is an abstraction of Big machines....NoSQL is an abstration of distributed computing.
    • : man this eventual consistency #nosql thingy makes #facebook even more annoying. "you have a new comment, no you dont"
  • Nice racks. Time has pictures of a Facebook datacenter. 

Click to read more ...


7 Design Patterns for Almost-infinite Scalability

Good article from summarizing design patterns from Pat Helland's amazing paper Life beyond Distributed Transactions: an Apostate's Opinion.

  1. Entities are uniquely identified - each entity which represents disjoint data (i.e. no overlap of data between entities) should have a unique key.
  2. Multiple disjoint scopes of transactional serializability - in other words there are these 'entities' and that you cannot perform atomic transactions across these entities.
  3. At-Least-Once messaging - that is an application must tolerate message retries and out-of-order arrival of messages.
  4. Messages are adressed to entities - that is one can't abstract away from the business logic the existence of the unique keys for addresing entities. Addressing however is independent of location.
  5. Entities manage conversational state per party - that is, to ensure idemptency an entity needs to remember that a message has been previously processed. Furthermore, in a world without atomic transactions, outcomes need to be 'negotiated' using some kind of workflow capability.
  6. Alternate indexes cannot reside within a single scope of serializability - that is, one can't assume the indices or references to entities can be update atomically. There is the potential that these indices may become out of sync.
  7. Messaging between Entities are Tentative - that is, entities need to accept some level of uncertainty and that messages that are sent are requests form commitment and may possibly be cancelled.

The article then compares how these principles compare so of the design principles used to develop S3: 

Click to read more ...


Still Time to Attend My Webinar Tomorrow: What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications

It's time to do something a little different and for me that doesn't mean cutting off my hair and joining a monastery, nor does it mean buying a cherry red convertible (yet), it means doing a webinar!

  • On December 14th, 2:00 PM - 3:00 PM EST, I'll be hosting What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications.
  • The webinar is sponsored by VoltDB, but it will be completely vendor independent, as that's the only honor preserving and technically accurate way of doing these things.
  • The webinar will run about 60 minutes, with 40 minutes of speechifying and 20 minutes for questions.
  • The hashtag for the event on Twitter will be SQLNoSQL. I'll be monitoring that hashtag if you have any suggestions for the webinar or if you would like to ask questions during the webinar. 

Click to read more ...


How To Get Experience Working With Large Datasets

The Giant Twins

I think I have been lucky that several of the projects I been worked on have exposed me to having to manage large volumes of data. The largest dataset was probably at MailChannels, though also had some sizeable data for their books store and department store. Most of the pain with Livedoor’s data was from it being in Japanese. Other than that, it was pretty static. This was similar to the data I worked with at the BBC. You would be surprised at how much data can be involved with a single episode of a TV show. With any in-house generated data the update size and frequency is much less dramatic, even if the data is being regularly pumped in from 3rd parties.

Click to read more ...


Sponsored Post: Joyent, Membase, Appirio, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

Cool Products and Services

Click to read more ...


What the heck are you actually using NoSQL for?

It's a truism that we should choose the right tool for the job. Everyone says that. And who can disagree? The problem is this is not helpful advice without being able to answer more specific questions like: What jobs are the tools good at? Will they work on jobs like mine? Is it worth the risk to try something new when all my people know something else and we have a deadline to meet? How can I make all the tools work together?

In the NoSQL space this kind of real-world data is still a bit vague. When asked, vendors tend to give very general answers like NoSQL is good for BigData or key-value access. What does that mean for for the developer in the trenches faced with the task of solving a specific problem and there are a dozen confusing choices and no obvious winner? Not a lot. It's often hard to take that next step and imagine how their specific problems could be solved in a way that's worth taking the trouble and risk.

Let's change that. What problems are you using NoSQL to solve? Which product are you using? How is it helping you? Yes, this is part the research for my webinar on December 14th, but I'm a huge believer that people learn best by example, so if we can come up with real specific examples I think that will really help people visualize how they can make the best use of all these new product choices in their own systems.

Here's a list of uses cases I came up with after some trolling of the interwebs. The sources are so varied I can't attribute every one, I'll put a list at the end of the post. Please feel free to add your own. I separated the use cases out for a few specific products simply because I had a lot of uses cases for them they were clearer out on their own. This is not meant as an endorsement of any sort. Here's a master list of all the NoSQL products. If you would like to provide a specific set of use cases for a product I'd be more than happy to add that in.

Click to read more ...


GPU vs CPU Smackdown : The Rise of Throughput-Oriented Architectures

In some ways the original Amazon cloud, the one most of us still live in, was like that really cool house that when you stepped inside and saw the old green shag carpet in the living room, you knew the house hadn't been updated in a while. The network is a little slow, the processors are a bit dated, and virtualization made the house just feel smaller. It has been difficult to run high bandwidth or low latency workloads in the cloud. Bottlenecks everywhere. Not a big deal for most applications, but for many high performance applications (HPC) it was a killer.

In a typical house you might just do a remodel. Upgrade a few rooms. Swap out builder quality appliances with gleaming stainless steel monsters. But Amazon has a big lot, instead of remodeling they simply keep adding on entire new wings, kind of like the Winchester Mystery House of computing.

The first new wing added was a CPU based HPC system featuring blazingly fast Nehalem chips, virtualization replaced by a close to metal Hardware Virtual Machine (HVM) architecture, and the network is a monster 10 gigabits with the ability to specify placement groups to carve out a low-latency, high bandwidth cluster. Bottlenecks removed. Most people still probably don't even know this part of the house exists.

The newest addition is a beauty, it's a graphics processing unit (GPU) cluster as described by Werner Vogels in Expanding the Cloud - Adding the Incredible Power of the Amazon EC2 Cluster GPU Instances . It's completely modern and contemporary. The shag carpet is out. In are Nvidia M2050 GPU based clusters which make short work of applications in the sciences, finance, oil & gas, movie studios and graphics.

Click to read more ...


8 Commonly Used Scalable System Design Patterns

Ricky Ho in Scalable System Design Patterns has created a great list of scalability patterns along with very well done explanatory graphics. A summary of the patterns are:

  1. Load Balancer - a dispatcher determines which worker instance will handle a request based on different policies.
  2. Scatter and Gather - a dispatcher multicasts requests to all workers in a pool. Each worker will compute a local result and send it back to the dispatcher, who will consolidate them into a single response and then send back to the client.
  3. Result Cache - a dispatcher will first lookup if the request has been made before and try to find the previous result to return, in order to save the actual execution.
  4. Shared Space - all workers monitors information from the shared space and contributes partial knowledge back to the blackboard. The information is continuously enriched until a solution is reached.
  5. Pipe and Filter - all workers connected by pipes across which data flows.
  6. MapReduce -  targets batch jobs where disk I/O is the major bottleneck. It use a distributed file system so that disk I/O can be done in parallel.
  7. Bulk Synchronous Parallel - a  lock-step execution across all workers, coordinated by a master.
  8. Execution Orchestrator - an intelligent scheduler / orchestrator schedules ready-to-run tasks (based on a dependency graph) across a clusters of dumb workers.

Sponsored Post: Cloudkick, Strata, Undertone, Joyent, Appirio, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

Cool Products and Services

Click to read more ...


NoCAP – Part III – GigaSpaces clustering explained..

In many of the recent discussions on the design of large scale systems (a.k.a. Web Scale) it was argued that the right set of tradeoffs for building large scale systems would be to give away Consistency for Availability and Partition tolerance. Those arguments relied on the foundation of the CAP theorem developed in early 2000-2002. One of the core principals behind the CAP theorem is that you must choose two out of the three CAP properties. In many of the transactional systems giving away consistency is either impossible or yields a huge complexity in the design of those systems. In this series of posts, I've tried to suggest a different set of tradeoffs in which we could achieve scalability without compromising on consistency. I also argued that rather than choosing only two out of the three CAP properties we could choose various degrees of all three. The degrees would be determined by the most likely availability and partition tolerance scenarios in our specific application.  The suggested model was based on the experience we had in GigaSpaces over the course of the past years and was successfully deployed in many mission critical systems today in Finance, Telco and ecommerce business. I hope that through the sharing of this experience we could come up with a broader set of patterns on how to build large scale systems that would fit also to mission critical transactional systems. Read more...