Mollom Architecture - Killing Over 373 Million Spams at 100 Requests Per Second

Mollom is one of those cool SaaS companies every developer dreams of creating when they wrack their brains looking for a viable software-as-a-service startup. Mollom profitably runs a useful service—spam filtering—with a small group of geographically distributed developers. Mollom helps protect nearly 40,000 websites from spam, including one of mine, which is where I first learned about Mollom. In a desperate attempt to stop spam on a Drupal site, where every other form of CAPTCHA had failed miserably, I installed Mollom in about 10 minutes and it immediately started working. That's the out of the box experience I was looking for.

From the time Mollom opened its digital inspection system they've rejected over 373 million spams and in the process they've learned that a stunning 90% of all messages are spam. This spam torrent is handled by only two geographically distributed machines that handle 100 requests/ second, each running a Java application server and Cassandra. So few resources are necessary because they've created a very efficient machine learning system. Isn't that cool? So, how do they do it?

Click to read more ...


Stuff The Internet Says On Scalability For February 4, 2011

Submitted for your reading pleasure...

  • Super Bowl Prediction: Pittsburgh 27, Green Bay 24. I'll be rooting for Green Bay, but the Pittsburgh defense will eventually win the day, beating back the fleet footed, quick tossing, and sharp shooting Aaron Rodgers. Roethlisberger will make exactly 3 plays that matter, but they'll be the right 3 plays.
  • Reddit is now at 1 billion page views a month. Congratulations!
  • Amazon S3 Cloud Stores 262 Billion Objects.  My god, it's full of stars...
  • Quora’s Technology Examined by Phil Whelan. Excellent detective work answering the question: How Does Quora Work?
  • Quotable Quotes:
    • @timoreilly: When hardware became commoditized, software was valuable. Now that software being commoditized, data is valuable. #strataconf
    • @coldfusionPaul: "Write someone a query, they'll go away for a day. Teach someone to query, they'll just go away." so, I use #NoSQL 555
    • @squarecog: To go *really* fast, you want to get rid of spokes in your wheels, and ditch tires. Also, turning is overrated. #nosql

Click to read more ...


Piccolo - Building Distributed Programs that are 11x Faster than Hadoop

Piccolo (not this or this) is a system for distributed computing, Piccolo is a new data-centric programming model for writing parallel in-memory applications in data centersUnlike existing data-flow models, Piccolo allows computation running on different machines to share distributed, mutable state via a key-value table interface. Traditional data-centric models (such as Hadoop) which present the user a single object at a time to operate on, Piccolo exposes a global table interface which is available to all parts of the computation simultaneously. This allows users to specify programs in an intuitive manner very similar to that of writing programs for a single machine.

Using an in-memory key-value store is a very different approach from the canonical map-reduce, which is based on using distributed file systems. The results are impressive:

Experiments have shown that Piccolo is fast and pro-vides excellent scaling for many applications. The performance of PageRank and k-means on Piccolo is 11×and 4× faster than that of Hadoop. Computing a PageR-ank iteration for a 1 billion-page web graph takes only 70 seconds on 100 EC2 instances. Our distributed webcrawler can easily saturate a 100 Mbps internet uplink when running on 12 machines.

Piccolo was presented at OSDI10. For the paper take a look at Piccolo: Building Fast, Distributed Programs with Partitioned Tables, here's the slide deck, and there's a video of the talk (very good).

Click to read more ...


Google Strategy: Tree Distribution of Requests and Responses

If a large number of leaf node machines send requests to a central root node then that root node can become overwhelmed:

  • The CPU becomes a bottleneck, for either processing requests or sending replies, because it can't possibly deal with the flood of requests.
  • The network interface becomes a bottleneck because a wide fan-in causes TCP drops and retransmissions, which causes latency. Then clients start retrying requests which quickly causes a spiral of death in an undisciplined system.

One solution to this problem is a strategy given by Dr. Jeff Dean, Head of Google's School of Infrastructure Wizardry, in this Stanford video presentation: Tree Distribution of Requests and Responses.

Instead of having a root node connected to leaves in a flat topology, the idea is to create a tree of nodes. So a root node talks to a number of parent nodes and the parent nodes talk to a number of leaf nodes. Requests are pushed down the tree through the parents and only hit a subset of the leaf nodes.

With this solution:

Click to read more ...


Sponsored Post: Karmasphere, Kabam, Opera Solutions, Percona, Appirio, Newrelic, Cloudkick, Membase, EA, Joyent, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

  • Percona Live to be held in San Francisco February 16th, 2011. A one day event run by the experts behind the MySQL Performance Blog.
  • A new round of Membase meetups have been planned for January 2011 for San Diego, Denver, Seattle, Vancouver and Chicago.

Cool Products and Services

Click to read more ...


Stuff The Internet Says On Scalability For January 28, 2011

 Submitted for your reading pleasure...


Comet - An Example of the New Key-Code Databases

Comet is an active distributed key-value store built at the University of Washington. The paper describing Comet is Comet: An active distributed key-value store, there are also slides, and a MP3 of a presentation given at OSDI '10. Here's a succinct overview of Comet:

Today's cloud storage services, such as Amazon S3 or peer-to-peer DHTs, are highly inflexible and impose a variety of constraints on their clients: specific replication and consistency schemes, fixed data timeouts, limited logging, etc. We witnessed such inflexibility first-hand as part of our Vanish work, where we used a DHT to store encryption keys temporarily. To address this issue, we built Comet, an extensible storage service that allows clients to inject snippets of code that control their data's behavior inside the storage service.

I found this paper quite interesting because it takes the initial steps of collocating code with a key-value store, which turns it into what might called a key-code store. This is something I've been exploring as a way of moving behavior to data in order to overcome network limitations in the cloud and provide other benefits. An innovator in this area is the Alchemy Database, which has already combined Redis and Lua. A good platform for this sort of thing might be Node.js integrated with V8. This would allow complex Javascript programs to run in an efficient evented container. There are a lot of implications of this sort of architecture, more about that later, but the Comet paper describes a very interesting start.

From the abstract and conclusion:

Click to read more ...


Google Pro Tip: Use Back-of-the-envelope-calculations to Choose the Best Design

How do you know which is the "best" design for a given problem? If, for example, you were given the problem of generating an image search results page of 30 thumbnails, would you load images sequentially? In parallel? Would you cache? How would you decide?

If you could harness the power of the multiverse you could try every possible option in the design space and see which worked best. But that's crazy impractical, isn't it?

Another option is to consider the order of various algorithm alternatives. As a prophet for the Golden Age of Computational Thinking, Google would definitely do this, but what else might Google do?

Use Back-of-the-envelope Calculations to Evaluate Different Designs

Jeff Dean, Head of Google's School of Infrastructure Wizardry—instrumental in many of Google's key systems: ad serving, BigTable; search, MapReduce, ProtocolBuffers—advocates evaluating different designs using back-of-the-envelope calculations.

Click to read more ...


PaaS shouldn’t be built in Silos

Unlike many of the existing Platforms, in this second-generation phase, its not going to be enough to package and bundle different individual middleware services and products (Web Containers, Messaging, Data, Monitoring, Automation and Control, Provisioning) and brand them under the same name to make them look as one. (Fusion? Fabric? A rose is a rose by any other name - and in this case, it's not a rose.)

The second-generation PaaS needs to come with a holistic approach that couples all those things together and provide a complete holistic experience. By that I mean that if I add a machine into cluster, I need to see that as an increase in capacity on my entire application stack, the monitoring system needs to discover that new machine and start monitoring it without any configuration setup, the load-balancer need to add it to its pool and so forth.

Our challenge as technologists would be to move from our current siloed comfort zone. That applies not just to the way we design our application architecture but to the way we build our development teams, and the way we evaluate new technologies. Those who are going to be successful are those who are going to design and measure how well all their technology pieces work together before anything else, and who look at a solution without reverence for past designs.

full story ...


75% Chance of Scale - Leveraging the New Scaleogenic Environment for Growth

"I'll never need to scale so why bother? We aren't Twitter or Facebook or Google after all." This is the most common email I get, a question in the form of a thinly disguised rationalization for not having to worry about scaling. And in these days of giant transformer-like machines they are probably right. But what if there are Barry Bonds enhancing type forces at work that argue for the chances of your needing to scale being higher than you think?

And if that happens, how will you cross the scalability chasm? Will you want to completely change your architecture or evolve it from a tool-chain that was meant to scale from the start? Architecturally, that's the question you have to answer. Today's tool-chains are making it possible to grow a system from small to large without needing to implement complete architectural phase changes at various scale inflection points, but that's a different topic. We're trying to think about why you may actually need to scale, that is the question.

Tumblr is a good example of a product that grew beyond expectation because they managed both to execute and harness powerful growth factors. Tumblr is a "light" blogging service that probably didn't think they were Twitter or Facebook or Google either, but need to scale they did. From Tumblr:

Click to read more ...