Working With Large Data Sets

This is an excerpt from my blogpost Working With Large Data Sets...

For the past 18 months I’ve moved from working on the SMTP proxy to working on our other systems, all of which make use of the data we collect from each connection. It’s a fair amount of data and it can be up to 2Kb in size for each connection. Our servers receive approximately 1000 of these pieces of data per second, which is fairly sustained due to our global distribution of customers. If you compare that to Twitter’s peak of 3,283 tweets per second (maximum of 140 characters), you can see it’s not a small amount of data that we are dealing with here.

I recently set out to scientifically prove the benefits of throttling, which is our technology for slowing down connections in order to detect spambots, who are kind enough to disconnect quite quickly when they see a slow connection. Due to the nature of the data we had, I needed to work with a long range of data to show evidence that an IP that appeared on Spamhaus had previously been throttled and disconnected, and then measure the duration until it appeared on Spamhaus. I set a job to pre-process a selected set of customers data and arbitrarily decided 66 days would be a good amount to process, as this was 2 months plus a little breathing room. I knew from my experience it was possible that it might take 2 months for a bad IP to be picked up by Spamhaus.

I extracted 28,204,693 distinct IPs, some of which were seen over million times in this data set.

Click here to


Applying Scalability Patterns to Infrastructure Architecture

Too often software design patterns are overlooked by network and application delivery network architects but these patterns are often equally applicable to addressing a broad range of architectural challenges in the application delivery tier of the data center.

Click to read more ...


Sponsored Post: Joyent, DeviantART, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Cool Products and Services

Click to read more ...


Playfish's Social Gaming Architecture - 50 Million Monthly Users and Growing

Ten million players a day and over fifty million players a month interact socially with friends using Playfish games on social platforms like The Facebook, MySpace, and the iPhone. Playfish was an early innovator in the fastest growing segment of the game industry: social gaming, which is the love child between casual gaming and social networking. Playfish was also an early adopter of the Amazon cloud, running their system entirely on 100s of cloud servers. Playfish finds itself at the nexus of some hot trends (which may by why EA bought them for $300 million and they think a $1 billion game is possible): building games on social networks, build applications in the cloud, mobile gaming, leveraging data driven design to continuously evolve and improve systems, agile development and deployment, and selling virtual good as a business model.

How can a small company make all this happen? To explain the magic I interviewed Playfish's Jodi Moran, Senior Director of Engineering, and Martin Frost, Chief Architext, first Engineer and Operations guy at Playfish. Lots of good stuff, so let's move on to the nitty gritty.

Click to read more ...


Hot Scalability Links For Sep 17, 2010

  • Disqus - Scaling the Worlds Largest Django App. Interesting overview of a commenting system with 75 million comments and 250 million visitors. Lots of good details on how they partition their database, testing, continuous integration, feature switches, caching, delayed signals, and more.
  • Things I learnt tracking a billion events in 24 hours: Know your host, Scaling isn't just servers, My servers need to talk to me more, Kill switches for users, What you don't know is the problem, Don't mix server roles, Know your most important users outside of your site.
  • Tweets of Gold:
    • georgebarnett: I read High Scalability for useful articles about large scaling. Sadly though, nothing useful ever shows up. #NoLongerBothering
    • northscale: wow that is fast! :) RT @cgoldberg: was just running > 100k ops/sec against my 2-node #Membase cluster... zazooom #nosql
    • turbofunctor: The root of many (horizontal) scalability problems is an application level access to a writable filesystem. (Thus, #appengine.)
    • gwenshap: is like Vogue for IT operations. Map-reduce is so last season.

    Click to read more ...


Strategy: Buy New, Don't Fix the Old

This strategy is from the Large Hadron Collider project:
Improvements in performance per Watt have caused CERN to no longer sign hardware support contracts longer than three years. Machines run until they die. They have a very high utilization of equipment (‘duty cycle’, 7 x 24 x 365). Replacing hardware makes more sense because of the lower cost and the power savings of new hardware.

How Can the Large Hadron Collider Withstand One Petabyte of Data a Second?

Why is there something rather than nothing? That's the kind of question the Large Hadron Collider in CERN is hopefully poised to answer. And what is the output of this beautiful 17-mile long, 6 billion dollar wabi-sabish proton smashing machine? Data. Great heaping torrents of Grand Canyon sized data. 15 million gigabytes every year. That's 1000 times the information printed in books every year. It's so much data 10,000 scientists will use a grid of 80,000+ computers, in 300 computer centers , in 50 different countries just to help make sense of it all.

How will all this data be collected, transported, stored, and analyzed? It turns out, using what amounts to sort of Internet of Particles instead of an Internet of Things.

Click to read more ...


Google's Colossus Makes Search Real-time by Dumping MapReduce

As the Kings of scaling, when Google changes its search infrastructure over to do something completely different, it's news. In Google search index splits with MapReduce, an exclusive interview by Cade Metz with Eisar Lipkovitz, a senior director of engineering at Google, we learn a bit more of the secret scaling sauce behind Google Instant, Google's new faster, real-time search system.

The challenge for Google has been how to support a real-time world when the core of their search technology, the famous MapReduce, is batch oriented. Simple, they got rid of MapReduce. At least they got rid of MapReduce as the backbone for calculating search indexes. MapReduce still excels as a general query mechanism against masses of data, but real-time search requires a very specialized tool, and Google built one. Internally the successor to Google's famed Google File System, was code named Colossus.

Details are slowly coming out about their new goals and approach:

Click to read more ...


How did Google Instant become Faster with 5-7X More Results Pages?

We don't have a lot of details on how Google pulled off their technically very impressive Google Instant release, but in Google Instant behind the scenes, they did share some interesting facts:

  • Google was serving more than a billion searches per day.
  • With Google Instant they served 5-7X more results pages than previously.
  • Typical search results were returned in less than a quarter of second.
  • A team of 50+ worked on the project for an extended period of time.

Although Google is associated with muscular data centers, they just didn't throw more server capacity at the problem, they worked smarter too. What were their general strategies?

Click to read more ...


6 Scalability Lessons

Jesper Söderlund not only put together a few interesting scalability patterns, he also came up with a few interesting scalability lessons:

  • Lesson #1. Put Smarty compile and template caches on an active-active DRBD cluster with high load and your servers will DIE!
  • Lesson #2. Don't use out-of-the-box configurations.
  • Lesson #3. Single points of contention will eventually become a bottleneck.
  • Lesson #4. Plan in advance. 
  • Lesson #5. Offload your databases as much as possible.
  • Lesson #6. File systems matter and can run out of space / inodes.

For more details and explanations see the original post.