Strategy: Saving Your Butt With Deferred Deletes

Deferred Deletes is a technique where deleted items are marked as deleted but not garbage collected until some days or, preferably, weeks later. James Hamilton describes this strategy in his classic On Designing and Deploying Internet-Scale Services:

Never delete anything. Just mark it deleted. When new data comes in, record the requests on the way. Keep a rolling two week (or more) history of all changes to help recover from software or administrative errors. If someone makes a mistake and forgets the where clause on a delete statement (it has happened before and it will again), all logical copies of the data are deleted. Neither RAID nor mirroring can protect against this form of error. The ability to recover the data can make the difference between a highly embarrassing issue or a minor, barely noticeable glitch. For those systems already doing off-line backups, this additional record of data coming into the service only needs to be since the last backup. But, being cautious, we recommend going farther back anyway.
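To make the idea concrete, here is a minimal sketch of deferred deletes, assuming a single SQLite table. The table layout, the function names, and the two-week `RETENTION_SECONDS` value are illustrative assumptions (the window is borrowed from the quote above), not something Hamilton's paper prescribes.

```python
import sqlite3

# Assumed retention window: two weeks, following the quote above.
RETENTION_SECONDS = 14 * 24 * 3600


def open_store():
    """Create an in-memory table; deleted_at is NULL for live rows."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, body TEXT NOT NULL, deleted_at INTEGER)"
    )
    return db


def add_item(db, body):
    """Insert a live row and return its id."""
    return db.execute("INSERT INTO items (body) VALUES (?)", (body,)).lastrowid


def soft_delete(db, item_id, now):
    """Mark a row as deleted at time `now` (epoch seconds); keep the data."""
    db.execute(
        "UPDATE items SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, item_id),
    )


def undelete(db, item_id):
    """Recover a row that has been marked deleted but not yet purged."""
    db.execute("UPDATE items SET deleted_at = NULL WHERE id = ?", (item_id,))


def live_items(db):
    """Return ids of rows the application should still see."""
    rows = db.execute("SELECT id FROM items WHERE deleted_at IS NULL ORDER BY id")
    return [row[0] for row in rows]


def purge_expired(db, now):
    """Physically remove rows marked deleted longer ago than the window."""
    cutoff = now - RETENTION_SECONDS
    cur = db.execute(
        "DELETE FROM items WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,),
    )
    return cur.rowcount
```

In this sketch an accidental delete is only a flag change, so `undelete` can reverse it at any point before `purge_expired` runs past the retention window.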

Click to read more ...

Architecture (Apr 12, 2010)

This is a guest post by Alvaro Videla describing the architecture of a popular German dating site. This site is very much NSFW, so be careful before clicking on the link. What I found most interesting is how they manage to successfully blend a little of the old with a little of the new, using technologies like Nginx, MySQL, CouchDB, Erlang, Memcached, RabbitMQ, PHP, Graphite, Red5, and Tsung.

The site (NSFW) is the top dating website in Germany, and while it may be a small site compared to giants like Flickr or Facebook, we believe it's a nice architecture to learn from if you are starting to get some scaling problems.

The Stats

  • 2.000.000 users
  • 20.000 concurrent users
  • 300.000 private messages per day
  • 250.000 logins per day
  • We have a team of eleven developers, two designers and two sysadmins for this project.

Click to read more ...


Vagrant - Build and Deploy Virtualized Development Environments Using Ruby

One of the cool things we are seeing is more tools and tool chains for performing very high level operations quite simply. Vagrant is such a tool for building and distributing virtualized development environments.

Web developers use virtual environments every day with their web applications. From EC2 and Rackspace Cloud to specialized solutions such as EngineYard and Heroku, virtualization is the tool of choice for easy deployment and infrastructure management. Vagrant aims to take those very same principles and put them to work in the heart of the application lifecycle. By providing easy to configure, lightweight, reproducible, and portable virtual machines targeted at development environments, Vagrant helps maximize your productivity and flexibility.

If you've created a build and deployment system before, Vagrant does a lot of the work for you:

Click to read more ...


Hot Scalability Links for April 8, 2010

  1. Scalability porn (SFW). Real-time meter for the number of ads being served by DoubleClick. Amazing. A constant ~390,000 impressions a second are being served and 25 trillion since 1996. Thanks to Mike Rhoads for the title idea.
  2. Scalability? Don't worry. Application complexity? Worry by Joe McKendrick. The next challenge on enterprise agendas: application complexity. This is something that lots of hardware — whether from the cloud or internal data center — cannot fix
  3. Leo Laporte and Steve Gibson talked about how the iPad was a denial of service attack on UPS delivery schedules. UPS trucks were filled with iPads.
  4. Cassandra: Fact vs fiction. Jonathan Ellis puts the beatdown on Cassandra misinformation. Don't you dare say Cassandra can't work across datacenters!
  5. JIT'd code calling conventions. Cliff Click Jr shows how Java’s calling convention can match compiled C code in speed, but allows for the flexibility of calling (code,slow) non-JIT'd code. Some assembly code required.
  6. Stonebraker on CAP Theorem and Databases. James Hamilton: Don’t throw full consistency out too early. For many applications, it is both affordable and helps reduce application implementation errors.

Click to read more ...


Strategy: Make it Really Fast vs Do the Work Up Front

In Cool spatial algos with Neo4j: Part 1 - Routing with A* in Ruby, Peter Neubauer not only does a fantastic job explaining a complicated routing algorithm using the graph database Neo4j, but he also surfaces an interesting architectural conundrum: make it really fast so work can be done on the reads, or do all the work on the writes so the reads are really fast.

The money quote pointing out the competing options is:

[Being] able to do these calculations in sub-second speeds on graphs of millions of roads and waypoints makes it possible in many cases to abandon the normal approach of precomputing indexes with K/V stores and be able to put routing into the critical path with the possibility to adapt to the live conditions and build highly personalized and dynamic spatial services.

The poster child for the precompute strategy is SimpleGeo, a startup that is building a "scaling infrastructure for geodata." Their strategy for handling geodata is to use Cassandra and build two clusters: one for indexes and one for records. The records cluster is a simple data lookup. The index cluster has a carefully constructed key for every lookup scenario. The indexes are computed on the write, so reads are very fast. Ad hoc queries are not allowed. You can only search on what has been precomputed.
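A rough sketch of the do-the-work-on-write idea follows. It is not SimpleGeo's design: the in-memory dictionaries stand in for the two clusters, and the grid-cell key, `CELL_DEGREES`, and all class and method names are hypothetical choices made only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical cell size in degrees; stands in for a carefully designed key.
CELL_DEGREES = 0.1


def cell_key(lat, lon):
    """Map a coordinate to the grid cell that contains it."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))


class PrecomputedGeoStore:
    """Toy store with a records map and an index map built at write time."""

    def __init__(self):
        self.records = {}              # record id -> full record
        self.index = defaultdict(set)  # cell key -> ids of records in the cell

    def write(self, record_id, lat, lon, payload):
        """Store the record and update the index; all the work happens here."""
        old = self.records.get(record_id)
        if old is not None:
            # Remove the stale index entry when a record moves.
            self.index[cell_key(old["lat"], old["lon"])].discard(record_id)
        self.records[record_id] = {"lat": lat, "lon": lon, "payload": payload}
        self.index[cell_key(lat, lon)].add(record_id)

    def in_cell(self, lat, lon):
        """Read path: one index lookup, no scanning and no ad hoc filtering."""
        return sorted(self.index.get(cell_key(lat, lon), set()))
```

The trade-off shows up in the API: `in_cell` is the only query this store can answer, because it is the only one whose key was computed on the write.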

What I think Peter is saying is that because a graph database represents the problem in such a natural way and graph navigation is so fast, it becomes possible to run even large complex queries in real-time. No special infrastructure is needed.

If you are creating a geo service, which approach would you choose? Before you answer, let's first ponder: is the graph database solution really solving the same problem as SimpleGeo is solving?

Click to read more ...


Sponsored Post: Event - Social Developer Summit

Social Developer Summit - June 29, 2010 - San Francisco, CA

A meeting of the technically social - Building, scaling, and profiting in a social age

Whether it's social games, social news, social discovery, social search, or other forms of social solutions, developers today are facing new hurdles in building instantly scalable products. As new technologies emerge to address the challenges faced by social application developers, it's increasingly important to come together for knowledge sharing purposes.

The first Social Developer Summit will bring together social application developers to discuss the challenges, solutions, and best practices for building applications in the rapidly expanding social web economy. At the Social Developer Summit, industry experts will share tips and case studies for building high performance social web products.

For more information please take a look at Social Developer Summit.

If you are interested in a sponsored post for an event, job, or product, please take a look at the advertising section. 


Intercloud: How Will We Scale Across Multiple Clouds?

In A Brief History of the Internet it was revealed that the Internet was based on the idea that there would be multiple independent networks of rather arbitrary design. The Internet as we now know it embodies a key underlying technical idea, namely that of open architecture networking. In this approach, the choice of any individual network technology was not dictated by a particular network architecture but rather could be selected freely by a provider and made to interwork with the other networks through a meta-level "Internetworking Architecture".  

With the cloud we are in the same situation today, just a layer or two higher up the stack. We have independent clouds that we would like to connect and work seamlessly together, preferably with the ease with which we currently connect nodes to a network and networks to the Internet. This technology seems to be called the Intercloud: an interconnected global "cloud of clouds" as opposed to the Internet which is a "network of networks."

The person most often called the father of the Internet says in Cloud Computing and the Internet that we are ripe for an Intercloud in the same way we were once ripe for the Internet:

Click to read more ...


Hot Scalability Links for April 1, 2010

  1. Why NoSQL Will Not Die. Stephan Schmidt explains why you may wait a long time for NoSQL to go to that great bit bucket in the sky.
  2. DBMS Musings: Distinguishing Two Major Types of Column-Stores by Daniel Abadi. I have noticed that Bigtable, HBase, Hypertable, and Cassandra are being called column-stores with increasing frequency, due to their ability to store and access column families separately. This makes them appear to be in the same category as column-stores such as Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright, which also are able to access columns separately.
  3. Cloud Economics, By The Square Foot by Rich Miller. But cloud computing offers a middle path, offering cost and usability advantages for customers, as well as an attractive return for providers.
  4. PostgreSQL: meet your queue by Theo Schlossnagle. I really think that cueing your database to publish over AMQP is the bees knees and it turns out I wasn't alone!
  5. Scaling GIS Data in Non-relational Data Store by Mike Malone. How SimpleGeo uses NoSQL and other technologies. Yes, they still use memcached. Caching ain’t going anywhere.
  6. CLTV45: The Evolution of the Graph Data Structure from Research to Production. In this recording from “NoSQL Live Boston” we learn how Graph Data Structures evolved from research into production.
  7. Spanner: Google’s next Massive Storage and Computation infrastructure by Royans. MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on.

Click to read more ...


Running Large Graph Algorithms - Evaluation of Current State-of-the-Art and Lessons Learned

On the surface nothing appears more different than soft data and hard raw materials like iron. Then isn’t it ironic, in the Alanis Morissette sense, that in this Age of Information, great wealth still lies hidden deep beneath piles of stuff? It's so strange how directly digging for dollars in data parallels the great wealth producing models of the Industrial Revolution.

The pile of stuff is the Internet. It takes lots of prospecting to find the right stuff. Mighty web crawling machines tirelessly collect stuff, bringing it into their huge maws, then depositing load after load into rack after rack of distributed file system machines. Then armies of still other machines take this stuff and strip out the valuable raw materials, which in the Information Age are endless bytes of raw data. Link clicks, likes, page views, content, headlines, searches, inbound links, outbound links, search clicks, hashtags, friends, purchases: anything and everything you do on the Internet is a valuable raw material.

By itself data is no more useful than a truckload of iron ore. Data must be brought to a factory. It must be purified, processed, and formed. That’s the job for a new field of science called Data Science. Yes, while you weren't looking a whole new branch of science was created. It makes sense in a way. Since data is a new kind of material we need a new profession paralleling that of the Material Scientist, someone who seeks to deeply understand data: the Data Scientist. We aren't so much in the age of data, as the age of data inference.

Click to read more ...


Strategy: Caching 404s Saved the Onion 66% on Server Time

In the article The Onion Uses Django, And Why It Matters To Us, a lot of interesting points are made about their ambitious infrastructure move from Drupal/PHP to Django/Python:

  • Thanks to their previous experience moving the A.V. Club website, the move wasn't that hard; it just took time and work.
  • Churn in core framework APIs makes it more attractive to move than stay.
  • Supporting the structure of older versions of the site is an unsolved problem.
  • The built-in Django admin saved a lot of work.
  • Group development is easier with "fewer specialized or hacked together pieces."
  • They use IRC for distributed development, Sphinx for full-text search, Capistrano for deployment, and Git for version control.
  • Nginx is the media server and reverse proxy.
  • HAProxy made the launch process a 5 second procedure.
  • Clean component separation makes moving easier.
  • ORM with complicated querysets is a performance problem.
  • Memcached is used for caching rendered pages.
  • The CDN checks for updates every 10 minutes.
  • Videos, articles, images, and 404 pages are all served by a CDN.
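The idea in the title, caching 404 responses alongside normal pages, can be sketched as a small TTL cache placed in front of a page renderer. This is only an illustration, not The Onion's implementation: the `TTLPageCache` class, the `render` callback, and the 600 second default (borrowed from the 10 minute CDN interval mentioned above) are all assumptions.

```python
import time


class TTLPageCache:
    """Cache rendered responses, including 404s, for a fixed time."""

    def __init__(self, render, ttl_seconds=600, clock=time.monotonic):
        self._render = render    # hypothetical callback: path -> (status, body)
        self._ttl = ttl_seconds  # assumed 10 minute lifetime
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # path -> (expires_at, status, body)

    def get(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            # Fresh entry: answer without touching the backend.
            return entry[1], entry[2]
        status, body = self._render(path)
        # Cache "not found" answers as well as successful pages, so repeated
        # requests for a missing URL do not reach the backend every time.
        if status in (200, 404):
            self._entries[path] = (now + self._ttl, status, body)
        return status, body
```

With a cache like this, a crawler that keeps requesting the same invalid URL reaches the renderer once per TTL window instead of once per request.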

But the most surprising point had to be:

Click to read more ...