Friday
Apr162010

Hot Scalability Links for April 16, 2010

Wednesday
Apr142010

Parallel Information Retrieval and Other Search Engine Goodness

Parallel Information Retrieval is a sample chapter in what appears to be a book-in-progress titled Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher of Google Inc. and Charles L. A. Clarke and Gordon V. Cormack, both of the University of Waterloo. The full table of contents is online and looks really interesting:

Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects.

Currently available is the full text of chapters: Introduction, Basic Techniques, Static Inverted Indices, Index Compression, and Parallel Information Retrieval. Parallel Information Retrieval is really meaty:

Click to read more ...

Tuesday
Apr132010

Strategy: Saving Your Butt With Deferred Deletes

Deferred Deletes is a technique where deleted items are marked as deleted but not garbage collected until some days or preferably weeks later. James Hamilton describes this strategy in his classic On Designing and Deploying Internet-Scale Services:

Never delete anything. Just mark it deleted. When new data comes in, record the requests on the way. Keep a rolling two week (or more) history of all changes to help recover from software or administrative errors. If someone makes a mistake and forgets the where clause on a delete statement (it has happened before and it will again), all logical copies of the data are deleted. Neither RAID nor mirroring can protect against this form of error. The ability to recover the data can make the difference between a highly embarrassing issue or a minor, barely noticeable glitch. For those systems already doing off-line backups, this additional record of data coming into the service only needs to be since the last backup. But, being cautious, we recommend going farther back anyway.
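A minimal sketch of the mark-then-purge idea, assuming a single SQLite table. The table name `items`, the `deleted_at` column, and the helper names are hypothetical and chosen for illustration only; the two-week retention window comes from the quote above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Assumed retention window: the "rolling two week" history from the quote.
RETENTION = timedelta(days=14)


def open_db():
    """Create an in-memory database with a hypothetical items table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, body TEXT, deleted_at REAL)"
    )
    return db


def soft_delete(db, item_id, now):
    """Mark a row as deleted instead of removing it."""
    db.execute(
        "UPDATE items SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now.timestamp(), item_id),
    )


def undelete(db, item_id):
    """Recover from a mistaken delete while the row still exists."""
    db.execute("UPDATE items SET deleted_at = NULL WHERE id = ?", (item_id,))


def live_items(db):
    """Normal reads only see rows that are not marked deleted."""
    rows = db.execute("SELECT id FROM items WHERE deleted_at IS NULL ORDER BY id")
    return [row[0] for row in rows]


def purge(db, now):
    """Garbage collect rows whose delete mark is older than the retention window."""
    cutoff = (now - RETENTION).timestamp()
    cur = db.execute(
        "DELETE FROM items WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,),
    )
    return cur.rowcount
```

In this sketch a runaway delete only sets `deleted_at`, so `undelete` can restore the rows at any point before a scheduled `purge` passes the retention cutoff.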

Click to read more ...

Monday
Apr122010

Poppen.de Architecture

This is a guest post by Alvaro Videla describing the architecture of Poppen.de, a popular German dating site. This site is very much NSFW, so be careful before clicking on the link. What I found most interesting is how they manage to successfully blend a little of the old with a little of the new, using technologies like Nginx, MySQL, CouchDB, Erlang, Memcached, RabbitMQ, PHP, Graphite, Red5, and Tsung.

What is Poppen.de?

Poppen.de (NSFW) is the top dating website in Germany, and while it may be a small site compared to giants like Flickr or Facebook, we believe it's a nice architecture to learn from if you are starting to get some scaling problems.

The Stats

  • 2.000.000 users
  • 20.000 concurrent users
  • 300.000 private messages per day
  • 250.000 logins per day
  • We have a team of eleven developers, two designers and two sysadmins for this project.

Click to read more ...

Friday
Apr092010

Vagrant - Build and Deploy Virtualized Development Environments Using Ruby

One of the cool things we are seeing is more tools and tool chains for performing very high level operations quite simply. Vagrant is such a tool for building and distributing virtualized development environments.

Web developers use virtual environments every day with their web applications. From EC2 and Rackspace Cloud to specialized solutions such as EngineYard and Heroku, virtualization is the tool of choice for easy deployment and infrastructure management. Vagrant aims to take those very same principles and put them to work in the heart of the application lifecycle. By providing easy to configure, lightweight, reproducible, and portable virtual machines targeted at development environments, Vagrant helps maximize your productivity and flexibility.

If you've created a build and deployment system before, you'll appreciate that Vagrant does a lot of the work for you:

Click to read more ...

Thursday
Apr082010

Hot Scalability Links for April 8, 2010

  1. Scalability porn (SFW). Real-time meter for the number of ads being served by DoubleClick. Amazing. A constant ~390,000 impressions a second are being served and 25 trillion since 1996. Thanks to Mike Rhoads for the title idea.
  2. Scalability? Don't worry. Application complexity? Worry by Joe McKendrick. The next challenge on enterprise agendas: application complexity. This is something that lots of hardware, whether from the cloud or internal data center, cannot fix.
  3. Leo Laporte and Steve Gibson talked about how the iPad was a denial of service attack on UPS delivery schedules. UPS trucks were filled with iPads.
  4. Cassandra: Fact vs fiction. Jonathan Ellis puts the beatdown on Cassandra misinformation. Don't you dare say Cassandra can't work across datacenters!
  5. JIT'd code calling conventions. Cliff Click Jr shows how Java’s calling convention can match compiled C code in speed, but allows for the flexibility of calling (cold, slow) non-JIT'd code. Some assembly code required.
  6. Stonebraker on CAP Theorem and Databases. James Hamilton: Don’t throw full consistency out too early. For many applications, it is both affordable and helps reduce application implementation errors.

Click to read more ...

Tuesday
Apr062010

Strategy: Make it Really Fast vs Do the Work Up Front

In Cool spatial algos with Neo4j: Part 1 - Routing with A* in Ruby Peter Neubauer not only does a fantastic job explaining a complicated routing algorithm using the graph database Neo4j, but he surfaces an interesting architectural conundrum: make it really fast so work can be done on the reads or do all the work on the writes so the reads are really fast.

The money quote pointing out the competing options is:

[Being] able to do these calculations in sub-second speeds on graphs of millions of roads and waypoints makes it possible in many cases to abandon the normal approach of precomputing indexes with K/V stores and be able to put routing into the critical path with the possibility to adapt to the live conditions and build highly personalized and dynamic spatial services.

The poster boy for the precompute strategy is SimpleGeo, a startup that is building a "scaling infrastructure for geodata." Their strategy for handling geodata is to use Cassandra and build two clusters: one for indexes and one for records. The records cluster is a simple data lookup. The index cluster has a carefully constructed key for every lookup scenario. The indexes are computed on the write, so reads are very fast. Ad hoc queries are not allowed. You can only search on what has been precomputed.
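To make the precompute-on-write idea concrete, here is a minimal in-memory sketch. It is not SimpleGeo's implementation: two plain dictionaries stand in for the records and index clusters, and the grid-cell key, the `CELL_DEG` size, and the `GeoStore` class are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Assumed grid size in degrees; a real system would pick its own key scheme.
CELL_DEG = 0.1


def cell_key(lat, lon, category):
    """Build the precomputed lookup key: (grid row, grid column, category)."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG), category)


class GeoStore:
    """Hypothetical stand-in for a records store plus a write-time index."""

    def __init__(self):
        self.records = {}               # record id -> payload (simple lookup)
        self.index = defaultdict(list)  # precomputed key -> record ids

    def write(self, rec_id, lat, lon, category, payload):
        # All index work happens here, on the write path.
        self.records[rec_id] = payload
        self.index[cell_key(lat, lon, category)].append(rec_id)

    def read(self, lat, lon, category):
        # Reads are two dictionary lookups; only precomputed keys can be queried.
        ids = self.index.get(cell_key(lat, lon, category), [])
        return [self.records[i] for i in ids]
```

A query that was not anticipated when the key was designed, such as "all cafes within 50 km", has no index entry to hit, which is the ad hoc query restriction described above.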

What I think Peter is saying is that because a graph database represents the problem in such a natural way and graph navigation is so fast, it becomes possible to run even large complex queries in real-time. No special infrastructure is needed.
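For comparison, here is a small sketch of the compute-on-read side: a textbook A* search over an in-memory adjacency dictionary. It does not use Neo4j or Peter's Ruby code; the graph layout, the coordinates, and the straight-line heuristic are assumptions for illustration, and the heuristic is only valid when no edge is shorter than the straight-line distance between its endpoints.

```python
import heapq
import math


def a_star(graph, coords, start, goal):
    """Return (cost, path) for the cheapest route, or None if unreachable.

    graph:  node -> {neighbor: edge cost}
    coords: node -> (x, y), used for the straight-line heuristic
    """
    def h(node):
        return math.dist(coords[node], coords[goal])

    # Heap entries: (estimated total cost, cost so far, node, path so far)
    frontier = [(h(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, math.inf):
            continue  # stale entry; a cheaper route to this node was found
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(
                    frontier, (new_cost + h(nxt), new_cost, nxt, path + [nxt])
                )
    return None
```

Because nothing is precomputed, changing an edge cost (say, to model congestion) is reflected in the very next query, which is the "adapt to the live conditions" property from the quote.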

If you are creating a geo service, which approach would you choose? Before you answer, let's first ponder: is the graph database solution really solving the same problem as SimpleGeo is solving?

Click to read more ...

Tuesday
Apr062010

Sponsored Post: Event - Social Developer Summit

Social Developer Summit - June 29, 2010 - San Francisco, CA

A meeting of the technically social - Building, scaling, and profiting in a social age

Whether it's social games, social news, social discovery, social search, or other forms of social solutions, developers today are facing new hurdles in building instantly scalable products. As new technologies emerge to address the challenges faced by social application developers, it's increasingly important to come together for knowledge sharing purposes.

The first Social Developer Summit will bring together social application developers to discuss the challenges, solutions, and best practices for building applications in the rapidly expanding social web economy. At the Social Developer Summit, industry experts will share tips and case studies for building high performance social web products.

For more information please take a look at Social Developer Summit.

If you are interested in a sponsored post for an event, job, or product, please take a look at the advertising section. 

Monday
Apr052010

Intercloud: How Will We Scale Across Multiple Clouds?

In A Brief History of the Internet it was revealed that the Internet was based on the idea that there would be multiple independent networks of rather arbitrary design. The Internet as we now know it embodies a key underlying technical idea, namely that of open architecture networking. In this approach, the choice of any individual network technology was not dictated by a particular network architecture but rather could be selected freely by a provider and made to interwork with the other networks through a meta-level "Internetworking Architecture".  

With the cloud we are in the same situation today, just a layer or two higher up the stack. We have independent clouds that we would like to connect and have work together seamlessly, preferably with the ease with which we currently connect nodes to a network and networks to the Internet. This technology seems to be called the Intercloud: an interconnected global "cloud of clouds" as opposed to the Internet, which is a "network of networks."

Vint Cerf, the person most often called the father of the Internet, says in Cloud Computing and the Internet that we are ripe for an Intercloud in the same way we were once ripe for the Internet:

Click to read more ...

Thursday
Apr012010

Hot Scalability Links for April 1, 2010

  1. Why NoSQL Will Not Die. Stephan Schmidt explains why you may wait a long time for NoSQL to go to that great bit bucket in the sky.
  2. DBMS Musings: Distinguishing Two Major Types of Column-Stores by Daniel Abadi. I have noticed that Bigtable, HBase, Hypertable, and Cassandra are being called column-stores with increasing frequency, due to their ability to store and access column families separately. This makes them appear to be in the same category as column-stores such as Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright, which also are able to access columns separately.
  3. Cloud Economics, By The Square Foot by Rich Miller. But cloud computing offers a middle path, offering cost and usability advantages for customers, as well as an attractive return for providers.
  4. PostgreSQL: meet your queue by Theo Schlossnagle. I really think that cueing your database to publish over AMQP is the bees knees and it turns out I wasn't alone!
  5. Scaling GIS Data in Non-relational Data Store by Mike Malone. How SimpleGeo uses NoSQL and other technologies. Yes, they still use memcached. Caching ain’t going anywhere.
  6. CLTV45: The Evolution of the Graph Data Structure from Research to Production. In this recording from “NoSQL Live Boston” we learn how Graph Data Structures evolved from research into production.
  7. Spanner: Google’s next Massive Storage and Computation infrastructure by Royans. MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on.
Click to read more ...