Hot Scalability Links for April 30, 2010

  • I Want a New Data Store. Jeremy Zawodny of Craigslist wants a new database, one that can do what it should: perform alter table operations faster, run efficient queries when most of the data is on disk and not in RAM, and match their data, which now looks more document-oriented than relational. A lot of people are willing to help.
  • Computer Science Unplugged. An extensive collection of resources that teach principles of Computer Science such as binary numbers, algorithms and data compression through engaging games and puzzles that use cards, string, crayons and lots of running around. And it's free! Fascinating interview with Tim Bell on teaching complex computing concepts, creating makers not just users, and how to change schools. From O'Reilly Radar.
  • Akamai’s Network Now Pushes Terabits of Data Every Second. Akamai handles 12 million requests per second, logs more than 500 billion requests for content per day, and sends 3.45 terabits per second of data.

Click to read more ...


Behind the scenes of an online marketplace

In a presentation originally given at the 4th O2 Hosting Event in Hamburg, I spoke about the technology behind Hitmeister, a large online marketplace in Germany.

Some of the topics discussed include:

  • what makes up a marketplace, technically?
  • system principles
  • development patterns
  • tools philosophy
  • data model
  • hardware

I am looking forward to comments and suggestions for both the presentation and our work.


Product: SciDB - A Science-Oriented DBMS at 100 Petabytes

Scientists are doing it for themselves. Doing what? Databases. The idea is that most databases are designed to meet the needs of businesses, not science, so scientists are banding together to create their own Domain Specific Database, for science. The goal is to be able to handle datasets in the 100PB range and larger.

SciDB, Inc. is building an open source database technology product designed specifically to satisfy the demands of data-intensive scientific problems. With the advice of the world's leading scientists across a variety of disciplines including astronomy, biology, physics, oceanography, atmospheric sciences, and climatology, our computer scientists are currently designing and prototyping this technology.

The scientists that are participating in our open source project believe that the SciDB database — when completed — will dramatically impact their ability to conduct their experiments faster and more efficiently and further improve the quality of life on our planet by enabling them to run experiments that were previously impossible due to the limitations of existing database systems and infrastructure. Many of the world's leading computer scientists with expertise in database systems have contributed to the design and architecture of the system to meet the needs of the world's scientists.

SciDB looks like a cool project and follows what might be considered a trend: instead of beating a general tool into submission, build a specialized tool that does what you need it to do. More details about SciDB can be found in the paper A Demonstration of SciDB: A Science-Oriented DBMS. A nice, succinct poster summarizing the product is also available.

Some interesting bits from the paper:

Click to read more ...


Elasticity for the Enterprise -- Ensuring Continuous High Availability in a Disaster Failure Scenario

Many enterprises' high-availability architecture is based on the assumption that you can prevent failure from happening by putting all your critical data in a centralized database, backing it up with expensive storage, and replicating it somehow between the sites. As I argued in one of my previous posts (Why Existing Databases (RAC) are So Breakable!), many of those assumptions are broken at their core: storage is doomed to failure just like any other device, expensive hardware doesn’t make things any better, and database replication is often not enough.

Click to read more ...


Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure

Imagine a single search request coursing through Google's massive infrastructure. A single request can run across thousands of machines and involve hundreds of different subsystems. And oh by the way, you are processing more requests per second than any other system in the world. How do you debug such a system? How do you figure out where the problems are? How do you determine if programmers are coding correctly? How do you keep sensitive data secret and safe? How do you ensure products don't use more resources than they are assigned? How do you store all the data? How do you make use of it?

That's where Dapper comes in. Dapper is Google's tracing system and it was originally created to understand the system behaviour from a search request. Now Google's production clusters generate more than 1 terabyte of sampled trace data per day. So how does Dapper do what Dapper does?
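To give a feel for what "sampled trace data" means, here is a minimal sketch of request tracing with sampling. It is not Dapper's implementation: the `Tracer` and `Span` names, the in-memory `collected` list and the `sample_rate` parameter are all assumptions made for illustration. The idea shown is that the decision to trace is made once, at the root of a request, and every nested unit of work inherits both that decision and the trace id, so a sampled request can later be reassembled as a tree.

```python
import random
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One timed unit of work inside a traced request (hypothetical shape)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float = 0.0
    end: float = 0.0


class Tracer:
    """Toy single-threaded tracer; sample_rate is the fraction of requests kept."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.collected: List[Span] = []          # stand-in for a trace store
        self._stack: List[Optional[Span]] = []   # open spans; None = unsampled

    @contextmanager
    def span(self, name: str):
        if not self._stack:
            # Root of a request: make the sampling decision exactly once.
            sampled = self.rng.random() < self.sample_rate
            trace_id = uuid.uuid4().hex if sampled else None
            parent_id = None
        else:
            # Nested work inherits the root's decision and trace id.
            parent = self._stack[-1]
            sampled = parent is not None
            trace_id = parent.trace_id if sampled else None
            parent_id = parent.span_id if sampled else None

        current = None
        if sampled:
            current = Span(trace_id, uuid.uuid4().hex, parent_id, name,
                           start=time.time())
        self._stack.append(current)
        try:
            yield current
        finally:
            self._stack.pop()
            if current is not None:
                current.end = time.time()
                self.collected.append(current)


if __name__ == "__main__":
    tracer = Tracer(sample_rate=1.0)  # keep everything for the demo
    with tracer.span("search"):
        with tracer.span("index-lookup"):
            pass
    for s in tracer.collected:
        print(s.name, s.trace_id, s.parent_id)
```

Lowering `sample_rate` trades completeness for overhead and storage: unsampled requests create no spans at all, which is why a low rate keeps the volume of trace data manageable.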

Click to read more ...


Sponsored Post: Event - Social Developer Summit

Social Developer Summit - June 29, 2010 - San Francisco, CA

A meeting of the technically social - Building, scaling, and profiting in a social age

Whether it's social games, social news, social discovery, social search, or other forms of social solutions, developers today are facing new hurdles in building instantly scalable products. As new technologies emerge to address the challenges faced by social application developers, it's increasingly important to come together for knowledge sharing purposes.

Click to read more ...


The cost of High Availability (HA) with Oracle 

What's the cost of downtime to your business? $100,000 per hour, $1,000,000 or more? The recent volcanic ash that grounded European flights is estimated to be costing the airlines $200M a day. In the IT world, High Availability (HA) architectures allow for disaster recovery as well as uninterrupted business continuity during system failure.
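To make the opening question concrete, here is a rough sketch of how an availability level translates into a yearly downtime cost. The function name and the 99.9% and 99.99% availability figures are illustrative assumptions; only the $100,000 per hour figure is taken from the text above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime for a given availability fraction.

    availability is a fraction such as 0.999 ("three nines");
    cost_per_hour is what one hour of outage costs the business.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    return downtime_hours * cost_per_hour


if __name__ == "__main__":
    # Assumed availability levels, with the $100,000/hour figure from the text.
    for availability in (0.999, 0.9999):
        cost = annual_downtime_cost(availability, 100_000)
        print(f"{availability:.2%} available: about ${cost:,.0f} per year")
```

At 99.9% availability a system is down roughly 8.76 hours a year, so at $100,000 per hour the expected loss is about $876,000 annually. That kind of number is what the price of an HA solution has to be weighed against.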

This post focuses on a customer’s backend, composed of a business application stack supported by a dozen Oracle databases. They wish to equip this infrastructure with HA features and ensure that outages do not cost them business. How do we address the challenge of pricing the complete solution, with hardware, software, services and annual support?



Strategy: Order Two Mediums Instead of Two Smalls and the EC2 Buffet

Vaibhav Puranik, in Web serving in the cloud – our experiences with nginx and instance sizes, describes their experience trying to maximize traffic and minimize their web serving costs on EC2. Initially they tested with two m1.small instances and then they switched to two c1.medium instances. The m1s are the standard instance types and the c1s are the high-CPU instance types. Obviously the mediums have greater capability, but the cost difference was interesting:

Click to read more ...


Hot Scalability Links for April 16, 2010


Parallel Information Retrieval and Other Search Engine Goodness

Parallel Information Retrieval is a sample chapter in what appears to be a book-in-progress titled Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher of Google Inc. and Charles L. A. Clarke and Gordon V. Cormack, both of the University of Waterloo. The full table of contents is online and looks to be really interesting. From the book's description: Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects.

Currently available is the full text of chapters: Introduction, Basic Techniques, Static Inverted Indices, Index Compression, and Parallel Information Retrieval. Parallel Information Retrieval is really meaty:

Click to read more ...