Drop ACID and Think About Data

The abstract for the talk given by Bob Ippolito, co-founder and CTO of Mochi Media, Inc:

Building large systems on top of a traditional single-master RDBMS data storage layer is no longer good enough. This talk explores the landscape of new technologies available today to augment your data layer to improve performance and reliability. Is your application a good fit for caches, bloom filters, bitmap indexes, column stores, distributed key/value stores, or document databases? Learn how they work (in theory and practice) and decide for yourself.
Bob does an excellent job highlighting different products and the key concepts to understand when pondering the wide variety of new database offerings. It's unlikely you'll be able to say "ah, this is the database for me" after watching the presentation, but you will be much better informed about your options. And, I imagine, slightly confused about what to do :-)

An interesting observation in the talk is that the more robust products are either internal to large companies like Amazon and Google or are commercial. Many of the open source products aren't yet considered ready for prime time, and Bob encourages developers to join a project and submit patches rather than start yet another half-finished key-value store clone. From my monitoring of the interwebs this does seem to be happening, and existing products are starting to mature.

Of all the choices discussed, the column database Vertica seems closest to Bob's heart, and it's the product his company uses. It supports clustering, column storage, compression, bitmapped indexes, bloom filters, grids, and lots of other useful features. And most importantly: it works, which is always a plus :-)

Here's a summary of some of the points covered in the presentation:
  • Video Presentation of Drop ACID and Think About Data


  • The claim to fame for relational databases is that they make the ACID promise:
    • Atomicity - a transaction is all or nothing
    • Consistency - only valid data is written to the database
    • Isolation - pretend all transactions are happening serially and the data is correct
    • Durability - what you write is what you get
  • The problem with ACID is that it gives you too much; it trips you up when you are trying to scale a system across multiple nodes.
  • Down time is unacceptable. So your system needs to be reliable. Reliability requires multiple nodes to handle machine failures.
  • To make a scalable system that can handle lots and lots of reads and writes you need many more nodes.
  • Once you try to scale ACID across many machines you hit problems with network failures and delays. The algorithms don't work in a distributed environment at any acceptable speed.


  • If you can't have all of the ACID guarantees, it turns out you can have two of the following three characteristics:
    • Consistency - your data is correct all the time. What you write is what you read.
    • Availability - you can read and write your data all the time.
    • Partition Tolerance - if one or more nodes fail, the system still works, and it becomes consistent when the nodes come back on-line.


  • The types of large systems based on CAP aren't ACID, they are BASE (har har):
    • Basically Available - the system seems to work all the time.
    • Soft State - it doesn't have to be consistent all the time.
    • Eventually Consistent - it becomes consistent at some later time.
  • Everyone who builds big applications builds them on CAP and BASE: Google, Yahoo, Facebook, Amazon, eBay, etc

    Google's BigTable

  • Google BigTable - manages data across many nodes.
  • Paxos (via Chubby) - a distributed consensus algorithm used to manage locks across systems.
  • BigTable characteristics:
    • Stores data in tablets using GFS, a distributed file system.
    • Compression - great gains in throughput, and you can store more. It reduces the IO bottleneck: because you store less, you talk to the disks less, so performance improves.
    • Single master - one node knows everything about all the other nodes (backed up and cached).
    • Hybrid between row and column database:
      • Row database - stores objects together.
      • Column database - stores attributes of objects together. Makes sequential retrieval very fast, allows very efficient compression, and reduces disk seeks and random IO.
    • Versioning.
    • Bloom filters - allow data to be distributed across a bunch of nodes. A calculation on the data probabilistically maps it to the nodes it can be found on.
    • Eventually consistent - an append-only system using a row time stamp. When a client queries, it gets several versions and is in charge of picking the most recent.
  • Pros:
    • Compression is awesome. Bob really thinks compression is an important attribute of a system.
    • Clients are probably simple.
    • Integrates with map-reduce.
  • Cons:
    • Proprietary to Google - you can't use it on your system.
    • Single master - could be a downside, but it's not clear.

    Amazon's Dynamo

  • A giant distributed hash table, called a key-value store.
  • Uses consistent hashing to distribute data to one or more nodes for redundancy and performance.
    • Consistent hashing - a ring of nodes; a hash function picks which node(s) store each piece of data.
    • Consistency between nodes is based on vector clocks and read repair.
    • Vector clocks - a time stamp on every row for every node that has written to it.
    • Read repair - when a client does a read and the nodes disagree on the data, it's up to the client to select the correct version and tell the nodes the new correct state.
  • Pros:
    • No master - eliminates a single point of failure.
    • Highly available for writes - this is the partition tolerance aspect of CAP. You can write to many nodes at once, so depending on the number of replicas maintained (which is configurable) you should always be able to write somewhere. Users will never see a write failure.
    • Relatively simple, which is why we see so many clones.
  • Cons:
    • Proprietary.
    • Clients have to be smart enough to handle read repair, rebalancing a cluster, hashing, etc. Client proxies can take on these responsibilities, but that adds another hop.
    • No compression, so IO isn't reduced.
    • Not suitable for column-like workloads. It's just a key-value store, so it's not optimized for analytics. Aggregate queries, for example, aren't in its wheelhouse.
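The consistent-hashing scheme above can be sketched in a few lines. This is a toy illustration, not Amazon's implementation: the node and key names are made up, and real Dynamo adds virtual nodes, vector clocks, and hinted handoff on top.

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    # Map a key onto the ring with MD5; any uniformly distributed hash works.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: walk clockwise from a key's hash and
    collect the first `replicas` distinct nodes."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((_hash(n), n) for n in nodes)

    def nodes_for(self, key):
        points = [h for h, _ in self.ring]
        i = bisect(points, _hash(key)) % len(self.ring)
        picked = []
        while len(picked) < min(self.replicas, len(self.ring)):
            node = self.ring[i][1]
            if node not in picked:
                picked.append(node)
            i = (i + 1) % len(self.ring)
        return picked
```

The point of the ring layout is that adding or removing one node only moves the keys on the arc next to it, rather than rehashing everything.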

    Facebook's Cassandra

  • Peer-to-peer, so there's no master, like Dynamo.
  • Storage model is more like BigTable's.
  • Pros:
    • Open source. You can use it.
    • Incrementally scalable - as data grows you can add more nodes to the storage mesh.
    • Minimal administration - because it scales incrementally you don't have to do a lot of up-front planning for migration.
  • Cons:
    • Not polished yet. It was built for inbox search, so it may not work well for other use cases.
    • No compression yet.

    Distributed Database Musings

  • Distributed databases are the new web framework.
  • None are awesome yet. No obvious winners.
  • There are many clones with only partial features implemented. Project Voldemort, for example, lacks rebalancing and garbage collection.
  • Pick one and start submitting patches. Don't start another half-baked clone.

    Simple Key-Value Store

  • Some people are using simple key-value stores to replace relational databases.
  • A key (array of bytes) maps using a hash to a value (a BLOB). It's like an associative array.
  • They are really fast and simple.
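The bullets above describe, in essence, a hash table. A minimal sketch, with Python's dict (itself a hash table) standing in and illustrative key names:

```python
class KVStore:
    """A key-value store in miniature: opaque byte keys map to opaque
    byte values (BLOBs). The store knows nothing about what's inside."""

    def __init__(self):
        self._table = {}

    def put(self, key: bytes, value: bytes):
        self._table[key] = value

    def get(self, key: bytes):
        # Returns None on a miss; values are returned as-is, uninterpreted.
        return self._table.get(key)
```

Because the values are opaque, anything beyond get/put (queries, aggregates, partial updates) has to happen in the client.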

    Memcached Key-Value Stores

  • Memcached is a key-value store that people use as a cache.
  • No persistence.
  • RAM only.
  • LRU eviction, so it throws data away on purpose when there's too much of it.
  • Lightning fast.
  • Everyone uses it, so it's well supported.
  • A good first strategy for removing load from the database.
  • Dealing with mutable data in a cache is really hard. Adding a cache to an ACID system is something you'll probably get wrong, and it's difficult to debug because it does away with several ACID properties:
    • Isolation is gone with multiple writers. Everyone sees the currently written value, whereas in a database you see a consistent view of the data.
    • When a transaction fails, the cache may reflect the new data even though it has been rolled back in the database.
    • Dependent cache keys are difficult to program correctly because they aren't transactional. Consistency is impossible to keep: update one key, and what happens to the dependent keys?
    • It's complicated, you'll get it wrong, and you'll lose some of the consistency the database had given you.
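The usual cache-aside pattern can be sketched with plain dicts standing in for memcached and the database (the class and key names are invented; a real deployment would also set TTLs, and it still suffers the race conditions and rollback hazards described above):

```python
class CacheAside:
    """Cache-aside: read through the cache, delete the cache entry on
    write. Deleting rather than updating avoids one class of stale-data
    races between concurrent writers."""

    def __init__(self):
        self.cache = {}    # stands in for memcached
        self.db = {}       # stands in for the RDBMS
        self.db_reads = 0  # instrumentation, to show the cache working

    def read(self, key):
        if key in self.cache:
            return self.cache[key]        # cache hit
        self.db_reads += 1
        value = self.db.get(key)
        self.cache[key] = value           # populate on miss
        return value

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)         # invalidate, don't update
```

Note what's missing: if the database write were part of a transaction that later rolled back, nothing here would undo the invalidation or a racing repopulation, which is exactly the debugging pain the talk warns about.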

    Tokyo Cabinet/Tyrant Key-Value Store

  • Similar use cases as BerkeleyDB.
  • Disk persistence. Can store data larger than RAM.
  • Performs well.
  • Actively developed. Lots of developers adding new features (but not bug fixes).
  • Similar replication strategy to MySQL. Not useful for scalability, as it limits write throughput to one node.
  • Optional compressed pages, so it has some compression advantages.

    Redis Data Structure Store

  • Very new.
  • It's a data structure store not a key-value store, which means it understands your values so you can operate on them. Typically in a key-value store the values are opaque.
  • Can match on key spaces. You can look for all keys that match an expression.
  • Understands lists and sets, so you can do list and set operations in the database server's process. This is much more efficient because all the data doesn't have to be paged out to the client, and updates can be done atomically on the server side, which is difficult to do on the client side.
  • Big downside: the full data set must fit in RAM. It can't serve data from disk.
  • It is disk backed, so it survives a reboot, but it is still RAM limited.
  • Maybe useful as a cache that supports higher-level operations than memcached.
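To illustrate why server-side data structures help, here is a toy in-process store with Redis-like set and list operations. The method names mimic real Redis commands (SADD, SINTER, RPUSH, LRANGE), but this is a sketch of the idea, not the Redis API:

```python
class StructureStore:
    """Toy data-structure store: values are typed (sets, lists) and the
    server performs the operations itself, so an intersection of two big
    sets never has to be shipped to the client element by element."""

    def __init__(self):
        self._data = {}

    def sadd(self, key, *members):
        self._data.setdefault(key, set()).update(members)

    def sinter(self, *keys):
        # Server-side set intersection across any number of keys.
        sets = [self._data.get(k, set()) for k in keys]
        return set.intersection(*sets) if sets else set()

    def rpush(self, key, *values):
        self._data.setdefault(key, []).extend(values)

    def lrange(self, key, start, stop):
        # stop is inclusive, and -1 means "through the end", as in Redis.
        lst = self._data.get(key, [])
        end = None if stop == -1 else stop + 1
        return lst[start:end]
```

In a plain key-value store the client would have to fetch both sets, intersect them locally, and write the result back, with no atomicity along the way.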

    Document Databases

  • Schema-free. You don't have to say which attributes are in the values that are stored.
  • Makes schema migration easier because it doesn't care what fields you have. Applications must be coded to handle the different versions of documents.
  • Great for storing documents.
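As the bullets note, version handling in a schema-free store lands in application code. A minimal sketch of what that looks like, with hypothetical field names:

```python
def display_name(doc: dict) -> str:
    """Schema migration, document-database style: old documents keep
    their old shape, and the application handles every shape it may see.
    Here, hypothetically, v2 documents split the name into two fields
    while v1 documents carry a single "name" field."""
    if "first_name" in doc:                      # v2 document
        return f'{doc["first_name"]} {doc["last_name"]}'
    return doc.get("name", "unknown")            # v1 document, or neither
```

The store itself never complained about either shape; the flexibility is paid for with branches like this wherever documents are read.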

    CouchDB Document Database

  • Apache project so you can use it.
  • Written in Erlang.
  • Asynchronous replication. Still limited to the write speed of one node.
  • JSON based so easy to use on the web.
  • Queries are done in a map-reduce style using views.
    • A view is created by writing a JavaScript function that is applied to all documents in the document store. This creates a matching list of documents.
    • Views are materialized on demand. Once you hit a view, the result list is saved until an update occurs.
  • Neat admin UI.
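The on-demand materialization described above can be sketched in-process. This is a simplified model, not CouchDB's engine: real CouchDB runs JavaScript map functions and updates views incrementally, whereas this sketch just rebuilds on the next query after any change.

```python
class ViewStore:
    """Sketch of CouchDB-style views: a map function is applied to every
    document, the sorted rows are cached, and the cache is dropped when
    the store changes, so the view is rematerialized on the next query."""

    def __init__(self):
        self.docs = {}
        self._views = {}  # name -> (map_fn, cached rows or None)

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        for name in self._views:  # invalidate every view on update
            self._views[name] = (self._views[name][0], None)

    def define_view(self, name, map_fn):
        self._views[name] = (map_fn, None)

    def query(self, name):
        map_fn, rows = self._views[name]
        if rows is None:  # materialize on demand
            rows = []
            for doc_id, doc in self.docs.items():
                rows.extend(map_fn(doc_id, doc))
            rows.sort(key=lambda r: r[0])
            self._views[name] = (map_fn, rows)
        return rows
```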

    MongoDB Document Database

  • Written in C++
  • Significantly faster than CouchDB.
  • JSON and BSON (binary JSON-ish) formats.
  • Asynchronous replication with auto-sharding coming soon.
  • Supports indexes. Querying a property is quick because an index is automatically kept on updates. Trades off some write speed for more consistent read speed.
  • Documents can be nested, unlike in CouchDB, which requires applications to manage relationships themselves. The advantage is that the whole object doesn't have to be written and read, because the system knows about the relationship. An example is a blog post and its comments: in CouchDB the post and comments are stored together, and a view must walk through all the comments even when you are only interested in the post. Nesting gives better write and query performance.
  • More advanced queries than CouchDB.

    Column Databases

  • Some of this model is implemented by BigTable, Cassandra, and HyperTable.
  • Sequential reads are fast because data in a column is stored together.
  • Columns compress better than rows because the data is similar.
  • Each column is stored separately so IO is efficient as only the columns of interest are scanned. When using column database you are almost always scanning the entire column.
  • Bitmap indexes for fast sequential scans:
    • Turn cell values into 1 or more bits.
    • Compression reduces IO even further.
    • Indexes can be logically ANDed and ORed together to know which rows to select.
    • Used in big queries for performing joins of multiple tables. When a row is represented by 2 bits (for example) there's a lot less IO than working on uncompressed, unbitmapped values.
  • Bloom filters:
    • Used by BigTable, Cassandra, and other projects.
    • A probabilistic data structure.
    • Lossy, so you can lose data.
    • In exchange for losing data you can store all the information in constant space.
    • Gives you false positives at a known error rate.
    • Store a bloom filter for a bunch of nodes; Squid uses this for its cache protocol. Knowing a bloom filter, you can locally compute which nodes the data may be on. If a bit is not set in the filter the data isn't on the node; if a bit is set, it may or may not be there.
    • Used for finding stuff and approximating counts in constant space.
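A minimal Bloom filter illustrating the tradeoff above: membership tests can give false positives at a known rate, but never false negatives. The sizes m and k below are arbitrary choices for the sketch.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: each item sets k bit positions in an m-bit
    array. An item is reported present only if all k of its bits are set,
    so misses can be declared with certainty while hits are probabilistic."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # one big int as the bit array

    def _positions(self, item: str):
        # Derive k positions by salting one hash function k ways.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))
```

This is how a node-location filter works in the Squid-style scheme above: a clear bit proves "not here", while a set bit only suggests "maybe here".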

    MonetDB Column Database

  • Research project which crashes a lot and corrupts your data.

    LucidDB Column Database

  • Java/C++ open source data warehouse
  • No clustering so only single node performance, but that can be enough for the applications column stores are good at.
  • No experience so can't speak to it.

    Vertica Column Database

  • The product Bob's company uses.
  • Commercial column store based on C-store.
  • Clustered
  • Actually works.

    Related Articles

  • Availability & Consistency by Werner Vogels
  • BASE: An Acid Alternative by Dan Pritchett
  • MongoDB - a high-performance, open source, schema-free document-oriented data store that's easy to deploy, manage and use.
  • Vertica - blazing-fast data warehousing software
  • LucidDB - the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence.
  • CouchDB - a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.
  • Memcached Tag at High Scalability
  • Key-Value Store Tag at High Scalability
  • BigTable Tag at High Scalability
  • Dynamo.




    The GigaOM Network today announces its second Structure conference after the runaway success of the 2008 event. The Structure 09 conference returns to San Francisco, Calif., on June 25th, 2009. Structure 09 is a conference designed to explore the next generations of Internet infrastructure. Over a year ago, The GigaOM Network Founder Om Malik saw that the platforms on which we have done business for over a decade were starting to provide diminishing returns, and smart money was seeking new options. Structure 09 looks at the changing needs and rapid growth in the Internet infrastructure sector, and this year's event will consider the impact of the global economy.

    "I cannot remember a time when a new technology had so much relevance to our industry as cloud computing does in the current economic climate," said Malik. "We all need to find ways to leverage what we have and cut costs without compromising future options. Infrastructure on demand and cloud computing are very strong avenues for doing so, and we will look for what practicable advice we can bring to our audience."

    "Structure 08 was a great experience for our audience and partners, and I am very pleased to be bringing it back again this year," said Malik. "Along with GigaOM Lead Writer Stacey Higginbotham and the conference program committee, I am bringing together what I intend to be one of the most authoritative programs for the cloud computing and Internet infrastructure space."

    The GigaOM Network is also announcing early speaker selections. Confirmed speakers include:

  • Marc Benioff, Chairman and CEO
  • Paul Sagan, President and CEO, Akamai
  • Werner Vogels, CTO, Amazon
  • Russ Daniels, VP and CTO, Cloud Services Strategy, HP
  • Raj Patel, VP of Global Networks, Yahoo!
  • Jonathan Heiliger, VP, Technical Operations, Facebook
  • Greg Papadopoulos, CTO and EVP, Research and Development, Sun Microsystems
  • Jack Waters, President, Global Network Services and CTO, Level 3 Communications
  • Michael Stonebraker, PhD, CTO and Co-Founder, Vertica Systems
  • David Yen, EVP and GM, Data Center Business Group, Juniper Networks
  • Vijay Gill, VP Engineering, Google
  • Yousef Khalidi, Distinguished Engineer, Microsoft Corporation
  • Tobias Ford, Assistant VP, IT, AT&T
  • Richard Buckingham, VP of Technical Operations, MySpace
  • Lew Tucker, VP and CTO, Cloud Computing, Sun Microsystems
  • Lloyd Taylor, VP Technical Operations, LinkedIn
  • Michael Crandell, CEO and Founder, RightScale
  • Jim Smith, General Partner, MDV-Mohr Davidow Ventures
  • Bryan Doerr, CTO, Savvis
  • Doug Judd, Principal Search Architect, Zvents
  • Brandon Watson, Director, Azure Services Platform, Microsoft
  • Jeff Hammerbacher, Chief Scientist, Cloudera
  • Jason Hoffman, PhD, CTO, Joyent
  • Mayank Bawa, CEO, Aster Data
  • James Urquhart, Market Manager, Cloud Computing and Infrastructure, Cisco Systems
  • Kevin Efrusy, General Partner, Accel
  • Lew Moorman, CEO and Founder, Rackspace
  • Joe Weinman, Strategy and Business Development VP, AT&T Business Solutions
  • Peter Fenton, General Partner, Benchmark Capital
  • David Hitz, Founder and Executive Vice President, NetApp
  • James Lindenbaum, Co-Founder and CEO, Heroku
  • Joseph Tobolski, Director of Cloud Computing, Accenture
  • Steve Herrod, CTO and Sr. VP of R&D, VMware

    Further details can be found at the Structure 09 website. High Scalability readers can register with a $50 discount.



    FastBit: An Efficient Compressed Bitmap Index Technology

    Data mining and fast queries are always in that bin of hard-to-do things where doing something smarter can yield big results. Bloom filters are one such do-it-smarter strategy; compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric: our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database, FastBit outran MySQL by 10 to 1,000 times.

    FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing methods, bitmap indices are superior because they can be efficiently combined to answer multi-dimensional queries whereas other optimal methods can not.
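The uncompressed core of the bitmap-index idea can be shown in a few lines; FastBit's contribution is the word-aligned compression layered on top, which this sketch omits (the column and value names below are invented):

```python
def bitmap_index(rows, column):
    """Build one bitmap per distinct value of `column`: bit i is set
    when row i holds that value. Bitmaps are plain Python ints here."""
    bitmaps = {}
    for i, row in enumerate(rows):
        v = row[column]
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

def select(bitmap, n):
    """Row positions whose bit is set - the fast sequential scan."""
    return [i for i in range(n) if bitmap >> i & 1]
```

The payoff is that a multi-dimensional predicate like `dept = 'eng' AND city = 'SF'` becomes a single bitwise AND of two bitmaps, which is exactly the property the FastBit description credits for beating other optimal indexes on multi-dimensional queries.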
    It's not all just map-reduce and add more servers until your attic is full.

    Related Articles

  • FastBit: Digging through databases faster. An excellent description of how FastBit works, especially compared to b-trees.



    Presentations: MySQL Conference & Expo 2009

    The presentations from the MySQL Conference & Expo 2009, held April 20-23 in Santa Clara, are available at the above link.

    They include:

    • Beginner's Guide to Website Performance with MySQL and memcached by Adam Donnison

    • Calpont: Open Source Columnar Storage Engine for Scalable MySQL DW by Jim Tommaney

    • Creating Quick and Powerful Web Applications with MySQL, GlassFish, and NetBeans by Arun Gupta

    • Deep-inspecting MySQL with DTrace by Domas Mituzas

    • Distributed Innodb Caching with memcached by Matthew Yonkovit and Yves Trudeau

    • Improving Performance by Running MySQL Multiple Times by MC Brown

    • Introduction to Using DTrace with MySQL by Vince Carbone

    • MySQL Cluster 7.0 - New Features by Johan Andersson

    • Optimizing MySQL Performance with ZFS by Allan Packer

    • SAN Performance on a Internal Disk Budget: The Coming Solid State Disk Revolution by Matthew Yonkovit

    • This is Not a Web App: The Evolution of a MySQL Deployment at Google by Mark Callaghan

    How to choose and build the perfect server

    There are a lot of questions about server components, and how to choose and/or build the perfect server while considering power consumption. So I decided to write about this topic.

    Key Points:

    • What kinds of components a server needs

    • Green computing and server components

    • How much power the server consumes

    • Choosing the right components: processors, HDD, RAID, memory

    • Build a server, or buy one?

    Some Questions from a newbie

    Hello highscalability world. I just discovered this site yesterday in a search for a scalability resource and was very pleased to find such useful information. I have some questions regarding distributed caching that I was hoping the scalability intelligentsia trafficking this forum could answer. I apologize for my lack of technical knowledge; I'm hoping this site will increase said knowledge! Feel free to answer all or as much as you want. Thank you in advance for your responses and thank you for a great resource!

    1. What are the standard benchmarks used to measure the performance of memcached, or of MySQL and memcached working together (from web 2.0 companies etc.)?

    2. The little research I've conducted on this site suggests that most web 2.0 companies use a combination of MySQL and a hacked memcached (and potentially sharding). Does anyone know if any of these companies use an enterprise vendor for their distributed caching layer? (At this point I've only heard of Jive Software using Coherence.)

    3. In terms of a web 2.0 oriented startup, what are the database/distributed caching requirements typically needed to get off the ground and grow at a fairly rapid pace?

    4. Given the major players in the web 2.0 industry (Facebook, Twitter, MySpace, PoF, Flickr, etc.; I'm ignoring Google/Amazon here because they have a proprietary caching layer), what is the most common, scalable back-end setup (MySQL/memcached/sharding etc.)? What are its limitations/problems? What features does said setup lack that it really needs?

    Thank you so much for your insight!



    Map-Reduce for Machine Learning on Multicore

    We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up.
    In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time.
    Specifically, we show that algorithms that fit the Statistical Query model can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.
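The summation form the paper describes can be illustrated with 1-D least squares: each chunk (one per core) reduces its data to a handful of sufficient statistics, and combining chunks is just addition. This sketch runs the "mappers" serially for clarity; a real implementation would farm the chunks out across cores.

```python
def partial_sums(chunk):
    """Mapper: reduce a chunk of (x, y) pairs to its sufficient statistics."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def fit_line(chunks):
    """Reducer: add the per-chunk sums componentwise, then solve the
    1-D least-squares normal equations from the combined totals."""
    parts = [partial_sums(c) for c in chunks]
    n, sx, sy, sxx, sxy = (sum(t) for t in zip(*parts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

Because the sums commute and associate, the answer is the same no matter how the data is split, which is what makes the speedup close to linear in the number of cores.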

    Read more about this study here (PDF - you can also download it)



    Scale-up vs. Scale-out: A Case Study by IBM using Nutch/Lucene

    Scale-up solutions in the form of large SMPs have represented the mainstream of commercial computing for the past several years. The major server vendors continue to provide increasingly larger and more powerful machines. More recently, scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance for commercial computing.
    Scale-out solutions are particularly effective in high-throughput web-centric applications. In this paper, we investigate the behavior of two competing approaches to parallelism, scale-up and scale-out, in an emerging search application. Our conclusions show that a scale-out strategy can be the key to good performance even on a scale-up machine.
    Furthermore, scale-out solutions offer better price/performance, although at an increase in management complexity.

    Read more about scaling out/up and about the conclusions here (PDF - you can also download it)



    Poem: Partly Cloudy

    As any reader of this site knows we're huge huge supporters of the arts. To continue that theme here's a visionary poem by Mason Hale. Few have reached for inspiration and found their muse in the emotional maelstrom that is cloud computing, but Mason has and the results speak for themselves:

    Partly Cloudy

    We have a dream
    A vision
    An aspiration
    To compute in the cloud
    To pay as we go
    To drink by the sip
    To add cores at our whim
    To write to disks with no end
    To scale up with demand
    And scale down when it ends

    Elasticity
    Scalability
    Redundancy
    Computing as a utility
    This is our dream
    Becoming reality

    But…
    There’s a hitch.
    There’s a bump in the road
    There’s a twist in the path
    There’s a detour ahead on the way to achieving our goal

    It’s the Database
    Our old friend
    He is set in his ways
    He deals in transactions to keep things consistent
    He maintains the integrity of all his relations
    He eats disks for breakfast
    He hungers for RAM
    He loves queries and joins, and gives each one a plan
    He likes his schemas normal and strict
    His changes are atomic
    That is his schtick

    He’s an old friend as I said
    We all know him well
    So it pains me to say that in this new-fangled cloud
    He doesn’t quite fit

    Don’t get me wrong, our friend can scale as high as you want
    But there’s a price to be paid
    That expands as you grow
    The cost is complexity
    It’s more things to maintain
    More things that can go wrong
    More ways to inflict pain
    On the poor DBA who cares for our friend
    The one who backs him up and, if he dies, restores him again

    I love our old friend
    I know you do too
    But it is time for us all to own up to the fact
    That putting him into the cloud
    Taking him out of the rack
    Just causes us both more pain and more woe

    So…
    It’s time to move on
    Time to learn some new tricks
    Time to explore a new world that is less ACIDic
    It’s time to meet some new friends
    Those who were born in the cloud
    Who are still growing up
    Still figuring things out

    There’s Google’s BigTable and Werner’s SimpleDB
    There’s Hive and HBase and Mongo and Couch
    There’s Cassandra and Drizzle
    And not to be left out
    There’s Vertica and Aster if you want to spend for support
    There’s a Tokyo Cabinet and something called Redis I’m told

    It’s a party, a playgroup of newborn DB’s
    They scale and expand, they re-partition with ease
    They are new and exciting
    And still flawed to be sure
    But they’ll learn and improve, grow and mature

    They are our future
    We developers should take heed
    If our databases can change, then maybe
    Just maybe
    So can we



    INFOSCALE 2009 in June in Hong Kong

    In case you are interested, here's the info: INFOSCALE 2009: The 4th International ICST Conference on Scalable Information Systems, 10-12 June 2009, Hong Kong, China. In the last few years we have seen the proliferation of heterogeneous distributed systems, ranging from simple networks of workstations to highly complex grid computing environments. Such computational paradigms have been preferred due to their reduced costs and inherent scalability, but they pose many challenges to scalable systems and applications in terms of information access, storage and retrieval. Grid computing, P2P technology, data and knowledge bases, distributed information retrieval technology and networking technology should all converge to address the scalability concern. Furthermore, with the advent of emerging computing architectures - e.g. SMTs, GPUs, multicores - the importance of designing techniques explicitly targeting these systems is growing. INFOSCALE 2009 will focus on a wide array of scalability issues and investigate new approaches to tackle problems arising from the ever-growing size and complexity of information of all kinds. For further information visit
