- Eventual Consistency by Example by Sergio Bossa. Attempts to clear up some misconceptions about eventual consistency as discussed in Amazon's Dynamo paper.
- Boston Big Data Summit keynote outline by Curt Monash. Interesting topics: "Big Data and the cloud actually have relatively little to do with each other" and "The NoSQL movement is a lot like the Ron Paul campaign."
- I think RDBMS has set the industry back by 10 years by Henry G. Baker, Ph.D., from 1992: "I can categorically state that relational databases set the commercial data processing industry back at least ten years and wasted many of the billions of dollars that were spent on data processing." Henry thought OO databases would change things. They didn't. The question is why?
- Intel cloud service tests the scalability of your code. Intel has a cloud-based tool that can test how your application will perform on a number of multicore processor configurations: 1, 2, 4, 8, or 16 hardware threads.
- MapReduce 1, a lecture by Brian Harvey.
- Gear6 has released a software version of their cache product. Interesting departure from the appliance model. Appliances are good because they allow you complete control and something to hang some margin off of. Yet if you want to sell into the cloud you have to build software components, not a hardware solution. Seems like a good idea for those who want a tricked out memcached solution out of the box.
- Hadoop at Twitter (part 1): Splittable LZO Compression. How Twitter is using Hadoop to analyze a tweasure trove of tweets.
- A funny/insightful/sad/truish Dilbert cartoon on how clouds fit into Dilbert's world.
Contributed by Wolfgang Gentzsch:
Now that we have a new computing paradigm, Cloud Computing, how can Clouds help our data? Replace our internal data vaults as we hoped Grids would? Are Grids dead now that we have Clouds? Despite all the promising developments in the Grid and Cloud computing space, and the avalanche of publications and talks on this subject, many people still seem to be confused about internal data and compute resources versus Grids versus Clouds, and they are hesitant to take the next step. I think there are a number of issues driving this uncertainty.
read more at: BigDataMatters.com
You don't even have to make a bid: Randy Shoup, an eBay Distinguished Architect, gives this presentation on how eBay scales, for free. Randy has done a fabulous job in this presentation, and in other talks listed at the end of this post, of getting at the heart of the principles behind scalability. It's more about ideas of how things work and fit together than a focus on a particular technology stack.
In case you weren't sure, eBay is big, with lots of: users, data, features, and change...
- Over 89 million active users worldwide
- 190 million items for sale in 50,000 categories
- Over 8 billion URL requests per day
- Hundreds of new features per quarter
- Roughly 10% of items are listed or ended every day
- In 39 countries and 10 languages
- 70 billion read / write operations / day
- Processes 50TB of new, incremental data per day
- Analyzes 50PB of data per day
Think of building websites as engineering composite materials. A composite material combines two or more materials to create a third material that does something useful the components couldn't do on their own. Composites like reinforced concrete have revolutionized design and construction. When building websites we usually bring different component materials together, as in a composite, to get the features we need, rather than building a completely new thing from scratch that does everything we want.
This approach has been seen as a hack because it leads to inelegancies like data duplication, great gobs of component glue, consistency issues, and messy operations. But what if the composite approach is really a strength: not a hack, but a messy part of the world that needs to be embraced rather than belittled?
The key is to see data as a material. Right now we are arguing over which is the best single material to build with. Is it NoSQL, relational, massively parallel, graph, in-memory, or something else entirely? It all seems a bit crazy. Each material has both limits and capabilities. What we need to build is a composite material that combines the best characteristics of what is available into something better.
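As a toy illustration of treating data stores as composite materials, here is a sketch (in Python, with purely illustrative names) of a read-through cache layered over SQLite: the in-memory dict is fast but volatile, the SQLite table is durable but slower, and the combination has properties neither has alone.

```python
import sqlite3

# Material 1: durable but slower (a relational store)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO items VALUES (1, 'widget')")

# Material 2: fast but volatile (an in-memory cache)
cache = {}

def get_item(item_id):
    if item_id in cache:                         # served from memory
        return cache[item_id]
    row = db.execute(
        "SELECT name FROM items WHERE id = ?", (item_id,)).fetchone()
    if row:
        cache[item_id] = row[0]                  # warm the cache on a miss
        return row[0]
    return None

print(get_item(1))  # first call reads SQLite
print(get_item(1))  # second call hits the cache
```

The inelegancies the composite view accepts are visible even here: the same datum lives in two places, and the glue code is where consistency bugs would hide.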
Jonathan Ellis, in The NoSQL Ecosystem, reviews the origin of the NoSQL movement and 10 different NoSQL products along five dimensions: 1) support for multiple datacenters, 2) the ability to add new machines to a live cluster transparently to your applications, 3) data model, 4) query API, 5) persistence design. The 10 systems reviewed are: Cassandra, CouchDB, HBase, MongoDB, Neo4J, Redis, Riak, Scalaris, Tokyo Cabinet, Voldemort.
A very thorough and thoughtful article on the entire NoSQL space. It's clear from the article that NoSQL is not monolithic; there is a very wide variety of approaches to not being a relational database.
Queuing work for processing in the background is a time-tested scalability strategy. Queuing also happens to be one of those much needed tools that is easy enough to forge on your own, which is why we see so many different versions. Resque is GitHub's take on a job queue, and they've used it to process millions and millions of jobs so far.
What is Resque?
Resque is a Redis-backed library for creating background jobs, placing those jobs on multiple queues, and processing them later. Background jobs can be any Ruby class or module that responds to perform. Your existing classes can easily be converted to background jobs or you can create new classes specifically to do work. Or, you can do both.
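Following Resque's documented convention, a job is just a class with a `@queue` class instance variable and a class-level `perform` method; the `ArchiveJob` class, queue name, and arguments below are illustrative.

```ruby
# A minimal Resque-style job: workers later call self.perform with
# whatever arguments were passed at enqueue time.
class ArchiveJob
  @queue = :archive  # which queue this job type lands on

  def self.perform(repo_id, branch = 'master')
    # stand-in for real work (e.g. archiving a repository)
    "archiving #{repo_id}/#{branch}"
  end
end

# With Redis running and the resque gem loaded, enqueueing looks like:
#   Resque.enqueue(ArchiveJob, 42, 'deploy')
```

Because `perform` is an ordinary class method, jobs are trivially unit-testable without Redis or a worker process.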
GitHub tried and considered many other systems: SQS, Starling, ActiveMessaging, BackgroundJob, DelayedJob, beanstalkd, AMQP, and Kestrel, but found them all wanting in one way or another. The latency for SQS was too high. Others didn't make full use of Ruby. Others still had a lot of overhead. Some didn't have enough features. And still others weren't reliable enough.
NorthScale's Steven Yen, in his highly entertaining NoSQL is a Horseless Carriage presentation, has come up with a NoSQL taxonomy that thankfully focuses a little more on what NoSQL is than on what it isn't:
- key-value-cache: memcached, repcached, coherence, infinispan, eXtreme Scale, JBoss Cache, velocity, terracotta
- key-value-store: keyspace, flare, schema-free, RAMCloud
- eventually-consistent key-value-store: dynamo, voldemort, Dynomite, SubRecord, Mo8onDb, Dovetaildb
- ordered key-value-store: tokyo tyrant, lightcloud, NMDB, luxio, memcachedb, actord
- data-structures server: redis
- tuple store: gigaspaces, coord, apache river
- object database: ZopeDB, db4o, Shoal
- document store: CouchDB, Mongo, Jackrabbit, XML databases, ThruDB, CloudKit, Persevere, Riak (Basho), Scalaris
- wide columnar store: BigTable, HBase, Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
"Who will win?" Steven asks. He answers: the most approachable API with enough power will win. Steven's pick for the contender with the most devastating knockout punch is document stores, because "everyone groks documents." Though he thinks there will be just a few winners, as products converge in functionality.
Steven is banking on the "worse is better" model of dominance, which is hard to argue with, as it has been such a successful adoption pattern in our field. The convergence idea is something I also agree with. What we have now are a lot of features masquerading as products. Over time they will merge together to become more full-featured offerings.
The key question, though, is: what is enough power to win? Just getting a value back for a key won't be enough. Who are you putting your money on?
With so many database options available these days, it's natural to wonder how they all fit together. Amazon complicated, or rather expanded, the available options by introducing RDS, their relational database service. RDS is MySQL safely cocooned as a manageable cloud element, resting boldly within an energy-providing elastic CPU pool, supported by a virtually infinite supply of very capable virtualized storage.
MySQL in AWS is now easy to start, stop, monitor, backup, snapshot, expand, and effortlessly move up and down the instance hierarchy. What it's not, contrary to what you might expect, is a scale-out solution; it's a scale-up solution. You get more by buying a bigger instance, not by horizontally adding more instances. There's a limit. Admittedly a larger limit now with Amazon's new high-memory instances.
That's OK, well maybe not for people who helped grow Amazon's ecosystem by offering a similar product, but so many projects use MySQL that this is a big win for a lot of people. It makes life easier even if the promise of infinite relational database storage is yet to be realized.
If one of the reasons you were considering using a Platform as a Service is to knock the database item off your worry list, RDS is one more reason to consider playing your own general contractor and orchestrating all the elements together yourself. As more services become packaged into cloud capable components this is likely how many systems will be bolted together in the future.
But we are left wondering: how does RDS fit together with SimpleDB and all the other database options?
In many posts, such as The Future of Parallelism and its Challenges, I mentioned that synchronizing access to shared resources is the major challenge in writing parallel code.
Synchronization and coordination take a large share of the overall execution time, which reduces the benefit of parallelism; they also reduce scalability.
There are many forms of synchronization and coordination, such as:
- Creating task objects in frameworks such as Microsoft TPL, Intel TBB, and the Parallel Runtime Library. Creating and enqueuing task objects requires synchronization that takes a long time, especially in recursive work such as the Quicksort algorithm.
- Synchronizing access to shared data.
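The first bullet's cost, spawning a task per recursive call, is commonly tamed with a sequential cutoff: below some problem size, recursion stays in the current thread. Here is a minimal sketch in Python (thread-based, with an illustrative threshold), rather than TPL or TBB:

```python
import threading

CUTOFF = 10_000  # illustrative threshold, not a tuned value

def parallel_sum(data):
    if len(data) <= CUTOFF:       # sequential cutoff: below this the
        return sum(data)          # cost of spawning would swamp the work
    mid = len(data) // 2
    left_result = []
    worker = threading.Thread(
        target=lambda: left_result.append(parallel_sum(data[:mid])))
    worker.start()                    # one new task for the left half
    right = parallel_sum(data[mid:])  # current thread keeps the right half
    worker.join()
    return left_result[0] + right

print(parallel_sum(list(range(100_000))))
```

With the cutoff, a 100,000-element input spawns only a handful of threads instead of one per recursive call, which is the overhead the bullet describes.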
But there are a few techniques that avoid these issues, such as Shared-Nothing, the Actor Model, and Hyperobjects (a.k.a. combinable objects). Simply put, if we reduce the shared data by re-architecting our code, we get huge benefits in performance and scalability.
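A minimal sketch of the Shared-Nothing idea (in Python, with illustrative names): each worker accumulates into its own private total, so no lock guards the hot loop, and the results are merged in a single step at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def count_evens(chunk):
    total = 0                 # private accumulator: no lock needed here
    for n in chunk:
        if n % 2 == 0:
            total += 1
    return total

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]   # partition the work up front

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(count_evens, chunks))

print(sum(partials))  # one merge step replaces per-item locking
```

Contrast this with four workers incrementing one shared counter under a lock: the lock would be contended a million times, while here synchronization happens exactly once, at the merge.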