Stuff The Internet Says On Scalability For March 16, 2012
HighScalability is What We Do:
- 454,400: Number of Amazon servers; 45PB: Facebook Data Warehouse, grows exponentially; 5 Atoms: Ultimate limit of thermodynamics; YouTube: 4 billion views/day, 60 hours of video uploaded every minute, revenue doubled in 2010
- Quotable quotes:
- @adrianco: Walmart labs run large single region Cassandra clusters with Intel SSDs and have been in production for two years. Working well for them.
- @mybellemac: Scalability is a mother. #pinterest
- @fakesigi: Thanks for the correction. I saw cloud computing, scalability and my brain turned off.
- @BVA100: I disagree with "If it ain't broke, don't fix it". We ought to be forward thinkers, concerned with leading indicators and scalability.
- Dilbert on the meaning of it all.
- Cassandra and Solid State Drives. DataStax's Rick Branson with a sweet explanation of how Cassandra was built for a world of spinning disks, which means it only writes sequentially, which turns out to be a good way to use SSDs too.
- Improving Performance by 1000x. Josiah Carlson explains how they went from a slow and expensive follower list storage solution in Redis to custom-built code that, by removing hash tables and shrinking storage overhead, became 1000x faster.
- Big Beautiful List of Cloud Platform (PaaS) systems. If it's not there then it must not exist.
- Adrian Cockcroft with an epic slide deck on all things Netflix on the cloud. Netflix has the most evolved architecture (that we know of) on the cloud and here it all is in one presentation.
- Redis vs Memcached vs Cassandra. Interesting thread on Google Groups. "the only reason why memcached is still in the market is because people have used it", "Cassandra and Redis are pretty far apart in the world of databases; even introducing the comparison is like comparing a BMW M5 to a Land Rover." Also, When you should use MongoDB, What are the underlying data structures used for Redis?
- Where does Big Data meet Big Database. Really good talk by Ben Stopford on the Big Data landscape. The conclusion, pick the right tool for the job, isn't new, but he takes a well thought out path to get there.
- A trio of Facebook: Under the Hood: Building the Location API and Under the Hood: Building Facebook Messenger for Windows and MySQL and Database Engineering: Mark Callaghan.
- A duo of Twitter: Generating Recommendations with MapReduce and Scalding and Cassovary: A Big Graph-Processing Library
- Optimize Performance and Scalability with Parallelism and Concurrency. Bob Hancock with an epic talk on: how the operating system handles your requests; design principles for using concurrency and parallelism to optimize your program's performance and scalability; and processes, threads, generators, coroutines, non-blocking IO, and the gevent library.
- Speeding up Mongoose queries by requesting only the fields you need. Nick Fishman explains why returning a subset of fields yields such a big performance improvement: The problem isn’t so much that MongoDB can’t return the data quickly enough. Rather, Node.js has to spend much of its time parsing extra JSON into JavaScript objects, which is both unnecessary and time-consuming.
- Spark - Lightning-Fast Cluster Computing: Spark provides an abstraction called resilient distributed datasets (RDDs) to support cluster programming applications efficiently. RDDs are stored in memory between queries (as long as enough RAM is available), without requiring replication for fault tolerance.
- Just One Second Delay In Page-Load Can Cause 7% Loss In Customer Conversions
- Fortress - worth a look if you are interested in languages specially designed to program peta-scale supercomputers and distributed systems. Does it have enough worse is better to succeed? Also, How to Think about Parallel Programming: Not!
- The Limitation of MapReduce: A Probing Case and a Lightweight Solution: In this paper, we analyze the limitations of MapReduce and present the design and implementation of a new lightweight parallelization framework, MRlite.
- Rigel Project Tools Released. A 1000-core capable simulator, compiler toolchain, binutils, libraries, and sample benchmarks.
- Characterizing Flash Memory: Anomalies, Observations, and Applications: Despite flash memory’s promise, it suffers from many idiosyncrasies such as limited durability, data integrity problems, and asymmetry in operation granularity.
- Transactional Memory Everywhere: 2012 Update for HTM. Transactional memory isn't for everything, but what is it for: HTM is likely to be at its best for large in-memory data structures that are difficult to statically partition but that are dynamically partitionable, in other words, the conflict probability is reasonably low.
- Ad targeting at Yahoo. Greg Linden with a good review of a paper by Yahoo on their ad targeting. Daily isn't real-time.
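
On the Mongoose item above: the point that the client pays to parse every field it is sent can be seen in a small language-neutral sketch. This is not Mongoose or MongoDB code. The `project` helper, the sample document, and its field names are hypothetical, and JSON stands in for the wire format as it does in the quoted explanation. In Mongoose itself the equivalent step is a field selection on the query.

```python
import json


def project(doc, fields):
    """Return a copy of doc containing only the requested top-level fields.

    Hypothetical helper standing in for a server-side field selection.
    """
    return {name: doc[name] for name in fields if name in doc}


# Hypothetical document: two small fields plus a large one the caller never reads.
full_doc = {
    "_id": 42,
    "name": "example user",
    "activity_log": [{"event": "login", "seq": i} for i in range(1000)],
}

# What would be sent to the client without and with a field selection.
full_payload = json.dumps(full_doc)
slim_payload = json.dumps(project(full_doc, ["_id", "name"]))

# The client only has to parse what it was sent.
slim_doc = json.loads(slim_payload)

print("bytes to parse without selection:", len(full_payload))
print("bytes to parse with selection:", len(slim_payload))
print("parsed result:", slim_doc)
```

The saving scales with how much of each document goes unused, so the effect is largest for wide documents where a view needs only a handful of fields.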