Stuff The Internet Says On Scalability For September 2, 2011

Scale the modern way / No brush / No lather / No rub-in / Big tube 35 cents - Drug stores / HighScalability:

  • 8868 Tweets per second during VMAs; Facebook: 250 million photos uploaded each day; Earth: 7 Billion People Strong
  • Potent quotables:
    • @kevinweil : Wow, 8868 Tweets per second last night during the #VMAs. And that's just the writes -- imagine how many reads we were doing!
    • @tristanbergh : #NoSQL isn't cool, it's a working kludge of existing architectures, bowing to the current tech limits, not transcending them
    • @krishnan : I would love to switch the backend infra to Amazon anytime but our top 20 customers will not allow us 
    • @ianozsvald : Learning about all the horrible things that happen when you don't plan (@socialtiesapp) for scalability. Trying to be creative now...
  • After a particularly difficult Jeopardy match, Watson asked IBM to make him a new cognitive chip so he could continue to kick human butt. The result: a newish chip design that collocates data and computation, with RAM and CPU tightly interconnected. IBM explains. "One core contains 262,144 programmable synapses and the other contains 65,536 learning synapses." The win: lower power usage and better pattern recognition. On HackerNews. Watson is now said to be happy, petting a virtual kitten that never leaves his virtual lap.
  • What do we have here? Spotify, surreptitiously, is P2P? Crack investigation by Frank Catalano in Practical Nerd: The hidden price of “free”, says it's so. [I was] less pleasantly surprised to see that when Spotify wasn’t playing audio, it was using my network connection. A lot.
  • Netflix's Adrian Cockcroft: I come to use clouds, not to build them. Recommends: build an AWS clone that scales. While a good deal for Netflix in that they could play cloud providers off against each other to reduce their costs, trying to compete with Amazon on Amazon's home turf seems a poor game to play.
  • As your waiter, I should warn you, these HotOS XIII papers are burning hot with knowledge and ready to read. Consume at your own risk.
  • StorageMojo puts the smackdown on the idea flash is cheaper than disk. Only after dedupe, compression, using the most expensive disks, and unrealistic capacities.
  • Is the GAE backend too expensive? $100/month for 256MB of RAM, 1.2GHz CPU, plus bandwidth and storage. Plus, with Microsoft smartly taking advantage, Azure is reportedly lowering prices as of October 1st: $0.04/hour for 1GHz and 768MB RAM. Plus, plus, a very funny take by one customer: Optimizing your AppEngine website for the new pricing: How to get from 422 $ per month back to free in a few lines.
  • Think Stats: Probability and Statistics for Programmers by Allen B. Downey. 95% of readers will be within two standard deviations of liking this book.
  • Analyzing Apache Logs with Riak. Simon Buckle with a quick and clean explanation of how to look at logs using Riak, curl for upload, and JSON for queries.
  • HBase and Cassandra on StackOverflow. Andrew Purtell brings it: HBase is proven in production deployments larger than the largest publicly reported Cassandra; HBase supports replication between clusters (i.e. data centers); Cassandra does not have strong consistency in the sense that HBase provides; Cassandra's RandomPartitioner / hash based partitioning means efficient MapReduce or table scanning is not possible, whereas HBase's distributed ordered tree is naturally efficient for such use cases; Cassandra is no less complex than HBase; HBase has substantially more unit tests; The master-slave versus peer-to-peer argument is larger than Cassandra vs. HBase, and not nearly as one sided as claimed. There is no obvious winner; instead, a series of trade-offs. Also an illuminating discussion of what consistency really means. Some contend HBase is more stable, Cassandra being a fickle beast. Tim Less: Cassandra is rife with cascading cluster failure scenarios. I would not recommend running Cassandra in a highly-available high-volume data scenario, but don't hesitate to do so for HBase.
  • Scalable Execution of LIGO Astrophysics Application in a Cloud Computing Environment. Dong Leng shows how to process rippling gravitational waves on the cloud. A nice image.
  • Quora: What is involved in a startup "scaling"? Scaling refers to the period in a startup's life when management and board feels like they can systematically accelerate growth with confidence that the resources they put in will yield great and measurable results. 
  • A good idea: Global Internet Speedup. DNS services and Content Delivery Networks (Google, Bitgravity, CDNetworks, DNS.com and Edgecast) cooperate to bring geo load balancing to the Internets.
  • Scaling up Machine Learning - Parallel and Distributed Approaches. Self-publishing: $5. This book: $90. Dispersing knowledge: 0. 
  • Odersky Explains Shared-Memory Concurrency. Threading is hard, so whaddya goin' to do about it? Martin Odersky, in a brisk 17-minute talk at O'Reilly OSCON, gives you Scala's answer. The first part of the video deftly explains why humans suck at concurrency. mutable state + parallel processing = non-determinism. So remove mutable state, which means program functionally. Threading is programming in time and functional is programming in space. Scala, for parallelism, gives you parallel and distributed collections plus parallel DSLs. For concurrency you get Akka, which is Actors, STM, and futures. Good talk.
  • How would you create a 20 million entry circular queue per user?
  • eBay's Matthias Spycher with an expert level tour through his High-Throughput, Thread-Safe, LRU Caching implementation. 1M lookups/s with up to 3 threads per core (3x12 = 36 -- the concurrency level). Java memory model limits performance.
  • HipG: Parallel Processing of Large-Scale Graphs.  A distributed framework that facilitates programming parallel graph algorithms by composing the parallel application automatically from the user-defined pieces of sequential work on graph nodes.
  • Salesforce is a new player in the hosted database market. Rather than merely offer low level database functionality, they are also going up stack with web services APIs and pre-built social data models and APIs for feeds, user profiles, status updates, and a following model for all database records. Free for 3 users, 100,000 records, 50,000 transactions per month. $10 per month for each set of 100,000 records beyond that and $10 per month for each set of 150,000 transactions.
  • IBM's new transactional memory: make-or-break time for multithreaded revolution. Peter Bright reports BlueGene/Q moves transactional memory into the processor, reducing the overhead experienced by STM. The question, then: is STM magic concurrency dust? Azul's Cliff Click has some doubts. You need to be just as much an STM expert as a lock expert to make it work. Dr. Pizza in the comments makes the programmer ease of use argument:
    • "easier way to program" == "fewer deadlocks and other problems" == "more reliability"
    • "easier way to program" == "greater use of multiple threads and multiple cores" == "more performance"
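
Odersky's equation from the list above (mutable state + parallel processing = non-determinism) suggests the fix: take away the mutable state. What follows is a minimal sketch of that idea in Python rather than Scala, and it is not taken from the talk. The names `chunked` and `parallel_sum_of_squares`, and the `workers` and `chunk_size` parameters, are made up for illustration. Each worker reads only its own immutable chunk and returns a fresh value, so thread scheduling cannot change the result. CPython threads will not make this arithmetic faster; the point is the shape, not the speed.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, size):
    """Split a tuple into chunks of at most `size` items (hypothetical helper)."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(chunk):
    """Pure function: depends only on its argument, touches no shared state."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=4, chunk_size=1000):
    """Sum squares in parallel without any state shared between workers."""
    chunks = chunked(tuple(values), chunk_size)  # tuples are immutable
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields one partial result per chunk, in chunk order.
        partials = pool.map(sum_of_squares, chunks)
        return sum(partials)
```

Because nothing is mutated in place, the answer is the same no matter how the threads interleave, which is exactly the property a shared counter updated from many threads does not give you.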
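
For readers who have not met the idea behind the LRU caching item above, here is a deliberately simple sketch. It is not Matthias Spycher's implementation, which is in Java and built for high throughput. This one just puts a single lock around an ordered map, which is the easy way to get thread safety and also the reason it would never reach 1M lookups/s under contention. The class name `LRUCache` and its methods are assumptions made for this example.

```python
import threading
from collections import OrderedDict


class LRUCache:
    """Tiny thread-safe LRU cache guarded by one lock (illustrative only)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = OrderedDict()  # least recently used first, newest last
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._items:
                return default
            self._items.move_to_end(key)  # a hit makes the key most recent
            return self._items[key]

    def put(self, key, value):
        with self._lock:
            self._items[key] = value
            self._items.move_to_end(key)
            if len(self._items) > self._capacity:
                self._items.popitem(last=False)  # evict least recently used

    def __len__(self):
        with self._lock:
            return len(self._items)
```

Every lookup takes the lock because a hit reorders the map, so reads are really writes. That is the bottleneck a high-throughput design has to engineer around.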
