Stuff The Internet Says On Scalability For November 4, 2011

You're in good hands with HighScalability:

  • Netflix - Cassandra, AWS, 288 instances, 3.3 million writes per second.
  • Quotable quotes:
    • @bretlowery : "A #DBA walks into a #NoSQL bar, but turns and leaves because he couldn't find a table."
    • @AdanVali : HP to Deploy Memristor Powered SSD Replacement Within 18 Months
    • @eden : Ori Lahav: "When planning scalability, think x100, design x5 and deploy x1.5 of current traffic"
    • @jkalucki : If you are IO bound, start with your checkbook!
  • Everything I Ever Learned About JVM Performance Tuning @Twitter. Learn how to tune your HotSpot and other Javasutra secrets.
  • By moving off the cloud Mixpanel may have lost their angel status. Why would they do such a thing? Read Why We Moved Off The Cloud for the details. The reason for the fall: highly variable performance. Highly variable performance is incredibly hard to code or design around (think a server that normally does 300 queries per second with low I/O wait suddenly dropping to 50 queries per second at 100% disk utilization for literally hours). It’s solvable, certainly, but with lots of time and money and it’s hard to justify the cost when there’s a better alternative available. On reddit. On Hacker News. Is that a bell I hear?
  • Neo4j's Emil Eifrém on the state of NOSQL today. Emil declares, using graph theory, that the future will happen after the present and that the future of NoSQL is good. Three trends: ACID dissolves BASE, more query languages (no standard), richer schemas instead of schema free only. In short: NoSQL more like SQL. Challenges: getting the word out, tool support, middleware support. One size does not fit all. This is a developer based revolution. Key-value stores are dead.
  • Instagram on Storing hundreds of millions of simple key-value pairs in Redis to map 300 million photos back to the user IDs. 
  • Evernote shares more on their indexing system for image recognition. Hardware: cluster of 37 well appointed nodes; OS: Debian; Software: in-house software for queue handling and image processing, along with a set of image recognition engines to handle various types of text. 
  • Krishna Sankar in BigData Counts summarizes what big systems are doing today: 200 million tweets per day; Teradata (eBay): 84 PB capacity, 250 nodes; S3: 600 billion objects. He also wrote a nice summary of the High Performance Transaction Systems Workshop.
  • We need more bandwidth captain! UCSD Study: Not Enough Bandwidth for an 'Internet of Things'. Instead of telcos grinding out profit from the current infrastructure, how about we make more?
  • Shy about using the public cloud? Kiip has a good discussion of why moving to the virtual private cloud on AWS may be for you: virtual private networks give you control over things such as routing tables, DHCP option sets, and more.
  • What hardware should you use for Cassandra? Helpful discussion on this Google Groups thread.
  • As Stack Exchange grows it is considering virtualization. Why would the kings of bare metal be pondering the imponderable? Virtual clusters: The idea essentially is that you have a rack of commodity machines with many VMs per machine and still have the ability to do live migration. Using DRBD (think RAID 1 across multiple machines) allows for features like live migration without shared storage.
  • Some good stuff from Curt Monash: NoSQL notes, Transparent relational OLTP scale-out, More notes on Oracle NoSQL, Nested data structures, Text data management.
  • From “Overnight” to “Real-time”: A Two-Year NoSQL Case Study by Benjamin Anderson. The transition to Cloudant enabled 10x growth and allowed us to open our technology to a much broader range of applications, though not without some bumps along the way.
  • James Hamilton on how Software Defined Networking Has Come of Age. Now it can drink in public without sneaking behind hardware's back. Software Defined Networks (SDN) will: 1) empower network owners/operators, 2) increase the pace of network innovation, 3) diversify the supply chain, and 4) build a robust foundation for future networking innovation.
  • Cool infographic on America’s largest data centers. It takes 11 diesel generators to power Microsoft’s 700,000-square-foot data center in Chicago, which stores data for XBox Live, the company’s Bing search engine, its email service Hotmail and over 200 other sites. The QTS Metro data center in Atlanta, Georgia, which stores data for Twitter’s 100 million active users, takes 19 diesel generators.
  • G+/Cloud-Infrastructure is a community for answering cloud questions that you may want to look at.
  • Randy Bias on Is Open Compute Ready for Prime Time? The Open Compute Project is an attempt to ‘open source’ hardware design and specifications, in much the way that software is open sourced. I’d like to call for those involved with OCP to start thinking about not just ‘open compute’, but ‘practical compute’. We have customers now who have many thousands of square feet of datacenter space that is a sunk cost. These facilities are largely paid for and can be retrofitted to some degree, but there are limits. DC space is 4-5% of overall costs, which is not insignificant, especially if that facility is already paid for.
  • Graphene-based transistors may be on the horizon by Chris Lee. Who cares? Paul Eccles answers: They could be massively faster (like 100x), with much lower power consumption, as well as significantly miniaturised. Pretty exciting.
  • What I learned hanging out at the vascular surgery conference. I can’t overstate how important I think it is for us to branch out and learn about more than just the jobs we do every day.
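
A couple of the numbers above lend themselves to back-of-the-envelope arithmetic. The sketch below is only an illustration: it averages the Netflix write rate over the 288 instances (assuming an even spread, which the item does not claim) and applies Ori Lahav's "think x100, design x5, deploy x1.5" rule to a made-up current load. The function names and the 1,000 requests-per-second figure are hypothetical and not taken from any of the linked posts.

```python
def per_instance_rate(total_per_second, instances):
    """Average rate per instance, assuming load is spread evenly (an assumption)."""
    return total_per_second / instances


def capacity_targets(current, think=100, design=5, deploy=1.5):
    """Apply the 'think x100, design x5, deploy x1.5' rule to a current load."""
    return {
        "think": current * think,
        "design": current * design,
        "deploy": current * deploy,
    }


if __name__ == "__main__":
    # Netflix item: 3.3 million writes per second across 288 instances.
    print(round(per_instance_rate(3_300_000, 288)))

    # Hypothetical current traffic of 1,000 requests per second.
    print(capacity_targets(1000))
```

The per-instance figure is an average only; real clusters rarely spread load perfectly evenly, so treat it as a rough sizing number and not a benchmark result.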