Stuff The Internet Says On Scalability For December 14, 2012

In a hole in the Internet there lived HighScalability:

  • $140 Billion: trivial cost of Google fiber everywhere; 5,200 GB: data for every person on Earth; 6 hours: time it takes a 25-GPU cluster to crack any standard 8-character Windows password.
  • Quotable Quotes:
    • hnriot: Good architecture eliminates the need for prayer.
    • @adrianco: we break AWS, they fix it. Stuff that's breaking now is mostly stuff other clouds haven't got to yet.
    • Scalability Rules: design for 20x capacity, implement for 3x capacity, deploy for ~1.5x capacity.
  • Fast typing Aaron Delp with his AWS re:Invent Werner Vogels Keynote Live Blog. Some key points: Decompose into small, loosely coupled, stateless building blocks; Automate your application and processes; Let business levers control the system; Architect with cost in mind; Protecting your customer is the first priority; In production, deploy to at least two availability zones; Integrate security into your application from the ground up; Build, test, integrate, and deploy continuously; Don't think in single failures; Assume nothing.
  • Benjamin Black has put together a thoughtful technology operations reading list. Whoops, I'm only 8 out of 18.
  • Billions (of API requests) Served. Evernote stores hundreds of terabytes of data across 36.8 million accounts, 1.2 billion notes, and 2 billion attachments, and handles 200 million events per day. This article looks at their analytics architecture. In 2008 they started with MySQL and 300 shard servers; every night they would incrementally dump log data into a star schema partitioned by week. In 2012 they moved to three 10-node Hadoop clusters and ParAccel.
  • Will slime mould innovation ever end? Physicists Use Electrical Signals From Slime Mould to Make Music.
  • Four Steps to Achieving High Availability in the Cloud. Excellent write-up by Brian Adler: build for server failure, build for zone failure, build for cloud failure, and automate and test everything.
  • The Era of Cognitive Systems: An Inside Look at IBM Watson and How it Works. Good look at the steps Watson takes when answering a question. Liked this: We are at the beginning of a new era of computing, one that is less precise, but much more accurate.
  • Just how big is BigData? So big we are running out of numbers big enough to describe it. After the yottabyte will we see the hellabyte?
  • Anatomy of a Solid-state Drive. Awesome deep dive into all things SSD by Michael Cornwell. If you've wondered what it's all about then read this.
  • James Hamilton finds the Microserver Market is heating up: It all comes down to pricing. The server market generally gets hundreds of dollars per processor, prices to the performance of the part, and currently sees very little competition. The client device market generally gets tens of dollars per processor, and the competition is amazingly high.
  • Twitter with a lot of details on Blobstore, their new photo storage system. Looks like a solid architecture anyone could use. And here's how Facebook does it.
  • A free book on a hot topic is always interesting. Think Bayes is such a book. Written by Allen Downey, author of some other really good Think books, it's worth a gander. Given that you've read this entry I've updated the probability you'll get this book to 101%.
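    A minimal sketch of the Bayesian update being joked about, with made-up numbers (the hypothesis and every probability below are illustrative, not from the book):

    ```python
    # Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
    # All numbers are invented for illustration.

    prior = 0.10            # P(buy): prior probability a reader gets the book
    p_e_given_h = 0.90      # P(read this entry | buy)
    p_e_given_not_h = 0.30  # P(read this entry | don't buy)

    # Total probability of the evidence, P(E), via the law of total probability.
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)

    posterior = p_e_given_h * prior / evidence
    print(f"P(buy | read this entry) = {posterior:.1%}")  # 25.0%, alas not 101%
    ```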
  • Good thread on How to design a data warehouse in HBase? Use a column database instead, flatten your schema, use Hive, is HDFS too slow? Nice summary by Michael Segel: Hive is good if you're working with a large portion of your underlying data set. HBase is better if you're looking at a relatively smaller subset of the overall data. In both cases joins are expensive, and if you flatten your data against your dominant use case you can get decent performance. Again, this is where secondary indexes, including search, can help.
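    To make "flatten against your dominant use case" concrete, here's a hedged sketch of an HBase-style composite row key; the entities and key layout are invented for illustration:

    ```python
    # Hypothetical example: the dominant query is "all orders for a
    # customer in a time range", so encode that access path into the
    # row key and answer it with one contiguous scan instead of a join.

    def order_row_key(customer_id: str, order_ts_ms: int, order_id: str) -> bytes:
        # Fixed-width timestamp so keys sort lexicographically in time
        # order within a customer.
        return f"{customer_id}|{order_ts_ms:013d}|{order_id}".encode()

    # Scan boundaries for one customer's orders over one day.
    start_key = order_row_key("cust42", 1355443200000, "")
    stop_key  = order_row_key("cust42", 1355529600000, "~")
    ```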
  • On the eventual inconsistency design in the brain: Brain circuits run their own clocks: a study now suggests that timekeeping is decentralised, with different circuits having their own timing mechanisms for each specific activity.
  • When virality attacks: the story of how a CDN helped Blitz Bomb scale from 40 visitors a day to 6 every second.
  • I would have liked more specific advice, but this Tutorial: Checksum and CRC Data Integrity (Philip Koopman) is richly detailed and very useful.
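    If you want to poke at the subject in code, here's a minimal bit-by-bit CRC-32 (the reflected 0xEDB88320 polynomial) checked against Python's stdlib; real implementations use a table-driven variant for speed:

    ```python
    import binascii

    def crc32_bitwise(data: bytes) -> int:
        # Reflected CRC-32: init to all ones, fold each byte into the
        # low bits, then shift-and-conditionally-xor eight times.
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        return crc ^ 0xFFFFFFFF

    msg = b"highscalability"
    assert crc32_bitwise(msg) == binascii.crc32(msg)  # matches the stdlib
    ```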
  • Age-old discussion: the cloud vs. dedicated servers. thaumaturgy nails it: You should move up the hosting chain as your needs demand it, and not before.
  • Good question: Ask HN: How are you dealing with scraping hits from EC2 machines? Most popular answer: add more resources. 
  • Thinking Methodically about Performance: For every resource, check utilization, saturation, and errors. Utilization as a percent over a time interval. Saturation as a wait queue length. Errors as the number of errors reported.
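    A minimal sketch of those three checks for the CPU resource on Linux, reading /proc directly (the saturation proxy here, the 1-minute load average, is a simplification of the method's run-queue-length metric):

    ```python
    import time

    def cpu_use_snapshot(interval: float = 1.0) -> dict:
        # Utilization: busy jiffies over total jiffies across an interval.
        def busy_and_total():
            with open("/proc/stat") as f:
                fields = [int(x) for x in f.readline().split()[1:]]
            idle = fields[3] + fields[4]  # idle + iowait
            return sum(fields) - idle, sum(fields)

        busy0, total0 = busy_and_total()
        time.sleep(interval)
        busy1, total1 = busy_and_total()

        # Saturation: 1-minute load average as a rough run-queue proxy.
        with open("/proc/loadavg") as f:
            load1 = float(f.read().split()[0])

        return {
            "utilization_pct": 100.0 * (busy1 - busy0) / (total1 - total0),
            "saturation_load1": load1,
            # Errors (e.g. machine-check counts) come from hardware-specific
            # sources and are omitted in this sketch.
        }

    print(cpu_use_snapshot())
    ```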
  • Ivan Pepelnjak with some implications of moving to the 10GE world: if you want a 3:1 oversubscription ratio, you need 16 fiber pairs to connect a 64-port 10GE ToR switch (or a 48x10GE+4x40GE switch, or a 16-port 40GE switch) to the network core. The arithmetic: 48 server-facing 10GE ports is 480 Gbps, so 3:1 oversubscription calls for 160 Gbps of uplink, i.e. 16 x 10GE pairs.
  • Really interesting bug: Incorrect SyncDataType parsing for throttled types causes Chrome to crash; the crash is due to faulty logic responsible for handling "throttled" data types on the client when the data types are unrecognized. < That's why change is so hard to deal with. Production is so much richer an environment than test.
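    The general defensive pattern is worth internalizing: treat unrecognized values as data, not as a reason to die. A hedged sketch (the type names are invented; this is not Chrome's actual sync code):

    ```python
    from enum import Enum

    class SyncDataType(Enum):
        BOOKMARKS = 1
        PASSWORDS = 2
        PREFERENCES = 3

    def parse_data_type(raw: int):
        # Unknown type IDs show up whenever the server is newer than the
        # client; surface them as None so the caller can skip them
        # instead of crashing.
        try:
            return SyncDataType(raw)
        except ValueError:
            return None

    assert parse_data_type(2) is SyncDataType.PASSWORDS
    assert parse_data_type(99) is None  # a future type this client hasn't heard of
    ```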
  • Linux TCP/IP Tuning for Scalability. Philip Tellis talks about how LogNormal pushes their hardware and how they tune it to handle a large number of requests. These changes made it so "a single quad core vm with 1Gig of RAM has been able to handle all the load that’s been thrown at it":  ephemeral ports; TIME_WAIT state; connections; TCP window sizes; window size after idle.
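    Those knobs live under /etc/sysctl.conf; the values below are illustrative starting points, not the article's exact numbers, so test them under your own load:

    ```
    # Illustrative values only -- not LogNormal's exact tuning.
    net.ipv4.ip_local_port_range = 1024 65535  # widen the ephemeral port range
    net.ipv4.tcp_fin_timeout = 30              # shorter FIN-WAIT-2 timeout
    net.ipv4.tcp_tw_reuse = 1                  # reuse TIME_WAIT sockets for new outbound connections
    net.core.somaxconn = 1024                  # deeper accept queue
    net.ipv4.tcp_max_syn_backlog = 4096        # more half-open connections
    net.ipv4.tcp_rmem = 4096 87380 16777216    # receive window: min default max
    net.ipv4.tcp_wmem = 4096 65536 16777216    # send window: min default max
    net.ipv4.tcp_slow_start_after_idle = 0     # don't shrink the window after idle
    ```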
  • Another performance idea: forcing CPU affinity can make a single-threaded process run 2-3x faster.
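    On Linux you can pin a process from the shell with taskset, or from Python 3.3+ with os.sched_setaffinity; a minimal sketch (whether you see the 2-3x is workload- and cache-dependent):

    ```python
    import os

    # Pin the current process (pid 0 = self) to CPU 0 so the scheduler
    # stops migrating it between cores and its caches stay warm. Linux only.
    os.sched_setaffinity(0, {0})

    print("now restricted to CPUs:", os.sched_getaffinity(0))
    ```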
  • Cloud Applications Done Right - Part 3: Scalability, Availability, and Performance. Intacct talks about their architecture: we've created pools of servers dedicated to several different types of activity. We have web servers handling requests from our browser-based user interface, a pool of servers handling “offline” requests (like large runs of scheduled invoices), a pool of servers running large reports, and a pool of servers dedicated to web services API calls. We also have pools of servers that do nothing more than cache information so it’s readily accessible to any pool of servers and reduces load across the entire operation. 
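    A toy sketch of that pool-per-activity idea (pool names and hosts invented; real routing would live in a load balancer):

    ```python
    import itertools

    # Dedicated pools per activity type, in the spirit of the setup above.
    POOLS = {
        "ui":      ["web1", "web2"],
        "offline": ["batch1", "batch2"],  # e.g. large scheduled invoice runs
        "reports": ["report1", "report2"],
        "api":     ["api1", "api2"],
    }

    _round_robin = {name: itertools.cycle(hosts) for name, hosts in POOLS.items()}

    def route(request_type: str) -> str:
        # Round-robin within the pool dedicated to this kind of work, so
        # a big report run can't starve interactive UI requests.
        return next(_round_robin[request_type])

    assert route("api") in POOLS["api"]
    ```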
  • With ParAccel such a big part of Amazon's Redshift announcement, you might be interested in Curt Monash's ParAccel update.  And while you are on Curt's site, take a look at Spark, Shark, and RDDs — technology notes, describing a next gen Hadoop from UC Berkeley’s AMPLab. 
  • Surge 2012: Real-time in the real world. Bryan Cantrill & Brendan Gregg give a nice talk focusing on problems encountered with data-intensive real-time (DIRT) applications: "We talked through a few examples, including TCP connect latency due to dropped SYN packets, disk and file system latency, and memory leaks."