Stuff The Internet Says On Scalability For March 15, 2013

Hey, it's HighScalability time:

  • 0: # of Google Readers; 2.5 billion/day: new pieces of content added to Facebook; 2.7 billion/day: likes added to Facebook; 7PB/month: photos added to Facebook
  • Quotable Quotes:
    • @cwgem: It seems like cutting down API access is the stock scalability answer these days
    • @abenik: @Prismatic surfaced this article on their architecture for me. How meta.
    • @NewsBlur: The waters are rocky now, but take note that I have some time to get things right. I'm working this week to get things stable, then scale.
    • @Pinboard: Just learned that Google Reader no longer offers direct JSON export. I guess they held the annual "What should we ruin next?" staff retreat
    • @DEVOPS_BORAT: You can not able have unlimit scalability without unlimit outage.
    • Jeff: Amazon RDS Scales Up - Provision 3 TB and 30,000 IOPS Per DB Instance
    • @migueldeicaza: Google recently hired all of Twitter's scalability team to work on Google IO checkout.
    • @skamille: Interesting to consider the greatly diminished role of networked file systems in modern distributed computing
    • @vambenepe: The server huggers have regrouped. Now they’re VM huggers, ironically. Fighting PaaS with all their might.
    • @jezhumble: If the developers can't self-service everything they need programmatically through an API, it's not a private cloud.
    • @Bremmel: Foursquare users crawl the real world like Google's spiders crawl the web - Dennis Crowley
    • @mollstam: SimCity's API (and I'm guessing region storage) is on Amazon. How can it not be auto-scaling? How can it take three days to add servers?
    • @josephmartz: Scalability gurus: It's about low coupling merging with high cohesion. More encapsulation and f*ck scaling out. I just want one #node.
    • @SQLSniper: great recipe for #sqlserver scaling from @GlennAlanBerry precon :) "scale up is like pets, scale out is like cattle" 
  • @NewsBlur's tweet feed is a great blow-by-blow of the crush that happened when Google Reader became a dead app walking. A signup a second...Now fetching millions of new feeds hourly...Suspended free accounts, premium accounts only...Prices increased...Redis suffered from memory corruption...Moved from one app server to 6...Dropped SES for Mailgun...Hosting provider died...Bringing up PostgreSQL read slaves (routing sketched below)...DB server upgrades...Introduction of HAProxy...More app servers.
    • Spinning up more servers is unfortunately not a possibility. Too many moving parts, and real database sharding takes weeks to write.
    • You would think, 24+ hours after the Reader bombshell, that things would have settled down by now. They have not.
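
For those following along, the "PostgreSQL read slaves" step usually means routing at the application layer: writes go to the primary, reads fan out to replicas. A minimal, hypothetical sketch (NewsBlur's actual code isn't shown in the thread):

```python
import random

class ReadWriteRouter:
    """Hypothetical sketch: send writes to the primary and spread
    reads across read slaves, as in NewsBlur's scaling step."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def connection_for(self, sql):
        # Crude routing rule: anything that isn't a SELECT mutates state
        # and must hit the primary; reads can go to any replica.
        if sql.lstrip().upper().startswith("SELECT"):
            return random.choice(self.replicas)
        return self.primary

router = ReadWriteRouter("pg-primary:5432", ["pg-replica1:5432", "pg-replica2:5432"])
print(router.connection_for("SELECT * FROM feeds"))      # a replica
print(router.connection_for("UPDATE feeds SET url=''"))  # the primary
```
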
  • If scalability is specialization, Facebook is becoming very special by building energy efficient cold storage datacenters for storing mostly useless bits of photos. They are using erasure coding, spinning down disks, flash to hold indexes, and low power drives. New high-speed datacenter interconnects are on the way as well. Open Vault.

  • Thoughtful interview with Steve Huffman on Triangulation. Loved the point that you can copy a UI but you can't copy the culture and the process that created it. That's the real life of a product and it can't be faked.

  • And here's a good interview with Jack Dorsey by Fred Wilson. Biggest lesson at Twitter: they didn't focus energy on instrumentation and data, which caused a lot of drama and infighting. They hired and acquired outside expertise that brought in a different culture. Problem solved.

  • Facebook updates indexes with billions of changes each day, live, within seconds, building out the infrastructure for Graph Search. The key is Unicorn, a search service that maintains a many-to-many index of keys to entities, an infrastructure for building and incrementally updating indexes, and constraint-based searching over those indexes. It's all about the indexes. Nice explanation of how searches work. The indexes are built by map-reduce jobs over Hive tables that produce inverted index data structures. Video.
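
To make the index structure concrete, here's a minimal Python sketch (not Facebook's code) of a many-to-many inverted index with incremental updates and intersection-style constraint search; Unicorn's real service adds sharding, ranking, and richer operators on top of this idea:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy many-to-many index: each key maps to a set of entity ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, key, entity_id):
        # Incremental update: new edges become searchable immediately,
        # no batch rebuild required.
        self.postings[key].add(entity_id)

    def remove(self, key, entity_id):
        self.postings[key].discard(entity_id)

    def search(self, *keys):
        # Constraint-based search: intersect posting sets so only
        # entities matching every key are returned.
        sets = [self.postings[k] for k in keys]
        return set.intersection(*sets) if sets else set()

# Example: friends who like running
idx = InvertedIndex()
idx.add("friend:alice", "bob")
idx.add("likes:running", "bob")
idx.add("likes:running", "carol")
print(idx.search("friend:alice", "likes:running"))  # {'bob'}
```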

  • Hash addiction. In What's Going On, Jason Moiron riffs on Why Python, Ruby, and Javascript are Slow by Alex Gaynor. Hash/map addiction and over-allocation are the major sources of slowness in scripting languages. Normal Go code is already fairly efficient; special tricks are not necessary, nor is it necessary to mind-meld through multiple layers of complexity to get good performance.
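
Gaynor's hash-addiction point is easy to see in Python itself: using a dict for every record means each field access is a hash lookup and each record carries an over-allocated hash table, while a class with __slots__ stores fields at fixed offsets. A rough sketch (exact sizes vary by Python version):

```python
import sys

# The "hash addiction" style: every record is a hash table,
# every field access a hash lookup.
point_as_dict = {"x": 1.0, "y": 2.0}

# A fixed-layout alternative: fields live at known offsets,
# and no per-record hash table is allocated.
class Point:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

point = Point(1.0, 2.0)
print(sys.getsizeof(point_as_dict))  # typically a few hundred bytes
print(sys.getsizeof(point))          # typically well under 100 bytes
```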

  • Airbnb has a lot of good tech talks available. Steve Souders just gave a talk on Wednesday, for example.

  • 1600% faster app requests with Rails on Heroku. Synchronous is out; an evented request model using EventMachine and Rainbows! is in. Good discussion on Hacker News of threads, queues, and the proper mix of evented and sync.
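
The post is Ruby (EventMachine plus the Rainbows! server), but the sync-versus-evented trade-off it describes is easy to sketch in Python with asyncio, assuming a handler that spends most of its time waiting on I/O:

```python
import asyncio
import time

async def handle_request(i):
    # Stand-in for a slow upstream call (API, database, etc.).
    await asyncio.sleep(0.5)
    return f"response {i}"

async def main():
    start = time.monotonic()
    # Evented: 100 in-flight requests share one thread; total time is
    # ~0.5s instead of ~50s for a purely synchronous, one-at-a-time loop.
    results = await asyncio.gather(*(handle_request(i) for i in range(100)))
    print(f"{len(results)} requests in {time.monotonic() - start:.2f}s")

asyncio.run(main())
```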

  • Big Data: Improving Hadoop for Petascale Processing at Quantcast. Quantcast processes tens of petabytes per day, measuring millions of web destinations and observing billions of media consumption events. Really good interview on how they apply machine learning at scale. Talks about Sailfish, a redesigned shuffle phase that exploits high-speed networking; it aggregates data network-wide and collects stats for better planning.

  • Interesting idea. Building User-Extensible Webapps with Local. Local is an in-browser program architecture: Web Workers run applications, and REST is used as a unified interface, so calls to Web Workers and remote sites are treated the same. Also, inter-window messaging using localStorage.

  • Facebook uses McDipper, a key-value cache for flash storage, to serve over 150 Gb/s to a CDN. It's a highly performant flash-based cache server that is memcache protocol compatible. The main design goals of McDipper are to make efficient use of flash storage (i.e., to deliver performance as close to that of the underlying device as possible) and to be a drop-in replacement for Memcached. McDipper has been in active use in production at Facebook for nearly a year.
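
The post doesn't include code, but its stated design goals (keep the key index in RAM, keep values on flash, make flash-friendly sequential writes) can be sketched roughly as below; the file-offset layout here is an assumption for illustration, not McDipper's actual on-flash format:

```python
import os

class FlashBackedCache:
    """Hypothetical sketch of a flash-backed KV cache: the key index
    lives in RAM, values live in an append-only file on flash."""

    def __init__(self, path):
        self.index = {}             # key -> (offset, length), held in RAM
        self.log = open(path, "a+b")

    def set(self, key, value: bytes):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)       # one sequential write, flash-friendly
        self.index[key] = (offset, len(value))

    def get(self, key):
        if key not in self.index:
            return None             # cache miss
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

cache = FlashBackedCache("/tmp/mcdipper_sketch.log")
cache.set(b"photo:123", b"...jpeg bytes...")
print(cache.get(b"photo:123"))
```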

  • Adam Holmberg with a good recap of Strata 2013. Lots of Hadoop, and SQL is making a comeback.

  • Curt Monash answers the question posed by One database to rule them all?: nobody has ever discovered a data layout that is efficient for all usage patterns.
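
A quick illustration of Monash's point in Python: a row layout makes fetching one whole record cheap, a column layout makes scanning one field cheap, and neither layout wins at both:

```python
# Row store: great for "fetch everything about order 2".
rows = [
    {"id": 1, "amount": 10.0, "country": "US"},
    {"id": 2, "amount": 25.0, "country": "DE"},
    {"id": 3, "amount": 7.5,  "country": "US"},
]
order = rows[1]  # one contiguous record

# Column store: great for "sum all amounts" -- one tight scan over a
# single list, but reassembling a full record touches every column.
columns = {
    "id": [1, 2, 3],
    "amount": [10.0, 25.0, 7.5],
    "country": ["US", "DE", "US"],
}
total = sum(columns["amount"])
record = {name: col[1] for name, col in columns.items()}
print(order, total, record)
```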

  • How do I know when to scale my node.js server? Just monitor your event loop: if it starts backing up by more than 200 ms on a regular basis, you'll need to scale.
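
The node.js trick is to measure how late a scheduled timer fires. Here's the same idea sketched for Python's asyncio (an analogous example, not the article's code):

```python
import asyncio
import time

LAG_THRESHOLD = 0.2  # the article's 200 ms rule of thumb

async def monitor_event_loop(interval=0.1):
    # A sleep should wake up after `interval`; any extra delay is time
    # the loop spent blocked in other callbacks, i.e., backlog.
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        if lag > LAG_THRESHOLD:
            print(f"event loop lagging {lag * 1000:.0f} ms -- time to scale")

async def main():
    asyncio.create_task(monitor_event_loop())
    # ... run the actual application here ...
    await asyncio.sleep(1)

asyncio.run(main())
```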

  • Straight from the trenches. Rebuilding DoubleClick with AngularJS. Marc Jacobs gives the low-down on AngularJS, but he also talks about how to structure large systems with large teams, how they architect their UI, how they do testing, how they do releases, and much more. There's a refreshing emphasis on testing and on seldom covered topics like internationalization. Well worth watching if you are interested in learning how to run a team developing a complex web app.

  • Filed under awesome - Creating Indestructible Self-Healing Circuits:  It was incredible the first time the system kicked in and healed itself. It felt like we were witnessing the next step in the evolution of integrated circuits. The chip's brain does not operate based on algorithms that know how to respond to every possible scenario. Instead, it draws conclusions based on the aggregate response of the sensors. You tell the chip the results you want and let it figure out how to produce those results.

  • Tera-scale deep learning: I describe the key ideas that enabled scaling deep learning algorithms to train a very large model on a cluster of 16,000 CPU cores (2000 machines). This network has 1.15 billion parameters, which is more than 100x larger than the next largest network reported in the literature.

  • Ask HN: What made SQL databases so popular? One thing I'm really missing in SQL is arrays; even simple arrays of scalars would make a huge difference.

  • Nice way of looking at cloud architecture. Why an EC2 Instance Isn’t a Server: EC2 instances are intended to be treated as disposable building blocks that provide dynamic compute resources to a larger application.

  • eBay's quantified self in the form of an awesome looking dashboard: some of our software engineers saw that by slightly decreasing the memory allocated for an application in a pool of servers, they could remove 400 servers from the pool. This insight helped us eliminate nearly a megawatt of power consumption and avoid spending more than $2 million to refresh the servers. This simple software tweak helped us lower power consumption, decrease costs, and increase system performance, ultimately increasing our revenue per transaction. And that’s just one example.
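    • Back of the envelope: nearly a megawatt spread over the roughly 400 removed servers works out to about 2.5 kW per server, which presumably includes cooling and facility overhead on top of the machines themselves.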

  • DBSeer: Resource and Performance Prediction for Building a Next Generation Database Cloud: We argued for three key needs before clouds are appropriate for database services: (i) pricing schemes that reflect their operational costs but are also simple and intuitive to users, (ii) efficient mechanisms to isolate the performance of tenants from each other, while allowing soft sharing of resources, and (iii) workload-specific tuning for each tenant.

  • jetsnoc, on GitHub getting DDoSed again: It may be time for GitHub to build out multiple availability data centers and use BGP as an anycast tool. We do this. I have public facing IPv4 space that is announced from multiple facilities. Having an IP address hosted from multiple facilities is a powerful tool. It allows providers to reach our datacenters through the fewest ASN routes. We originally did this to minimize latency and create faster regional transaction processing. As an added benefit, DDoS traffic also gets routed to the nearest facility, "load balancing" the DDoS so that it only affects a single facility, or splitting up the 10 Gbps of traffic among many facilities if it is coming from many sources. O'Reilly's BGP book has a great chapter on "Anycast."

  • Scaling the Shard. So not what you think, but it's quite the head fake of a title.