
Stuff The Internet Says On Scalability For February 3, 2012

I'm only here for the HighScalability:

  • 762 billion: objects stored on S3; $1B/Quarter: Google spend on servers; 100 Petabytes: Storage for Facebook's photos and videos.
  • Quotable Quotes:
    • @knorth2: #IPO filing says #Facebook is "dependent on our ability to maintain and scale our technical infrastructure"
    • @debuggist: Scalability trumps politics.
    • @cagedether: Hype of #Hadoop is driving pressure on people to keep everything
    • @nanreh: My MongoDB t shirt has never helped me get laid. This is typical with #nosql databases.
    • @lusis: I kenna do it, Capt'n. IO is pegged, disk is saturated…I lost 3 good young men when the cache blew up!
    • Kenton Varda: Jeff Dean puts his pants on one leg at a time, but if he had more than two legs, you'd see that his approach is actually O(log n)
  • Once upon a time manufacturing was located near rivers for power. Likewise, software will be located next to storage, CPU, and analytics resources in a small cartel of clouds. That's the contention of Here Come the Cloud Cartels. This tributary system (pun intended) will be Amazon, Cisco Systems, Google, I.B.M., Microsoft, Oracle and a few competitors. Supposedly the benefit will be cheap computing, but when has a cartel ever led to cheap anything?
  • bwarp thinks the advent of memristor technologies will change everything: This is the next wave of technology. It solves so many fundamental problems. I can't wait to have a few GiB of MRAM in a workstation. Load/store architectures may go away due to this. Imagine 32Gb of CPU registers...You don't switch contexts, you maintain parallel contexts...A process's state would be a mapped linear segment of memory so you can have as many contexts as you can fit in RAM. There is no need then for traditional "save everything" context switching. You just move the CPU's execution context to a different area in RAM and the context is there.
  • Scale revenue the Pinterest way: cash in on affiliate money by turning links into affiliate links using SkimLinks. Also, Pinterest Is Not A Virtual Pinboard.
  • A hosting duo: Great thread on Hacker News pondering Super Cheap Virtual Private Servers - the Wild West of Hosting. If you are deciding on which hosting option to select, this will both help and make you more confused. Also, The Five Stages of Hosting is a useful read as is the discussion on Hacker News.
  • Replication for read-scalability. Good thread on replicating over WAN with encryption. 
  • eBay explains how they use BitTorrent for package distribution within and between datacenters. Makes sense. BitTorrent is designed for scale in bandwidth-constrained environments where clients may be slow and unreliable. Also, Zynga on Updating thousands of configuration files in under a second.
  • Netflix releases Astyanax, a Swiss army knife client for Cassandra. Netflix has 5 separate Cassandra clusters, ranging from 6 to 48 nodes, so it just may work in production. It has: connection pool, cassandra-thrift API implementation, recipes and utilities.
  • Bulk email sucks. 37signals shares how they do it with 99.3% joy: they run three email-relay Postfix servers; use Campaign Monitor for email lists; deliver to tens of thousands of remote mail servers from about 15 unique IP addresses; tag each message uniquely so they can monitor success; monitor spam blacklists; use SPF records; and more.
  • Great details on Switching to Heroku: A Django App Story: Minimize moving parts; Environment setup; Switching to Postgres; Switching to S3. Moving to Heroku has been easily the largest productivity increase we’ve had. We went from pushing once every few days to pushing a few times a day. We’ve seen a massive reduction in our downtime.
  • Probabilistically Bounded Staleness: Better understanding latency vs. consistency trade-offs in Riak. Interesting thread on how to optimize the trade-off between latency and consistency provided by partial quorums (R+W <= N) by predicting both with high accuracy.
  • Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB. Great interview on NuoDB's innovative technology implementing a distributed in-memory SQL system. It's unlike any system out there: no single point of failure; MVCC based; elastic; self-healing; transaction nodes are separate from storage nodes; geographically distributed.
  • David Farber predicts that Internet protocols simply aren't adequate for the changes in hardware and network use that will come up in a decade or so: computers will be equipped with optical connections instead of pins for networking, and the volume of data transmitted will overwhelm routers.
  • With a cloud and a good set of APIs, have we entered the NoOps era for developers? So nobody has to understand how things work, make it work in the first place, or fix it when it breaks? That's a lot of pixie dust...
  • Will the use of SSD increase the speed of DBMS? After tests, Yeul Lee concludes: it would be more efficient cost-wise to install the data buffer to around 10 G instead of changing the storage medium to SSD.
  • EVCache: Netflix’s implementation of a highly scalable memcache-based caching solution, internally referred to as EVCache.
  • Just because it's possible doesn't mean it's easy. Calculating In-Degree using R MapReduce over Hadoop using one MapReduce job composed with another.
  • It makes sense when you think about it, but Green Disk Sizing shows faster disks use more power. What is surprising is power consumption increases nonlinearly in the size and RPM of disks. Important for data center power and capacity planning.
  • Greg Linden with another interesting set of Quick Links.
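
To make the partial-quorum condition (R+W <= N) from the Probabilistically Bounded Staleness item above a bit more concrete, here is a minimal Python sketch. It rests on a deliberately simplified model that is an assumption of this example, not something taken from the linked thread: the latest write has reached exactly W of the N replicas, and a read contacts R replicas chosen uniformly at random, ignoring anti-entropy, read repair, and the time it takes writes to propagate. The function names are hypothetical.

```python
from math import comb  # available in Python 3.8+


def is_strict_quorum(n: int, r: int, w: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


def stale_read_probability(n: int, r: int, w: int) -> float:
    """Chance that a read sees none of the W replicas holding the latest write.

    Simplified model (an assumption of this sketch): the write sits on exactly
    W of N replicas and the read picks R replicas uniformly at random.
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    # Ways to pick all R replicas from the N - W that missed the write,
    # divided by all ways to pick R replicas. comb() returns 0 if R > N - W.
    return comb(n - w, r) / comb(n, r)


if __name__ == "__main__":
    for n, r, w in [(3, 1, 1), (3, 1, 2), (3, 2, 2)]:
        kind = "strict" if is_strict_quorum(n, r, w) else "partial"
        p = stale_read_probability(n, r, w)
        print(f"N={n} R={r} W={w}: {kind} quorum, P(stale read) = {p:.3f}")
```

Under this model a strict quorum always yields a stale-read probability of zero, while shrinking R or W buys lower latency at the price of a nonzero chance of reading old data. Real systems do better than this static picture, because writes keep propagating to the remaining replicas after they are acknowledged.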

Reader Comments (2)

Regarding SSDs - the great advantage of SSDs for databases is not average performance but worst-case performance. With a suitably large buffer you can get great performance on your common queries with regular disks... And then you run a query that isn't in the buffer, or worse, thrashes the buffer, and your performance falls right off a cliff. SSD random read performance is so much better than disk that the problem all but disappears.

Oh, and MongoDB? No wonder. Wear a Redis t-shirt and you'll be fighting girls off with a stick.

February 4, 2012 | Unregistered CommenterPixy Misa

Your "Will the use of SSD increase the speed of DBMS?" link is bad.

Here's another interesting article http://codeascraft.etsy.com/2012/01/23/solr-bittorrent-index-replication/ Etsy uses bittorrent to speed up distribution of their updated Solr indexes.

February 4, 2012 | Registered Commentermxx
