Stuff The Internet Says On Scalability For March 27th, 2015

Hey, it's HighScalability time:


@scienceporn: That Hubble Telescope picture explained in depth. I have never had anything blow my mind so hard.

  • hundreds of billions: files in Dropbox; $2 billion: amount Facebook saved building their own servers, racks, cooling, storage, flat fabric, etc.
  • Quotable Quotes:
    • Buckminster Fuller: I was born in the era of the specialist. I set about to be purposely comprehensive. I made up my mind that you don't find out something just to entertain yourself. You find out things in order to be able to turn everything not just into a philosophical statement, but actual tools to reorganize the environment of man by which greater numbers of men can prosper. That's been my main undertaking.
    • @mjpt777: PCI-e based SSDs are getting so fast. Writing at 1GB/s for well less than $1000 is so cool.
    • @DrQz: All meaning has a pattern, but not all patterns have a meaning.
    • Stu: “Exactly once” has *always* meant “at least once but dupe-detected”. Mainly because we couldn’t convince customers to send idempotent and commutative state changes. (A minimal dedupe sketch follows these quotes.)
    • @solarce: When some companies have trouble scaling their database they use Support or Consultants. Apple buys a database vendor. 
    • @nehanarkhede: Looks like Netflix will soon surpass LinkedIn's Kafka deployment of 800B events a day. Impressive.
    • @ESPNFantasy: More than 11.57 million brackets entered. Just 14 got the entire Sweet 16 correct.
    • @BenedictEvans: A cool new messaging app getting to 1m users is the new normal. Keeping them, and getting to 100m, is the question.
    • @jbogard: tough building software systems these days when your only two choices are big monoliths and microservices
    • @nvidia: "It isn't about one GPU anymore, it's about 32 GPUs" Andrew Ng quotes Jen-Hsun Huang. GPU scaling is important #GTC15
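
Stu's point is easy to miss in the abstract, so here's a minimal sketch of what "at least once but dupe-detected" looks like in practice (Go, all names hypothetical): the consumer remembers which message IDs it has already applied, so redelivery becomes a no-op.

```go
package main

import "fmt"

// Processor turns at-least-once delivery into effectively-exactly-once
// processing by remembering which message IDs it has already applied.
type Processor struct {
	seen  map[string]bool // dedupe table; in a real system this must be durable
	total int
}

func NewProcessor() *Processor {
	return &Processor{seen: make(map[string]bool)}
}

// Apply skips messages it has seen before, so a redelivered message
// (the "at least once" part) does not get applied twice.
func (p *Processor) Apply(msgID string, amount int) {
	if p.seen[msgID] {
		return // duplicate delivery detected
	}
	p.seen[msgID] = true
	p.total += amount
}

func main() {
	p := NewProcessor()
	p.Apply("msg-1", 10)
	p.Apply("msg-1", 10) // broker redelivers; dedupe makes it a no-op
	fmt.Println(p.total) // prints 10, not 20
}
```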

  • FoundationDB, a High Scalability advertiser and article contributor, has been acquired. Apple scooped them up. Though claims of needing just 5% to 10% of the hardware Cassandra requires seem unlikely. And immediately taking their software off GitHub sets a concerning precedent. It adds uncertainty to the entire product selection dance. Something to think about.

  • In the future when an AI tries to recreate a virtual you from your vast data footprint, the loss of FriendFeed will create a big hole in your virtual personality. I think FF catches a side of people that isn't made manifest in other mediums. Perhaps 50 years from now people will look back on our poor data hygiene with horror and disbelief. How barbaric they were in the past, people will say. 

  • When the nanobots turn the world to goo this 3D printer can recreate it again. New 3-D printer that grows objects from goo. Instead of a world marked by an endless battle of good vs evil we'll have a ceaseless cycle of destruction and rebirth through goo. That's unexpected. A modern mythology in the making.

  • Videos from the Redis Conference 2015 are now available. Some interesting-sounding talks.

  • talisto: You may want to look into the "Compute Optimized" EC2 instances. A c4.large instance is cheaper than an m3.large instance (and less than half the cost of an m3.xlarge instance), and c4 instances will allow you to enable enhanced networking. We've noticed that the latency on our instances went down substantially after migrating from m3 to c4 instances and enabling enhanced networking.

  • There's a new force out there. It's the power of matching large numbers of people with data. Genetics And That Striped Dress. 23andMe was able to round up 25,000 people to answer questions about that damn dress and then data mine their collective DNA for patterns. What did they find? It doesn't matter. What matters is that they can do it.

  • How do you back up the Internet Archive? It's doubling every 30 months. The Opposite Of LOCKSS (Lots Of Copies Keep Stuff Safe). In 2017 backing up the Internet Archive would cost $3.6 million on Glacier. A rounding error in pizza budgets for many companies. The idea is to keep fewer copies of the data to cut storage costs. A form of erasure coding that supports parallel access is one possibility. The replication factor would be 1.5 instead of the typical 3.
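
The arithmetic behind that 1.5 figure: a hypothetical erasure code with k data shards and m parity shards stores (k+m)/k bytes per byte of data, versus 3 bytes per byte for triple replication. A back-of-the-envelope sketch:

```go
package main

import "fmt"

// overhead returns the storage multiplier for an erasure code with k data
// shards and m parity shards: (k+m)/k bytes stored per byte of user data.
func overhead(k, m int) float64 {
	return float64(k+m) / float64(k)
}

func main() {
	// A hypothetical (10, 5) code: any 10 of the 15 shards reconstruct the data.
	fmt.Printf("erasure (10,5): %.1fx\n", overhead(10, 5)) // 1.5x
	// Triple replication is the degenerate (1, 2) case.
	fmt.Printf("3x replication: %.1fx\n", overhead(1, 2)) // 3.0x
}
```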

  • Good list of Readings in Databases. A lot of stuff is on GitHub now. What happens if GitHub is bought and all this wonderful content is pitched into never never land?

  • It would be interesting to know more about what improbable.io is doing. The hints are tantalizing, but not specific enough to make sense of things. We've had scheduling and orchestration layers before. Improbable: enabling the development of large-scale simulated worlds: Improbable’s technology solves the parallelization problem for an important class of problems: anything that can be defined as a set of entities that interact in space; Hello [Simulated] World: You can imagine our approach as a swarm of decentralised, heterogeneous workers collaborating together to form a simulation much larger than any single worker can understand. As simulated worlds rapidly evolve, the most efficient distribution of these workers varies, so they need to migrate between physical machines in real time.

  • Uber and other resource sharing schemes should be happy about this. Millennials are moving to the city. Wrapping a service around other people's stuff will continue to be profitable.

  • Unlimited. That's a lot. Amazon Goes After Dropbox, Google, Microsoft With Unlimited Cloud Drive Storage

  • A new logical data layer? Any one-to-rule-them-all layer becomes just another layer to integrate later, when all the other layers grow up and want to rule them all. There are no End of History plays in software.

  • Fulfilling a Pikedream: the ups and downs of porting 50k lines of C++ to Go: In personal terms, however, I feel the outcome was suboptimal, in the sense that I wrote two to three times as much code as would have been needed in a language with parametric polymorphism.
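
The "two to three times as much code" figure follows from Go's lack of parametric polymorphism (at the time of writing): the same logic has to be restated for each concrete type. A trivial illustration:

```go
package main

import "fmt"

// Without parametric polymorphism the same logic is restated per type;
// multiply this by every container and utility in a 50k-line port.
func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func maxFloat64(a, b float64) float64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	fmt.Println(maxInt(3, 7), maxFloat64(2.5, 1.5))
}
```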

  • It seems we have a lot of errors in our error handling. No surprise there. Errors are combinatorial, features are linear. Very hard to test. Paper review: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems": Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes...almost all (92%) of the catastrophic system failures are the result of incorrect handling...58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of error handling code...To prevent this 23% of failures, the paper suggests 100% statement coverage testing on the error handling logic.
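
Acting on the paper's suggestion is less work than it sounds: inject a fault at the dependency boundary and assert the error actually surfaces. A minimal sketch, with a hypothetical store interface standing in for a real dependency:

```go
package main

import (
	"errors"
	"fmt"
)

// store is the dependency boundary where we inject a fault.
type store interface {
	Write(key, value string) error
}

// failingStore forces the error path to actually execute under test.
type failingStore struct{}

func (failingStore) Write(key, value string) error {
	return errors.New("disk full")
}

// saveCheckpoint is the code under test: its error handling must not
// silently swallow the failure.
func saveCheckpoint(s store) error {
	if err := s.Write("checkpoint", "state"); err != nil {
		return fmt.Errorf("checkpoint failed: %v", err)
	}
	return nil
}

func main() {
	// The whole "test": drive the error path and assert the error surfaces.
	if err := saveCheckpoint(failingStore{}); err == nil {
		panic("error was silently dropped")
	} else {
		fmt.Println("error path covered:", err)
	}
}
```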

  • Looking for some good hardware oriented podcasts? Then take a look at PC Perspective and This Week in Computer Hardware. Both are really good.

  • Excellent look at Understanding 802.11 Medium Contention: Ethernet stations can detect collisions over the wire because a portion of energy is reflected back to the transmitter when a collision occurs...Wi-Fi stations cannot detect collisions over the air and use a more cautious “randomized access” medium contention approach...Wi-Fi collision avoidance mechanisms include inter-frame spacing for different high-level frame types (for instance, control versus data frames) and a contention window to introduce randomness into the distributed medium contention logic of radio transmitters.
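
For the curious, the contention window mechanics reduce to binary exponential backoff over randomly chosen slots. A toy illustration (the initial window follows the common DCF pattern; everything else here is hypothetical):

```go
package main

import (
	"fmt"
	"math/rand"
)

// backoffSlot picks a uniform random slot in [0, cw], the way a station
// waits a random number of idle slots before transmitting.
func backoffSlot(cw int) int {
	return rand.Intn(cw + 1)
}

func main() {
	cw := 15 // a typical initial contention window (CWmin)
	for attempt := 1; attempt <= 4; attempt++ {
		fmt.Printf("attempt %d: wait %d slots (cw=%d)\n", attempt, backoffSlot(cw), cw)
		cw = cw*2 + 1 // on collision the window doubles, spreading stations out
	}
}
```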

  • LinkedIn made their home page faster. First, they improved iteration speed by creating a separate Git repository for the home page so it could be its own editable, deployable, testable unit. To make the page faster they used: server-side rendering, BigPipe to stream parts of the page independently, lazy loading of images, and redirect reduction. Scala is used because it's better at async programming than Java. Result: double-digit percentage speed improvements (across many regions in both 50th and 90th percentile).
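
BigPipe-style streaming isn't Scala-specific; the core trick is flushing the page shell before the slow backends respond. A minimal Go sketch of the idea (not LinkedIn's actual code):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// handler flushes the page shell immediately, then streams the slow
// fragment when it is ready, instead of blocking on the slowest backend.
func handler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	fmt.Fprint(w, "<html><body><div id=\"header\">Header</div>")
	flusher.Flush() // the browser can start rendering right away

	time.Sleep(100 * time.Millisecond) // stand-in for a slow backend call
	fmt.Fprint(w, "<div id=\"feed\">Feed</div></body></html>")
	flusher.Flush()
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```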

  • Wow. A highly technical yet practical look at Data Mining Problems in Retail.

  • Interesting problem: how does Dropbox index hundreds of billions of files? Firefly – Instant, Full-Text Search Engine for Dropbox (Part 1). The goal is to return search results in under 250 msecs and index changes instantly. A non-obvious problem: do you keep an index per user? That's 300 million indexes. They chose to follow their namespace architecture. I look forward to more details.

  • “Go’s design is a disservice to intelligent programmers” (nomad.so). Expect no resolution, but good points made on both sides. For my $$ MetaCosm has the sense of it: its purpose is to help ship products. It isn't sexy, it isn't fancy, it isn't interesting, it isn't groundbreaking. If it doesn't help you ship products, don't use it. Go will continue to grow at the insane pace it has because it hit a sweet spot and cared about standard library and documentation. You will continue to see high profile successes because it doesn't try to be fancy, it tries to get out of your way so you can do that job of actually transforming data. So you can be a programmer and it can be a language, not a hobby.

  • Straightforward explanation of How Edmunds.com Used Spark Streaming to Build a Near Real-Time Dashboard. They use Flume, Spark Streaming, HBase, Lily, Solr, Banana to calculate page view counts and unique visitor counts for every make and model.

  • Joyent is making themselves a cloud player. Triton: Docker and the “best of all worlds”. Good sense-making comment by _delirium: The main catch (which may also be a plus) is that the Joyent set of technologies is quite opinionated about how to manage a cloud, so you more or less have to "buy in" and do things its way. Your node OS will be SmartOS, nodes will be PXE-booted, your filesystem will be ZFS, the network topology will work in a specific way, etc. More practically, it also takes some time to be proficient in operating it (or even just get it working), because there are quite a few moving parts. Getting a SmartDataCenter deployment running, and then things like Manta and now sdc-docker running on top of it, is more than an afternoon's work. But if you're willing to buy in to the system and its choices fit your needs, it's really well engineered, imo much more "done right" than e.g. OpenStack is.

  • Cool 6-Pi cluster to serve a Drupal site. Lessons Learned building the Raspberry Pi Dramble

  • Apache Mesos: The True OS For The Software Defined Data Center: Google’s use of Borg became the inspiration for Twitter to develop a similar resource manager that could help them get rid of the dreaded “Fail Whale.”...Today, Mesos manages the placement of workloads for over 30,000 servers at Twitter and the Fail Whale is a thing of the past. Mesos is also deployed at companies such as Airbnb, eBay, and Netflix.

  • Both strange and deep. Very deep. Jitter: Making Things Better With Randomness: Communication in distributed systems isn't the only place that adding randomness comes in handy. It's a remarkably widespread idea that's found use across many areas of engineering. The basic pattern across all these fields is the same: randomness is a way to prevent systematically doing the wrong thing when you don't have enough information to do the right thing. Also, Bring the Noise: Embracing Randomness Is the Key to Scaling Up Machine Learning Algorithms.
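
The canonical example is retry timing: "full jitter" randomizes the whole backoff interval so a fleet of clients doesn't retry in lockstep after a shared failure. A minimal sketch:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// fullJitter returns a random delay in [0, min(maxDelay, base*2^attempt)).
// Randomizing the whole interval keeps clients from retrying in lockstep.
func fullJitter(attempt uint, base, maxDelay time.Duration) time.Duration {
	backoff := base << attempt
	if backoff > maxDelay || backoff <= 0 { // <= 0 guards shift overflow
		backoff = maxDelay
	}
	return time.Duration(rand.Int63n(int64(backoff)))
}

func main() {
	for attempt := uint(0); attempt < 5; attempt++ {
		fmt.Println(fullJitter(attempt, 100*time.Millisecond, 10*time.Second))
	}
}
```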

  • The eternal question: BLOBs in the database or BLOBs in the filesystem? Aren't BLOBs data? Shouldn't they be in the database? Shouldn't a database know how to do the right thing for different data types? These questions and more are addressed in Getting BLOBs out of the Database. Tajinder Singh: We created one central database for all metadata and one database for every month that contains tables related to blobs and some other related metadata tables. We create a new database for each month to store the blob data for that month.
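
The scheme amounts to routing each blob write to a database named for its month, with metadata kept centrally. A sketch of the routing function (names hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// blobDatabase maps a blob's creation time to the monthly database that
// holds its data; metadata stays in one central database.
func blobDatabase(created time.Time) string {
	return fmt.Sprintf("blobs_%04d_%02d", created.Year(), int(created.Month()))
}

func main() {
	t := time.Date(2015, time.March, 27, 0, 0, 0, 0, time.UTC)
	fmt.Println(blobDatabase(t)) // blobs_2015_03
	// Past months never change, so each monthly database can be backed
	// up once and then treated as read-only.
}
```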

  • If you are thinking about how to structure your engineering and product teams then Lessons Learned from Scaling a Product Team has some thoughtful advice.

  • Yes, using any sort of text message format is really just silly. an_message: Format Agnostic Data Transfer: On the deserialization side, Native gives us a ~9 fold increase in speed over JSON and a ~3.5 fold increase in speed over Protobuf. On the serialization side, the numbers are less dramatic but we still see a ~3 fold increase over json and numbers on par with Protobuf serialization. Avro is abysmal but this could be an artifact of the avro-c library and not the underlying format.

  • GopherConIndia 2015 had some really good talks. Principles of designing Go APIs with channels, with a lot of solid, real-world, experience-driven advice. “APIs should be sync” runs counter to much current thinking, but makes sense in a cheap-concurrency environment. Also liked Concurrent, High-Performance Data-Access With Go. Using a smart client to keep track of data locations enables load balancing. By using SSDs you can collapse the canonical load balancer, cache, and database hierarchy.
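
The "APIs should be sync" advice looks like this in practice: expose a blocking call and let callers opt into concurrency with go, rather than baking channels into the signature. A minimal sketch:

```go
package main

import (
	"fmt"
	"time"
)

// Fetch is deliberately synchronous: goroutines are cheap, so callers who
// want async wrap it in `go` rather than the API dictating channels.
func Fetch(key string) (string, error) {
	time.Sleep(10 * time.Millisecond) // stand-in for real I/O
	return "value-for-" + key, nil
}

func main() {
	// Synchronous use: simple, composable, easy to reason about.
	v, _ := Fetch("a")
	fmt.Println(v)

	// Asynchronous use: the caller opts in with one goroutine and a channel.
	results := make(chan string)
	go func() {
		v, _ := Fetch("b")
		results <- v
	}()
	fmt.Println(<-results)
}
```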

  • Like this, using Docker to distribute builds to developer laptops. Scale Your Jenkins Compute With Your Dev Team: Use Docker and Jenkins Swarm

  • Entropy-scaling search of massive biological data: The continual onslaught of new omics data has forced upon scientists the fortunate problem of having too much data to analyze. Luckily, it turns out that many datasets exhibit well-defined structure that can be exploited for the design of smarter analysis tools. We introduce an entropy-scaling data structure---which given a low fractal dimension database, scales in both time and space with the entropy of that underlying database---to perform similarity search, a fundamental operation in data science.