Stuff The Internet Says On Scalability For May 2nd, 2014

Hey, it's HighScalability time:

Google's new POWER8 server motherboard
  • 1 trillion: number of scents your nose can smell; millions of square feet: sprawling new server farms
  • Quotable Quotes:
    • Gideon Lewis-Kraus: As the engineer and writer Alex Payne put it, these startups represent “the field offices of a large distributed workforce assembled by venture capitalists and their associate institutions,” doing low-overhead, low-risk R&D for five corporate giants. In such a system, the real disillusionment isn’t the discovery that you’re unlikely to become a billionaire; it’s the realization that your feeling of autonomy is a fantasy, and that the vast majority of you have been set up to fail by design.
    • @aphyr: I was already sold on immutability, pure functions, combinators, etc. What forced me out of Haskell was the impenetrable, haphazard docs.
    • @monicalent: Going to be migrating away from Rackspace to save some cash. Recommendations? Both Digital Ocean and Linode are half the price for 1GB RAM.
    • Linus Torvalds: So while I'd love for 'make' to be super-efficient, at the same time I'd much rather optimize the kernel to do what make needs really well, and have CPU's that don't take too long either.
    • Joe Landman: when the networking revolution comes, the cheap switches will be the first ones against the wall
    • @etherealmind: Buying public cloud can say I can't afford a house so I'll buy a tent. Because that works just as well, right ?
    • Neil DeGrasse Tyson: The act of doing it perfectly is the measure of it going unnoticed.[....] when it's done perfectly it goes unnoticed or, at best, it's just taken for granted.
  • How we scaled Freshdesk (Part I) – Before Sharding. Requests per week boomed from 2 million to 65 million. They scaled vertically for as long as they could to handle the increased load. Increasing RAM, CPU and I/O. First using read slaves for their heavily read weighted traffic, then assigning queries to particular slaves. Writes still needed to scale, so they turned on MySQL partitioning. Then caching of objects and html partials. They also used different storage engines for different functions. RedShift for analytics and data mining. Redis for state and as a job queue. But in the end they had to embrace the shard.

  • This took some guts. Fouresquare uses data driven decision making to decide to deportalze their app: We looked at the session analysis and saw that only 1 in 20 sessions had both social and discovery. Why not actually just split those apart, because 19 out of 20 times, tapping on one icon or the other, you have satisfied your need completely.

  • The blockchain story is bullshit: Looking at the blockchain from a realist’s standpoint, it is not obvious that there is a need for a worse-performing database, that an unregulated oligarchy has disproportionate power over, that isn’t improved with administrator arbitration. It looks like a technology looking for a problem to solve, rather than a technology created to solve a problem.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so keep on going)...

Click to read more ...


Paper: Can Programming Be Liberated From The Von Neumann Style? 

Famous computer scientist John Backus, he's the B in BNF(Backus-Naur form) and the creator of Fortran, gave a Turing Award Lecture titled Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs, that has layed out a division in programming that lives long after it was published in 1977. 

It's the now familiar argument for why functional programming is superior:

The assignment statement is the von Neumann bottleneck of programming languages and keeps us thinking in word-at-a-time terms in much the same way the computer's bottleneck does.


The second world of conventional programming languages is the world of statements. The primary statement in that world is the assignment statement itself. All the other statements of the language exist in order to make it possible to perform a computation that must be based on this primitive construct: the assignment statement.

Here's a response by Dijkstra A review of the 1977 Turing Award Levgure by John Backus. And here's an interview with Dijkstra.

Great discussion on a recent Hacker News thread and an older thread. Also on Lambda the Ultimate. Nice summary of the project by David Bolton

There's nothing I can really add to the discussion as much smarter people than me have argued this endlessly. Personally, I'm more of a biology than a math and a languages are for communicating with people sort of programmer. So the argument from formal systems have never persuaded me greatly. Doing a proof of a bubble sort in school was quite enough for me. Its applicability to real complex systems has always been in doubt.

The problem of how to best utilize distributed cores is a compelling concern. Though the assumption that parallelism has to be solved at the language level and not how we've done it, at the system level, is not as compelling.

It's a passionate paper and the discussion is equally passionate. While nothing is really solved, if you haven't deep dived into this dialogue across the generations, it's well worth your time to do so. 


10 Tips for Optimizing NGINX and PHP-fpm for High Traffic Sites

Adrian Singer has boiled down 7 years of experience to a set of 10 very useful tips on how to best optimize NGINX and PHP-fpm for high traffic sites:

  1. Switch from TCP to UNIX domain sockets. When communicating to processes on the same machine UNIX sockets have better performance the TCP because there's less copying and fewer context switches.
  2. Adjust Worker Processes. Set the worker_processes in your nginx.conf file to the number of cores your machine has and  increase the number of worker_connections.
  3. Setup upstream load balancing. Multiple upstream backends on the same machine produce higher throughout than a single one.
  4. Disable access log files. Log files on high traffic sites involve a lot of I/O that has to be synchronized across all threads. Can have a big impact.
  5. Enable GZip
  6. Cache information about frequently accessed files
  7. Adjust client timeouts.
  8. Adjust output buffers.
  9. /etc/sysctl.conf tuning.
  10. Monitor. Continually monitor the number of open connections, free memory and number of waiting threads and set alerts if thresholds are breached. Install the NGINX stub_status module.

Please take a look at the original article as it includes excellent configuration file examples.


Sponsored Post: Apple,, PagerDuty, HelloSign, CrowdStrike, Gengo, ScaleOut Software, Couchbase, Tokutek, MongoDB, BlueStripe, AiScaler, Aerospike, LogicMonitor, AppDynamics, ManageEngine, Site24x7  

Who's Hiring?

  • Apple is hiring a Senior Engineer in their Mobile Services team. We seek an accomplished server-side engineer capable of delivering an extraordinary portfolio of features and services based on emerging technologies to our internal customers. Please apply here

  • Apple is hiring a Software Engineer in their Messaging Services team. We build the cloud systems that power some of the busiest applications in the world, including iMessage, FaceTime and Apple Push Notifications. You'll have the opportunity to explore a wide range of technologies, developing the server software that is driving the future of messaging and mobile services. Please apply here.

  • Apple is hiring an Enterprise Software Engineer. Apple's Emerging Technology Services group provides a Java based SOA platform for various applications to interact with each other. The platform is designed to handle millions of messages a day with very low latency. We have an immediate opening for a talented Software Engineer in a highly visible team who is passionate about exploring emerging technologies to create elegant scalable solutions. Please apply here

  • Engine Programmer - C/C++. Wargaming|BigWorld is seeking Engine Programmers to join our team in Sydney, Australia. We offer a relocation package, Australian working visa & great salary + bonus. Your primary responsibility will be to work on our PC engine. Please apply here

  • Senior Engineer wanted for large scale, security oriented distributed systems application that offers career growth and independent work environment. Use your talents for good instead of getting people to click ads at CrowdStrike. Please apply here.

  • Ops Engineer - Are you passionate about scaling and automating cloud-based websites? Love Puppet and deployment scripts? Want to take advantage of both your sys-admin and DevOps skills? Join HelloSign as our second Ops Engineer and help us scale as we grow! Apply at

  • Human Translation Platform Gengo Seeks Sr. DevOps Engineer. Build an infrastructure capable of handling billions of translation jobs, worked on by tens of thousands of qualified translators. If you love playing with Amazon’s AWS, understand the challenges behind release-engineering, and get a kick out of analyzing log data for performance bottlenecks, please apply here.

  • UI EngineerAppDynamics, founded in 2008 and lead by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop our their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big DataAppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for a Software Engineers (All-Levels) to design and develop scalable software written in Java and MySQL for backend component of software that manages application architectures. Apply here.

Fun and Informative Events

  • The Biggest MongoDB Event Ever Is On. Will You Be There? Join us in New York City June 23-25 for MongoDB World! The conference lineup includes Amazon CTO Werner Vogels and Cloudera Co-Founder Mike Olson for keynote addresses.  You’ll walk away with everything you need to know to build and manage modern applications. Register before April 4 to take advantage of super early bird pricing.

  • Upcoming Webinar: Practical Guide to SQL - NoSQL Migration. Avoid common pitfalls of NoSQL deployment with the best practices in this May 8 webinar with Anton Yazovskiy of Thumbtack Technology. He will review key questions to ask before migration, and differences in data modeling and architectural approaches. Finally, he will walk you through a typical application based on RDBMS and will migrate it to NoSQL step by step. Register for the webinar.

Cool Products and Services

  • PagerDuty helps operations and DevOps engineers resolve problems as quickly as possible. By aggregating errors from all your IT monitoring tools, and allowing easy on-call scheduling that ensures the right alerts reach the right people, PagerDuty increases uptime and reduces on-call burnout—so that you only wake up when you have to. Thousands of companies rely on PagerDuty, including Netflix, Etsy, Heroku, and Github.

  • GigOM Interviews Aerospike at Structure Data 2014 on Application Scalability. Aerospike Technical Marketing Director, Young Paik explains how you can add rocket fuel to your big data application by running the Aerospike database on top of Hadoop for lightning fast user-profile lookups. Watch this interview.

  • Couchbase: NoSQL and the Hybrid Cloud. If a NoSQL database can be deployed on-premise or it can be deployed in the cloud, why can’t it be deployed on-premise and in the cloud? It can, and it should. Read how in this article converting three use cases for hybrid cloud deployments of NoSQL databases: master / slave, cloud burst, and multi-master.

  • Do Continuous MapReduce on Live Data? ScaleOut Software's hServer was built to let you hold your daily business data in-memory, update it as it changes, and concurrently run continuous MapReduce tasks on it to analyze it in real-time. We call this "stateful" analysis. To learn more check out hServer.

  • LogicMonitor is the cloud-based IT performance monitoring solution that enables companies to easily and cost-effectively monitor their entire IT infrastructure stack – storage, servers, networks, applications, virtualization, and websites – from the cloud. No firewall changes needed - start monitoring in only 15 minutes utilizing customized dashboards, trending graphs & alerting.

  • BlueStripe FactFinder Express is the ultimate tool for server monitoring and solving performance problems. Monitor URL response times and see if the problem is the application, a back-end call, a disk, or OS resources.

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Click to read more ...


How Disqus Went Realtime with 165K Messages Per Second and Less than .2 Seconds Latency

How do you add realtime functionality to a web scale application? That's what Adam Hitchcock, a Software Engineer at Disqus talks about in an excellent talk: Making DISQUS Realtime (slides).

Disqus had to take their commenting system and add realtime capabilities to it. Not something that's easy to do when at the time of the talk (2013) they had had just hit a billion unique visitors a month.

What Disqus developed is a realtime commenting system called “realertime” that was tested to handle 1.5 million concurrently connected users, 45,000 new connections per second, 165,000 messages/second, with less than .2 seconds latency end-to-end.

The nature of a commenting system is that it is IO bound and has a high fanout, that is a comment comes in and must be sent out to a lot of readers. It's a problem very similar to what Twitter must solve

Disqus' solution was quite interesting as was the path to their solution. They tried different architectures but settled on a solution built on Python, Django, Nginx Push Stream Module, and Thoonk, all unified by a flexible pipeline architecture. In the process they we are able to substantially reduce their server count and easily handle high traffic loads.

At one point in the talk Adam asks if a pipelined architecture is a good one? For Disqus messages filtering through a series of transforms is a perfect match. And it's a very old idea. Unix System 5 has long had a Streams capability for creating flexible pipelines architectures. It's an incredibly flexible and powerful way of organizing code.

So let's see how Disqus evolved their realtime commenting architecture and created something both old and new in the process...

Click to read more ...


Stuff The Internet Says On Scalability For April 25th, 2014

Hey, it's HighScalability time:

New World Record BASE jumping from World's Tallest Building. #crazy
  • 30 billion: total Pinterest pins; 500,000,000: What'sApp users (700 million photos and 100 million videos every single day); 1 billion: Facebook active users on phones and tablets.
  • Quotable Quotes:
    • @jimplush: Google spent 2.3 billion on infrastructure in Q1. Remember that when you say you want to be "the Google of something"
    • Clay Shirky: I think one of the things that happened to the P2P market for infrastructure is that users preference for predictable pricing vs resource-sensitive pricing is so overwhelming that they will overpay to anyone who can promise flat prices. And because the logic of centralization vs decentralization is so price sensitive, I don't think there is any logical reason to assume a broadly stable class of apps, separate from current pricing data for energy, cycles, and storage.
    • @chipchilders: Stop freaking making new projects just for the sake of loose coupling. Damn it people.
    • Benedict Evans (paraphrased): A startup 15 years ago raised 10 million dollars, had 100 people, and a million users. Now you raise a million dollars, have 10 people, and a 100 million users.
    • @francesc: "Go was created for the cloud infrastructure, when we used to call it servers" - @rob_pike at #gophercon
    • @postwait: distributed system: an arbitrarily large state machine w/ "unknown" & "f*cked" states wherein you can't observe the movement between states.
    • @enneff: "In Ruby regular expressions are actually very fast... compared to all the other things you can do." --@derekcollison LOL #gophercon
    • Steve Jobs: This needs to be like magic. Go back, this isn’t magical enough!
    • @jamesurquhart: The complexity isn’t in the tech, it is in the interconnected apps and comps in systems. Managing interconnectedness is managing complexity.
    • Alex Pentland: Put another way, social physics is about how human behavior is driven by the exchange of ideas—how people cooperate to discover, select, and learn strategies and coordinate their actions—rather than how markets are driven by the exchange of money.

  • Steve Jobs with the carrot: This is important, this needs to happen, and you do it. And now this stick: Guess what, you’re Margaret from now on. 

  • A fantastic look at Uplink Latency of WiFi and 4G Networks by Ilya Grigorik: WiFi can deliver low latency first hop if the network is mostly idle. By contrast, 4G networks require coordination between the device and the radio tower for each uplink transfer. First off, latency aside, and regardless of wireless technology, consider the energy costs of your network transfers! Periodic transfers incur high energy overhead due to the need to wake up the radio on each transmission. Second, same periodic transfers also incur high uplink coordination overhead - 4G in particular. In short, don't trickle data. Aggregate your network requests and fire them in one batch: you will reduce energy costs and reduce latency by amortizing scheduling overhead.

  • It's like Sherlock for programmers. How to detect bank loan fraud with graphs : part 2. A fun way to use graph databases, finding criminals by analyzing patterns using graph algorithms. Also, Building a Graph-based Analytics Platform: Part I

  • Cache Invalidation Strategies With Varnish Cache. Good explanation of different techniques: purging, bans, tagging, grace, TTL. It covers the issue of distributing cache invalidations to multiple caches, but it doesn't seem fault tolerant. 

  • To be smarter the brain had to get more social.  A general pattern for intelligence at different scales? Finding turns neuroanatomy on its head: Researchers present new view of myelin: The fact that it is the most evolved neurons, the ones that have expanded dramatically in humans, suggests that what we're seeing might be the "future." As neuronal diversity increases and the brain needs to process more and more complex information, neurons change the way they use myelin to "achieve" more.  It is possible that these profiles of myelination may be giving neurons an opportunity to branch out and 'talk' to neighboring neurons. These long myelin gaps may be needed to increase neuronal communication and synchronize responses across different neurons. < For more on the amazing ways human social networks improve problem solving take a look at Social Physics: How Good Ideas Spread-The Lessons from a New Science.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so keep on going)...

Click to read more ...


Here's a 1300 Year Old Solution to Resilience - Rebuild, Rebuild, Rebuild

How is it possible that a wooden Shinto shrine built in the 7th century is still standing? The answer depends on how you answer this philosophical head scratcher: With nearly every cell in your body continually being replaced, are you still the same person?

The Ise Grand Shrine has been in continuous existence for over 1300 years because every twenty years an exact replica has been rebuilt on an adjacent footprint. The former temple is then dismantled.

Now that's resilience. If you want something to last make it a living part of a culture. It's not so much the building that is remade, what is rebuilt and passed down from generation to generation is the meme that the shrine is important and worth preserving. The rest is an unfolding of that imperative.

You can see echoes of this same process in Open Source projects like Linux and the libraries and frameworks that get themselves reconstructed in each new environment.

The patterns of recurrence in software are the result of Darwinian selection process that keeps simplicity and value alive in human minds. 

A blog post on Persuing Wabi has some fabulous photos of the shrine along with a brief description of why it's the way it is:

Click to read more ...


This is why Microsoft won. And why they lost.

My favorite kind of histories are those told from an insider's perspective. The story of Richard the Lionheart is full of great battles and dynastic intrigue. The story of one of his soldiers, not so much. Yet the soldiers' story, as someone who has experienced the real consequences of decisions made and actions taken, is more revealing.

We get such a history in Chat Wars, a wonderful article written by David Auerbach, who in 1998 worked at Microsoft on MSN Messenger Service, Microsoft’s instant messaging app (for a related story see The Rise and Fall of AIM, the Breakthrough AOL Never Wanted).

It's as if Herodotus visited Microsoft and wrote down his experiences. It has that same sort of conversational tone, insightful on-the-ground observations, and facts no outsider might ever believe.

Much of the article is a play-by-play account of the cat and mouse game David plays changing Messenger to track AOL's Instant Messenger protocol changes. AOL repeatedly tried to make it so Messenger could not interoperate with AIM and each time Messenger countered with changes of their own. AOL finally won the game with a radical and unexpected play. A great read for programmers. 

For a general audience David's explanation of how and why Microsoft came to dominance and why they lost that dominance is most revealing. It stares directly into the heart of the entropy that brings everything down in the end.

Why Microsoft Won 

Click to read more ...


Stuff The Internet Says On Scalability For April 18th, 2014

Hey, it's HighScalability time:

Scaling to the top of "Bun Mountain" in Hong Kong
  • 44 trillion gigabytes: size of the digital universe by 2020; 6 Times: we have six times more "stuff" than the generation before us.
  • Quotable Quotes:
    • Facebook: Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.
    • @windley: The problem with the Internet of Things is right now it’s more like the CompuServe of Things
    • Chip Overclock: If you want to eventually generate revenue, you must first optimize for developer productivity; everything else is negotiable.
    • @igrigorik: if you are gzipping your images.. you're doing it wrong:  - check your server configs! and your CDN... :)
    • Seth Lloyd: The arrow of time is an arrow of increasing correlations.
    • @kitmacgillivray: When will Google enable / require all android apps to have full deep search integration so all content is visible to the engine?
    • Neal Ford: Yesterday's best practices become tomorrow's anti-patterns.
    • Rüdiger Möller: just made a quick sum up of concurrency related issues we had in a 7-15 developer team soft realtime application (clustered inmemory data grid + GUI front end). 95% of threads created are not to improve throughput but to avoid blocking (e.g. IO, don't block messaging receiver thread, avoid blocking the event thread in a GUI app, ..).
    • Ansible: When done correctly, automation tools are giving them time back -- and helping out of this problem of needing to wear many hats.

  • Amazon and Google are in an epic battle to dominate the cloud—and Amazon may already have won: If Amazon’s entire public cloud were a single computer, it would have five times more capacity than those of its next biggest 14 competitors—including Google—combined. Every day, one-third of people who use the internet visit a site or use a service running on Amazon’s cloud.

  • What books would you select to help sustain or rebuild civilization? Here's Neal Stephenson’s list. He was too shy to do so, but I would certainly add his books to my list. 

  • 12-Step Program for Scaling Web Applications on PostgreSQL. Great write up of lessons learned by that they used to sustain 10s of thousand of concurrent users at 3K req/sec. Highly detailed. 1) Add more cache, 2) Optimize SQL, 3) Upgrade Hardware and RAM, 4) Scale reads by replication, 5) Use more appropriate tools, 6) Move write heave table out, 7) Tune Postures and your File System, 8) Buffer and serialize frequent updates, 9) Optimize DB Schema, 10) Shard busy tables vertically, 11) Wrap busy tables with services, 12) Shard services backend horizontally.

  • Is this a soap opera? It turns out Google and not Facebook is buying Titan Aerospace, makers of high flying solar powered drones. Google's fiber network would make a great backbone network for a drone and loon powered wireless network, completely routing around the telcos.

  • Building Carousel, Part I: How we made our networked mobile app feel fast and local. Dropbox changes to an eventualy consistent / optimistic replication syncing model to make their app "feel fast, responsive, and local, even though the data on which users operate is ultimately backed by the Dropbox servers."  Lesson: don’t block the user! Instead of requiring changes to be propagated to the server synchronously. Local and remote photos are treated as equivalent objects. Actions take effect immediately locally then work there way out globally. Changes are used to stay consistent. A fast hash is used to tell what photos have not been backed up to dropbox. Remote operations happen asynchronously.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...


Six Lessons Learned the Hard Way About Scaling a Million User System 

Ever come to a point where you feel you've learned enough to share your experiences in the hopes of helping others traveling the same road? That's what Martin Kleppmann has done in an lovingly written Six things I wish we had known about scaling, an article well worth your time.

It's not advice about scaling a Twitter, but of building a million user system, which is the sweet spot for a lot of projects. His conclusion rings true:

Building scalable systems is not all sexy roflscale fun. It’s a lot of plumbing and yak shaving. A lot of hacking together tools that really ought to exist already, but all the open source solutions out there are too bad (and yours ends up bad too, but at least it solves your particular problem).

Here's a gloss on the six lessons (plus a bonus lesson):

Click to read more ...