advertise
Monday
Mar312014

How WhatsApp Grew to Nearly 500 Million Users, 11,000 cores, and 70 Million Messages a Second

When we last visited WhatsApp they’d just been acquired by Facebook for $19 billion. We learned about their early architecture, which centered around a maniacal focus on optimizing Erlang into handling 2 million connections a server, working on All The Phones, and making users happy through simplicity.

Two years later traffic has grown 10x. How did WhatsApp make that jump to the next level of scalability?

Rick Reed tells us in a talk he gave at the Erlang Factory: That's 'Billion' with a 'B': Scaling to the next level at WhatsApp (slides), which revealed some eye popping WhatsApp stats:

What has hundreds of nodes, thousands of cores, hundreds of terabytes of RAM, and hopes to serve the billions of smartphones that will soon be a reality around the globe? The Erlang/FreeBSD-based server infrastructure at WhatsApp. We've faced many challenges in meeting the ever-growing demand for our messaging services, but as we continue to push the envelope on size (>8000 cores) and speed (>70M Erlang messages per second) of our serving system.

What are some of the most notable changes from two years ago?

  • Obviously much bigger in every dimension, except the number of engineers. More boxes, more datacenters, more memory, more users, and more scale problems. Handling this level of growth with so few engineers is what Rick is most proud of: 40 million users per engineer. This is part of the win of the cloud. Their engineers work on their software. The network, hardware, and datacenter is handled by someone else.

  • They’ve gone away from trying to support as many connections per box as possible because of the need to have enough head room to handle the overall increased load on each box. Though their general strategy of keeping down management overhead by getting really big boxes and running efficiently on SMP machines, remains the same.

  • Transience is nice. With multimedia, pictures, text, voice, video all being part of their architecture now, not having to store all these assets for the long term simplifies the system greatly. The architecture can revolve around throughput, caching, and partitioning.

  • Erlang is its own world. Listening to the talk it became clear how much of everything you do is in the world view of Erlang, which can be quite disorienting. Though in the end it’s a distributed system and all the issues are the same as in any other distributed system.

  • Mnesia, the Erlang database, seemed to be a big source of problem at their scale. It made me wonder if some other database might be more appropriate and if the need to stay within the Erlang family of solutions can be a bit blinding?

  • Lots of problems related to scale as you might imagine. Problems with flapping connections, queues getting so long they delay high priority operations, flapping of timers, code that worked just fine at one traffic level breaking badly at higher traffic levels, high priority messages not getting serviced under high load, operations blocking other operations in unexpected ways, failures causing resources issues, and so on. These things just happen and have to be worked through no matter what system you are using.

  • I remain stunned and amazed at Rick’s ability to track down and fix problems. Quite impressive.

Rick always gives a good talk. He’s very generous with specific details that obviously derive directly from issues experienced in production. Here’s my gloss on his talk…

Stats

Click to read more ...

Friday
Mar282014

Stuff The Internet Says On Scalability For March 28th, 2014

Hey, it's HighScalability time:


Looks like a multiverse, if you can keep it.
  • Quotable Quotes:
    • @abt_programming: "I am a Unix Creationist. I believe the world was created on January 1, 1970 and as prophesized, will end on January 19, 2038" - @teropa
    • @demisbellot: Cloud prices are hitting attractive price points, one more 40-50% drop and there'd be little reason to go it alone.
    • @scott4arrows: Dentist "Do you floss regularly?" Me "Do you back up your computer data regularly?"
    • @avestal: "I Kickstarted the Oculus Rift, what do I get?" You get a lesson in how capitalism works.
    • @mtabini: “$20 charger that cost $2 to make.” Not pictured here: the $14 you pay for the 10,000 charger iterations that never made it to production.
    • @strlen: "I built the original assembler in JS, because it's what I prefer to use when I need to get down to bare metal." - Adm. Grace Hopper
    • tedchs: I'd like to propose a new rule for Hacker News: only if you have built the thing you're saying someone should save money by building themselves, may you say the person should build that thing.
    • lamby: Bezos predicted they would be good over the long term but said that he didn’t want to repeat “Steve Jobs’s mistake” of pricing the iPhone in a way that was so fantastically profitable that the smartphone market became a magnet for competition.
    • @PariseauTT: I feel a Netflix case study coming...everybody get your drinks ready...#AWSSummit
    • seanmccann: That's no different than startups of the past having to pay thousands to millions of dollars to setup servers. These days those same servers can be setup in minutes for a fraction of the cost. Timing is everything. Sometimes the current market pricing for a commodity is too expensive to make your business viable today but in the future that will not be the case. Just depends how long that will take. 
    • Petyr 'Littlefinger' Baelish: Chaos isn't a pit. Chaos is a ladder. Many who try to climb it fail and never get to try again. The fall breaks them. And some, are given a chance to climb. They refuse, they cling to the realm or the gods or love. Illusions. Only the ladder is real. The climb is all there is.
  • We turn those who first climb the peaks of tall mountains into heros. But what of the mountain? Mariana Mazzucato The Entrepreneurial State: Debunking Private vs. Public Sector Myths: Every feature of the iPhone was created, originally, by multi-decade government-funded research. From DARPA came the microchip, the Internet, the micro hard drive, the DRAM cache, and Siri. From the Department of Defense came GPS, cellular technology, signal compression, and parts of the liquid crystal display and multi-touch screen (joining funding from the CIA, the National Science Foundation, and the Department of Energy, which, by the way, developed the lithium-ion battery.) CERN in Europe created the Web. Steve Jobs’ contribution was to integrate all of them beautifully.

  • This is not an April Fools' joke, as the name might make a certain sort of mind consider. WebScaleSQL: MySQL goes web scale with contributions from MySQL engineering teams at Facebook, Google, LinkedIn, and Twitter. It includes lots of good work on the test suite, performance enhancements, and features to make scaling easier. So many forks at the table (MariaDB, Percona, webscale).

  • This is why we can't have nice code. Martin Sústrik argues In the Defense of Spaghetti Code, quite successfully I think, that it's often clearer to have one large 1500 line function than it is to have a refactored great big ball of mud. Commenters argue from a perfect world stance that if you do X from the start and then continue the same practices over the years and and hundreds of programmers then code can be perfected. Not so. The nature of many domains is they are just messy. At a certain level of abstraction messiness can be hidden, but when you are in the guts of a thing, messiness is preserved. 

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...

Thursday
Mar272014

Strategy: Cache Stored Procedure Results

Caching is not new of course, but I don't think I've heard of caching store procedure results before. It's like memoization in the database. Brent Ozar covers this idea in How to Cache Stored Procedure Results.

The benefits are the usual for doing work in the database, it doesn't take per developer per app work, just code it once in the stored proc and it works for everyone, everywhere, for all of time. The disadvantage is the usual as well, it adds extra load to a probably already busy database, so it should only be applied to heavy computations.

Brent positions this strategy as an emergency bandaid to apply when you need to take pressure off a database now. Developers can then work on moving the cache off the database and into its own tier. Interesting idea. And as the comments show the implementation is never as simple as it seems.

Wednesday
Mar262014

Oculus Causes a Rift, but the Facebook Deal Will Avoid a Scaling Crisis for Virtual Reality

Facebook has been teasing us. While many of their recent acquisitions have been surprising, shocking is the only word adequately describing Facebook's 5 day whirlwind acquisition of Oculus, immersive virtual reality visionaries, for a now paltry sounding $2 billion.

The backlash is a pandemic, jumping across social networks with the speed only a meme powered by the directly unaffected can generate.

For more than 30 years VR has been the dream burning in the heart of every science fiction fan. Now that this future might finally be here, Facebook’s ownage makes it seem like a wonderful and hopeful timeline has been choked off, killing the Metaverse before it even had a chance to begin.

For the many who voted for an open future with their Kickstarter dollars, there’s a deep and personal sense of betrayal, despite Facebook’s promise to leave Oculus alone. The intensity of the reaction is because Oculus matters to people. It's new, it's different, it creates a better future. It's important in a way sending messages or taking pictures never can be.

Let’s use Andy Baio as a sane example of the loyal opposition:

I can palpably feel the oxygen sucked out of the room. Infinite bright possibilities from indie game devs, quietly shelved.

Jigsus backs this up on reddit:

Within the last hour EVERY friend I know was developing a rift game has canceled. That's around 11 projects just gone.

So WTF? Why in this reality would Oculus sell to Facebook? At first blush it makes little sense. A more mismatched couple could hardly be created in a fantasy immersive 3D game.

Yet there's a reason and a method here that's not only about about a sweet payoff. When you look deeper at the Facebook deal it’s about Oculus' existential need to scale. And scale is something Facebook is very, very good at.

Let’s explore why this deal makes more sense for Oculus than it might first appear and the central role scalability plays in the decision making...

Click to read more ...

Monday
Mar242014

Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square

This is a guest repost by Pete Soderling, Founder at Hakka Labs, creating a community where software engineers come to grow.

In response to a recent post from MongoHQ entitled “You don’t have big data," I would generally agree with many of the author’s points.

However, regardless of whether you call it big data, small data, hot data or cold data - we are all in a position to admit that *more* data is here to stay - and that’s due to many different factors.

Perhaps primarily, as the article mentions, this is due to the decreasing cost of storage over time. Other factors include access to open APIs, the sheer volume of ever-increasing consumer activity online, as well as a plethora of other incentives that are developing (mostly) behind the scenes as companies “share” data with each other. (You know they do this, right?)

But one of the most important things I’ve learned over the past couple of years is that it’s crucial for forward thinking companies to start to design more robust data pipelines in order to collect, aggregate and process their ever-increasing volumes of data. The main reason for this is to be able to tee up the data in a consistent way for the seemingly-magical quant-like operations that infer relationships between the data that would have otherwise surely gone unnoticed - ingeniously described in the referenced article as correctly “determining the nature of needles from a needle-stack.”

But this raises the question - what are the characteristics of a well-designed data pipeline? Can’t you just throw all your data in Hadoop and call it a day?

As many engineers are discovering - the answer is a resounding "no!" We've rounded up four examples from smart engineers at Stripe, Tapad, Etsy & Square that show aspects of some real-world data pipelines you'll actually see in the wild.

How does Stripe do it?

Click to read more ...

Friday
Mar212014

Stuff The Internet Says On Scalability For March 21st, 2014

Hey, it's HighScalability time:

  • Quotable Quotes:
    • Chris Anderson: Petabytes allow us to say: ‘Correlation is enough.’
    • @adron: Back to writing the micro-services for my composite app with regionally distributed, highly available, key value crypto stores of unicorns!
    • DevOps Cafe: when the canary dies you don't buy a stronger canary.
    • Mark Pagel: Creativity, like evolution, is merely a series of thefts.
    • GitHub: Whoever has more capacity wins.
    • The Master: We balance probabilities and choose the most likely. It is the scientific use of the imagination.
    • Jonathan Ive: Steve, and I don’t recognize my friend in much of it. Yes, he had a surgically precise opinion. Yes, it could sting. Yes, he constantly questioned. ‘Is this good enough? Is this right?’ but he was so clever. His ideas were bold and magnificent. They could suck the air from the room. And when the ideas didn’t come, he decided to believe we would eventually make something great. And, oh, the joy of getting there!”
    • @joestump: Turns out the correct amount of RAM is always 16GB more than you have.

  • From a spec of sand does a pearl grow. In this case the oyster is Facebook, the irritant is PHP, and the pearl is Hack, a new improved version of PHP developed by Facebook. Hack now runs all of Facebook's front-end code. Lots of great comments on Hacker News and on Reddit with much lang on PHP hate. Realize Facebook is not trying to build the next perfect language. Historically they coded their web tier in PHP so they have a lot of code working in production. It would be insane to throw that all away. Over the years they've put a lot of work into making PHP more efficient. Hack continues that work by making PHP a better language, coupling PHP's convenient interpreted nature with static typing and many other features commonly found in modern programming languages. If PHP bothers you, Hack is not for you, but that's OK.

  • Dish Jump-Starts the Hopper With 8 Tuners and PS4 Support. Ultimate in local caching. Record all the things and then watch what you want later. Streaming access sucks. More local storage!

  • Cassandra Hits One Million Writes Per Second on Google Compute Engine on 300 VMs with 300 1TB Persistent Disk volumes, 15,000 clients, median latency of 10.3 ms and 95% completing under 23 ms. It took 1 hour and 10 minutes at a cost of $330 USD. All benchmark caveats apply. They wrote small records of 170 bytes. What happens when you write a large range of data sizes? When you are reading? Are you running queries? Are you backing up? Are indexes being updated? Is there contention? Etc, etc. Still, it's impressive.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...

Thursday
Mar202014

Paper: Log-structured Memory for DRAM-based Storage - High Memory Utilization Plus High Performance

Most every programmer who gets sucked into deep performance analysis for long running processes eventually realizes memory allocation is the heart of evil at the center of many of their problems. So you replace malloc with something less worse. Or you tune your garbage collector like a fine ukulele. But there's a smarter approach brought to you from the folks at RAMCloud, a Stanford University production, which is a large scale, distributed, in-memory key-value database.

What they've found is that typical memory management approaches don't work and using a log structured approach yields massive benefits:

Performance measurements of log-structured memory in RAMCloud show that it enables high client through- put at 80-90% memory utilization, even with artificially stressful workloads. In the most stressful workload, a single RAMCloud server can support 270,000-410,000 durable 100-byte writes per second at 90% memory utilization. The two-level approach to cleaning improves performance by up to 6x over a single-level approach at high memory utilization, and reduces disk bandwidth overhead by 7-87x for medium-sized objects (1 to 10 KB). Parallel cleaning effectively hides the cost of cleaning: an active cleaner adds only about 2% to the latency of typical client write requests.

And for your edification they've written an excellent paper on their findings. Log-structured Memory for DRAM-based Storage:

Click to read more ...

Wednesday
Mar192014

Strategy: Three Techniques to Survive Traffic Surges by Quickly Scaling Your Site

Matthew Might, as a first responder to a surprise traffic surge on his inexpensive linode hosted blog, took emergency steps that you might find useful in a similar situation:

  1. Find the bottleneck. Reloading the page in firebug showed the first page took 24 seconds to load and after that everything else loaded quickly. In retrospect this burst meant the site was thread limited as the CPU was idle.
  2. Cut image sizes in half with a shell script using ImageMagick's convert. Load time is now 12 seconds.
  3. Turn dynamic content into static content using a static index.html file  copied using the browser's "view source" feature. Load time is now 6 seconds.
  4. Added threads to the Apache configuration file. Load time is now 2 seconds. Crises averted.

Because of this quick thinking and quick action the patient survived to serve pages another day.

And in fine post-mortem tradition some of the future changes are: 

Click to read more ...

Tuesday
Mar182014

Sponsored Post: Airseed, Uber, ScaleOut Software, Couchbase, Tokutek, Logentries, Booking, Apple, MongoDB, BlueStripe, AiScaler, Aerospike, LogicMonitor, AppDynamics, ManageEngine, Site24x7  

 

Who's Hiring?

  • Apple is hiring for multiple positions. Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly.  
    • Senior Engineer. We are looking for a team player with focus on designing and developing WWDR’s web-based applications. The successful candidate must have the ability to take minimal business requirements and work pro-actively with cross functional teams to obtain clear objectives that drive projects forward to completion. Please apply here.
    • Software Engineer. We are looking for a team player with focus on designing and developing WWDR’s web-based applications. The successful candidate must have the ability to take minimal business requirements and work pro-actively with cross functional teams to obtain clear objectives that drive projects forward to completion. Please apply here.
    • Quality Assurance Engineer. The iOS Systems team is looking for a Quality Assurance engineer. In this role you will be expected to work hand-in-hand with the software engineering team to find and diagnose software defects. Please apply here.

  • Airseed -- a Google Ventures backed, developer platform that powers single sign-on authentication, rich consumer data, and analytics -- is hiring lead backend and fullstack engineers (employees #4, 5, 6). More info here: https://www.airseed.com/jobs

  • Join the team that scales Uber supply globally! Our supply engineering team is responsible for prototyping, building, and maintaining the partner-facing platform. We're looking for experienced back-end developers who care about developing highly scalable services. Apply at https://www.uber.com/jobs/4810.

  • We need awesome people @ Booking.com - We want YOU! Come design next
    generation interfaces, solve critical scalability problems, and hack on one of the largest Perl codebases. Apply: http://www.booking.com/jobs.en-us.html

  • UI EngineerAppDynamics, founded in 2008 and lead by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop our their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big DataAppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for a Software Engineers (All-Levels) to design and develop scalable software written in Java and MySQL for backend component of software that manages application architectures. Apply here.

Fun and Informative Events

  • The Biggest MongoDB Event Ever Is On. Will You Be There? Join us in New York City June 23-25 for MongoDB World! The conference lineup includes Amazon CTO Werner Vogels and Cloudera Co-Founder Mike Olson for keynote addresses.  You’ll walk away with everything you need to know to build and manage modern applications. Register before April 4 to take advantage of super early bird pricing.

  • How to Scale MySQL for Big Data Applications. A Guide to Evaluating TokuDB on March 20th at 1pm ET. You can do more than you think with the MySQL you already have. Learn how to use MySQL or MariaDB in Big Data applications by simply upgrading the storage engine with TokuDB. Register now.

  • April 3 Webinar: The BlueKai Playbook for Scaling to 10 Trillion Transactions a Month. As the industry’s largest online data exchange, BlueKai knows a thing or two about pushing the limits of scale. Find out how they are processing up to 10 trillion transactions per month from Vice President of Data Delivery, Ted Wallace. Register today.

Cool Products and Services

  • As one of the fastest growing VoIP services in the world Viber has replaced MongoDB with Couchbase Server, supporting 100,000+ operations per second in the short term and 1,000,000+ operations per second in the long term for their third generation architecture.  See the full story on the Viber switch.

  • Do Continuous MapReduce on Live Data? ScaleOut Software's hServer was built to let you hold your daily business data in-memory, update it as it changes, and concurrently run continuous MapReduce tasks on it to analyze it in real-time. We call this "stateful" analysis. To learn more check out hServer.

  • Log management made easy with Logentries Billions of log events analyzed every day to unlock insights from the log data the matters to you. Simply powerful search, tagging, alerts, live tail and more for all of your log data. Automated AWS log collection and analytics, including CloudWatch events. 

  • LogicMonitor is the cloud-based IT performance monitoring solution that enables companies to easily and cost-effectively monitor their entire IT infrastructure stack – storage, servers, networks, applications, virtualization, and websites – from the cloud. No firewall changes needed - start monitoring in only 15 minutes utilizing customized dashboards, trending graphs & alerting.

  • BlueStripe FactFinder Express is the ultimate tool for server monitoring and solving performance problems. Monitor URL response times and see if the problem is the application, a back-end call, a disk, or OS resources.

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.  http://aiscaler.com/

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Click to read more ...

Monday
Mar172014

Intuitively Showing How To Scale a Web Application Using a Coffee Shop as an Example

This is a guest repost by Sriram Devadas, Engineer at Vistaprint, Web platform group. A fun and well written analogy of how to scale web applications using a familiar coffee shop as an example. No coffee was harmed during the making of this post.

I own a small coffee shop.

My expense is proportional to resources
100 square feet of built up area with utilities, 1 barista, 1 cup coffee maker.

My shop's capacity
Serves 1 customer at a time, takes 3 minutes to brew a cup of coffee, a total of 5 minutes to serve a customer.

Since my barista works without breaks and the German made coffee maker never breaks down,
my shop's maximum throughput = 12 customers per hour.

 

Web server

Customers walk away during peak hours. We only serve one customer at a time. There is no space to wait.

I upgrade shop. My new setup is better!

Expenses
Same area and utilities, 3 baristas, 2 cup coffee maker, 2 chairs

Capacity
3 minutes to brew 2 cups of coffee, ~7 minutes to serve 3 concurrent customers, 2 additional customers can wait in queue on chairs.

Concurrent customers = 3, Customer capacity = 5

 

Scaling vertically

Business is booming. Time to upgrade again. Bigger is better!...

Click to read more ...