How HipChat Stores and Indexes Billions of Messages Using ElasticSearch and Redis

This article is from an interview with Zuhaib Siddique, a production engineer at HipChat, makers of group chat and IM for teams.

HipChat started in an unusual space, one you might not think would have much promise: enterprise group messaging. But as we are learning, there is gold in them there enterprise hills, which is why Atlassian, makers of well-regarded tools like JIRA and Confluence, acquired HipChat in 2012.

And in a tale not often heard, the resources and connections of a larger parent have actually helped HipChat enter an exponential growth cycle. Having reached the 1.2 billion message storage mark they are now doubling the number of messages sent, stored, and indexed every few months.

That kind of growth puts a lot of pressure on a once adequate infrastructure. HipChat exhibited a common scaling pattern. Start simple, experience traffic spikes, and then think what do we do now? Using bigger computers is usually the first and best answer. And they did that. That gave them some breathing room to figure out what to do next. On AWS, after a certain inflection point, you start going Cloud Native, that is, scaling horizontally. And that’s what they did.

But there's a twist to the story. Security concerns have driven the development of an on-premises version of HipChat in addition to its cloud/SaaS version. We'll talk more about this interesting development in a post later this week.

While HipChat isn’t Google scale, there is good stuff to learn from HipChat about how they index and search billions of messages in a timely manner, which is the key difference between something like IRC and HipChat. Indexing and storing messages under load while not losing messages is the challenge.

This is the road that HipChat took. What can you learn? Let’s see…

Click to read more ...


Stuff The Internet Says On Scalability For January 3rd, 2014

Hey, it's HighScalability time, can you handle the truth?

Should software architectures include parasites? They increase diversity and complexity in the food web.
  • 10 Million: classic hockey stick growth pattern for GitHub repositories
  • Quotable Quotes:
    • Seymour Cray: A supercomputer is a device for turning compute-bound problems into IO-bound problems.
    • Robert Sapolsky: And why is self-organization so beautiful to my atheistic self? Because if complex, adaptive systems don’t require a blue print, they don’t require a blue print maker. If they don’t require lightning bolts, they don’t require Someone hurtling lightning bolts.
    • @swardley: Asked for a history of PaaS? From memory, public launch - Zimki ('06), BungeeLabs ('06), Heroku ('07), GAE ('08), CloudFoundry ('11) ...
    • @neil_conway: If you're designing scalable systems, you should understand backpressure and build mechanisms to support it.
    • Scott Aaronson...the brain is not a quantum computer. A quantum computer is good at factoring integers, discrete logarithms, simulating quantum physics, modest speedups for some combinatorial algorithms, none of these have obvious survival value. The things we are good at are not the same thing quantum computers are good at.
    • @rbranson: Scaling down is way cooler than scaling up.
    • @rbranson: The i2 EC2 instances are a huge deal. Instagram could have put off sharding for 6 more months, would have had 3x the staff to do it.
    • @mraleph: often devs still approach performance of JS code as if they are riding a horse cart but the horse had long been replaced with fusion reactor
  • Now we know the cost of bandwidth: Netflix’s new plan: save a buck with SD-only streaming
  • Massively interesting Stack Overflow thread on Why is processing a sorted array faster than an unsorted array? Compilers may grant a hidden boon or turn traitor with a deep deceit. How do you tell? It's about branch prediction.
  • Can your database scale to 1000 cores? Nope. Concurrency Control in the Many-core Era: Scalability and Limitations: We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from ground up and is tightly coupled with the hardware.
  • Not all SSDs are created equal. Power-Loss-Protected SSDs Tested: Only Intel S3500 Passes. With a follow up. If data on your SSD can't survive a power outage it ain't worth a damn.
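
The backpressure point in the @neil_conway quote above can be made concrete with a small sketch. This is only one simple form of backpressure, a bounded queue that refuses new work when full so the producer has to slow down, retry, or shed load. The class name `BoundedInbox` and its methods are hypothetical names made up for this illustration; only Python's standard `queue` module is used.

```python
import queue


class BoundedInbox:
    """Hypothetical bounded inbox: rejects work when full instead of growing."""

    def __init__(self, capacity):
        # A fixed capacity is what creates the backpressure signal.
        self._q = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, item):
        """Try to enqueue without blocking; return False if the inbox is full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            # The caller sees the refusal and can back off, retry, or drop.
            self.rejected += 1
            return False

    def drain(self, n):
        """Remove up to n items in FIFO order, simulating a consumer catching up."""
        out = []
        for _ in range(n):
            try:
                out.append(self._q.get_nowait())
            except queue.Empty:
                break
        return out


inbox = BoundedInbox(capacity=3)
# Five offers against a capacity of three: the last two are refused.
accepted = [inbox.offer(i) for i in range(5)]
```

Once the consumer drains some items, `offer` starts succeeding again. The design choice is that overload becomes visible to the sender right away, rather than showing up later as unbounded memory growth.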

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...


xkcd: How Standards Proliferate:

The great thing about standards is there are so many to choose from. What is it about human nature that makes this so recognizably true?


Paper: Nanocubes: Nanocubes for Real-Time Exploration of Spatiotemporal Datasets

How do you turn Big Data into fast, useful, and interesting visualizations? By using R and a technology called Nanocubes. The visualizations are stunning and amazingly reactive, and almost as interesting as the technologies behind them.

David Smith wrote a great article explaining the technology and showing a demo by Simon Urbanek of a visualization that uses 32TB of Twitter data. It runs smoothly and interactively on a single machine with 16GB of RAM. For more information and demos go to

David Smith sums it up nicely:

Despite the massive number of data points and the beauty and complexity of the real-time data visualization, it runs impressively quickly. The underlying data structure is based on Nanocubes, a fast datastructure for in-memory data cubes. The basic idea is that nanocubes aggregate data hierarchically, so that as you zoom in and out of the interactive application, one pixel on the screen is mapped to just one data point, aggregated from the many that sit "behind" that pixel. Learn more about nanocubes, and try out the application yourself (modern browser required) at the link below.

Abstract from Nanocubes for Real-Time Exploration of Spatiotemporal Datasets:

Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Many relational databases implement the well-known data cube aggregation operation, which in a sense precomputes every possible aggregate query over the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk storage. In contrast, we show how to construct a data cube that fits in a modern laptop’s main memory, even for billions of entries; we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets, and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are dominated by network and user-interaction latencies.
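
The hierarchical aggregation idea, where one screen pixel maps to a single pre-aggregated value, can be sketched in a few lines. This is not the nanocube data structure from the paper, which also handles time and other attributes and shares storage across levels. It is a deliberately simplified count pyramid over 2D points, assuming coordinates normalized to the range [0, 1). The names `build_pyramid` and `tile_count` are hypothetical.

```python
from collections import Counter


def build_pyramid(points, max_level):
    """Pre-aggregate point counts per tile at every zoom level.

    points: iterable of (x, y) pairs with 0 <= x, y < 1 (assumed normalized).
    Returns a list where pyramid[z] maps (col, row) -> count on a
    2**z by 2**z grid.
    """
    pyramid = [Counter() for _ in range(max_level + 1)]
    for x, y in points:
        for z in range(max_level + 1):
            n = 1 << z  # number of tiles per side at this zoom level
            pyramid[z][(int(x * n), int(y * n))] += 1
    return pyramid


def tile_count(pyramid, z, col, row):
    """Look up the aggregate for one tile; empty tiles count as 0."""
    return pyramid[z][(col, row)]


# Made-up sample data for illustration only.
points = [(0.1, 0.1), (0.2, 0.3), (0.6, 0.1), (0.9, 0.9), (0.95, 0.8)]
pyramid = build_pyramid(points, max_level=2)
```

Zooming out reads a coarse level (`tile_count(pyramid, 0, 0, 0)` covers all five points), while zooming in reads a finer level for the same region. Each lookup is a dictionary access rather than a scan over the raw points, which is the property that keeps interaction fast. The trade-off is memory, since every level stores its own counts.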

Sponsored Post: Netflix, Logentries, Host Color, Booking, Spokeo, Apple, ScaleOut, MongoDB, BlueStripe, AiScaler, Aerospike, New Relic, LogicMonitor, AppDynamics, ManageEngine, Site24x7

Who's Hiring?

  • Apple is hiring for multiple positions. Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly.
    • Quality Assurance Engineer. The iOS Systems team is looking for a Quality Assurance engineer. In this role you will be expected to work hand-in-hand with the software engineering team to find and diagnose software defects. Please apply here.
    • Sr Software Engineer iPhone. Do you love building highly scalable, distributed web applications? Does the idea of a fast-paced environment make your heart leap? Do you want your technical abilities to be challenged every day, and for your work to make a difference in the lives of millions of people? Please apply here.
    • Sr Software Engineer. The iOS Systems Team is looking for a Software Engineer to work on operations, tools development and support of worldwide iOS Device sales and activations. Please apply here.
    • Sr. Security Software Developer. We are looking for an excellent programmer who's done extensive security programming. This individual will participate in various security projects from the start to the end. In addition to security concepts, it's important to have intricate knowledge of different flavors of Unix operating systems to develop code that's compact and optimal. Familiarity with key exchange protocols, authentication protocols and crypto analysis is a plus. Please apply here.

  • The Netflix Cloud Performance Team is hiring. Help tackle the more complex scalability challenges emerging on the cloud today, wrangling tens of thousands of instances, handling billions of requests a day. We are searching for a Senior Performance Architect and a Senior Cloud Performance Tools Engineer.

  • We need awesome people @ - We want YOU! Come design next generation interfaces, solve critical scalability problems, and hack on one of the largest Perl codebases. Apply:

  • Spokeo is hiring a Senior Backend Developer. We've spent years agonizing over the best way to construct an elegant, simplistic, yet highly powerful people search engine. Spokeo deals with problems of ginormous scale, so a strong understanding and appreciation for algorithms and efficiency is desired. Please apply here.

  • Spokeo is hiring a Senior Software Developer - Web Applications. Build features that involve any of our products, from the universal people search interface and functionality to the construction of family trees to a portal that connects customers to employees and employer data. Please apply here.

  • UI Engineer. AppDynamics, founded in 2008 and led by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big Data. AppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for Software Engineers (All-Levels) to design and develop scalable software written in Java and MySQL for the backend component of software that manages application architectures. Apply here.

  • New Relic is looking for a Java Instrumentation Engineer, Java Scalability Engineer,  Distributed Systems Engineer and Android app engineer in Portland, OR. Ready to scale a web service with more incoming bits/second than Twitter? 

Fun and Informative Events

  • Your amazing event here.

Cool Products and Services

  • Log management made easy with Logentries. Billions of log events analyzed every day to unlock insights from the log data that matters to you. Simply powerful search, tagging, alerts, live tail and more for all of your log data. Automated AWS log collection and analytics, including CloudWatch events.

  • Why choose Host Color for your webhosting needs? Redundant network, 100% Uptime SLA guarantee. Excellent location and high class data center. Powerful, resource-rich server systems. 24/7 Support - responsive, friendly and knowledgeable. Scalable, fairly priced, and risk-free hosting service.

  • LogicMonitor is the cloud-based IT performance monitoring solution that enables companies to easily and cost-effectively monitor their entire IT infrastructure stack – storage, servers, networks, applications, virtualization, and websites – from the cloud. No firewall changes needed - start monitoring in only 15 minutes utilizing customized dashboards, trending graphs & alerting.

  • Rapidly Develop Hadoop MapReduce Code. With ScaleOut hServer™ you can use a subset of your Hadoop data and run your MapReduce code in seconds for fast code development. You don’t need to load and manage the Hadoop software stack; it's a self-contained Hadoop MapReduce execution environment. To learn more check out

  • MongoDB Backup Free Usage Tier Announced. We're pleased to introduce the free usage tier to MongoDB Management Service (MMS). MMS Backup provides point-in-time recovery for replica sets and consistent snapshots for sharded systems with minimal performance impact. Start backing up today at

  • BlueStripe FactFinder Express is the ultimate tool for server monitoring and solving performance problems. Monitor URL response times and see if the problem is the application, a back-end call, a disk, or OS resources.

  • Aerospike Capacity Planning Kit. Download the Capacity Planning Kit to determine your database storage capacity and node requirements. The kit includes a step-by-step Capacity Planning Guide and a Planning worksheet. Free download.

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.

  • ManageEngine Applications Manager: Monitor physical, virtual and Cloud Applications.

  • : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Click to read more ...


What Happens While Your Brain Sleeps is Surprisingly Like How Computers Stay Sane

There's a deep similarity between how long running systems like our brains and computers accumulate errors and repair themselves. 

Reboot it. Isn’t that the common treatment for most computer ailments? And you may have noticed that, now that your iPhone supports background processing, it reboots a lot more often. Your DVR, phone, computer, router, car, and an untold number of long running computer systems all suffer from a nasty problem: over time they accumulate flaws and die or go crazy.

Now think about your brain. It’s a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep.

There’s new research out on how our brains are cleansed during sleep that has some interesting parallels to how we keep long running hardware-software systems up and running properly. This is a fun topic. Let’s explore it a little more...

Click to read more ...


Stuff The Internet Says On Scalability For December 20th, 2013

Hey, it's HighScalability time (with so much cool info this week it will blow your mind):

  • How many drones would it take to replace Santa? With a fleet of some 80 million or so F-16 drones the entire worldwide delivery could be completed in just over eight hours. Impressive, but a world without Rudolph is not a world I wish to contemplate.
  • Quotable Quotes:
    • @Loh: Always wanted to travel back in time to try fighting a younger version of yourself? Software development is the career for you!
    • @mraleph: often devs still approach performance of JS code as if they are riding a horse cart but the horse had long been replaced with fusion reactor
    • @peakscale: "The c3.large is 40% faster and has more than double the memory than the c1.medium but costs about the same"
    • @techmilind: Conversation with an ex-Yahoo, now at a Telecom company. Replaced $22M of Teradata by $450K of Hadoop in AWS. It's the economics, stupid!
    • Brett Slatkin: But Doom was a huge success anyways. Those things didn't matter. Carmack's team made the right vastly simplifying assumptions. It was worth it. This is how you apply the ethos of "worse is better" and not let the perfect be the enemy of the good.

  • OK, I understand why Gregor Rothfuss doesn't like how PHP chose function names to minimize hash collisions. It makes for strange names. But come on, isn't it sort of geeky cool too? Why should we let some arbitrary aesthetic sense create a world of baseless conformity? I love that this choice comes from such a pure technical ontology that it has transformed into artistic expression.

  • Et tu, Brute? NASA Drops OpenStack For Amazon Cloud. Ouch, that has to hurt. When a major contributor to an alliance defects it's not a good sign for the strength of the alliance. Of course paying off key players is a time honored way of crushing alliances.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...


How to get started with sizing and capacity planning, assuming you don't know the software behavior?

Here's a common situation and question from the mechanical-sympathy Google group by Avinash Agrawal on the black art of capacity planning:

How to get started with sizing and capacity planning, assuming we don't know the software behavior and its completely new product to deal with?

Gil Tene, Vice President of Technology and CTO & Co-Founder, wrote a very understandable and useful answer that is worth highlighting:

Click to read more ...


22 Recommendations for Building Effective High Traffic Web Software

This is a guest post by Ashwanth Fernando, Software Engineer from the trenches at large scale internet companies.

Inspired by the book "Effective Java" by Joshua Bloch, I wanted to share my holistic recommendations on building high traffic web software (i.e. web applications/services that serve high traffic loads). Some of these items may not be just about software design but also around surrounding areas such as the engineering organization, culture etc.

Two disclaimers up front:

1) This is my opinion.
2) There will be real-world situations where the principles below will be wrong, as in all things "software". Please use common sense all the time.

Consider using more than one datacenter

There have been numerous horror stories about businesses, ahem, going out of business because they had just a single datacenter. It's really important to have more than one datacenter if you want to protect yourself from natural disasters or electrical supply failures. Run all your datacenters in an active-active configuration. It may cost extra money, but it's well worth it compared with running an active-passive configuration and then finding out too late that, for some pieces of data, your passive hardware was not consistent with the active one.

Consider a sparse datacenter deployment

Click to read more ...


Stuff The Internet Says On Scalability For December 13th, 2013

Hey, it's HighScalability time:

Test your sense of scale. Is this image of something microscopic or macroscopic? Find out.

  • 80 billion: Netflix logging events per day; 10 petabytes: data; six million: Foursquare checkins per day
  • Quotable Quotes:
    • George Lakoff: Why can't all your thoughts be conscious? Because consciousness is linear and your brain is parallel. The linear structure of consciousness could never keep up.
    • @peakscale: "Engineers like to solve problems. If there are no problems handily available, they will create their own problems" - Scott Adams
    • @kiwipom:  “Immutability is magic pixie dust that makes distributed systems work” - Adrian Cockcroft 
    • @LachM: Netflix: SPEED at SCALE = breaks EVERYTHING. #yow13
    • Joe Landman: … you get really annoyed at the performance of grep on file IO (seriously folks? 32k or page size sized IO? What is this … 1992?) so you rewrite it in 20 minutes in Perl, and increase the performance by 5-8x or so.
    • @rjrogers87: "Goldman Sachs has 36,000 employees, 6,000 are developers. They support these folks w/half-a million cores. #GartnerDC"
    • @KentLangley: Dear Amazon AWS, Please STOP with the aggressive reserved instances sales push. I use on-demand ON PURPOSE.

  • Good story of a move from Google App Engine to EC2, nodejs, and mongodb. Migration decision based on: ease of development, performance and cost. GAE suffers from slow database operations, costly over provisioning of instances, slow startup times that lead to timeouts, a low 1MB memcache size limit, bulk loads and exports of data that are a nightmare, and slow search. Like about nodejs: one programming language on client and server, portability, fast. Like from GAE: could see everything in one place with the management console, easy deployment, easy to create new code and test it out.

  • Wikipedia's Order of Magnitude page: The pressure of a human bite is about 1/9th of the atmospheric pressure on Venus. The fastest bacterium on earth is just outstripping the fastest glacier. A square meter of sunshine in the spring imparts about 1 horsepower.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Click to read more ...