Stuff The Internet Says On Scalability For February 26th, 2016

Wonderful diagram of @adrianco Microservices talk at #OOP2016 by @remarker_eu  


If you like this sort of Stuff then please consider offering your support on Patreon.

  • 350,000: new Telegram users per day; 15 billion: messages delivered by Telegram per day; 50 billion suns: max size of a black hole; 10,000x: lower power for Wi-Fi; 400 hours: video uploaded to YouTube every minute;

  • Quotable Quotes:
    • sharemywin: I don't think consensus scales. So, I think they'll be an ecosystem of block chains.
    • @aneel: "There is no failover process other than the continuous dynamic load balancing." 
    • Jono MacDougall: If you are happy hosting your own solution, use Cassandra. If you want the ease of scaling and operations, Use DynamoDB.
    • @plamere: Google’s BigQuery is *da bomb* - I can start with 2.2Billion ‘things’ and compute/summarize down to 20K in < 1 min.
    • Haifa Moses: We’re evaluating a totally new software model that allows us to automatically diagnose if a failure occurs during a mission and for messages to be displayed for flight controllers on the ground
    • @fmbutt: IBM abstracted analog calculation. MS abstracted HW. Goog abstracted SW. Powerful Mobile AI could abstract clouds. 
    • Jon Grall: Essentially, there’s a massive oversupply of apps, and the app markets are now saturated and suffering from neglect and short-term thinking by the companies who operate them. 
    • jhgg: At work we moved to GCE at the beginning of this year, from Linode after they were having stability issues over the christmas break. No complaints from us. So far have been very happy with it. We were considering moving to AWS, but to realize the same pricing as GCE we'd have to purchase reserved instances - the sustained usage discounts have been huge for us.
    • Brave New Geek: Python and App Engine were fast. Not like “this code is f*cking fast” fast—what we call performance—more like “we need to get this sh*t working so we have jobs tomorrow” fast—what we call delivery. 
    • There are more Quotable Quotes in the full article.

  • You have to love the datacenter of the future. Data is stored in the DNA of seeds. Compute inhabits electronic plants using xylem, leaf, and vein in the creation of digital organic electronic circuits. Instead of walking into a cold dead datacenter we'll frolic in an uplifted Garden of Eden.

  • Relying on APIs is like building bridges and skyscrapers out of materials that constantly change their properties. Just Landed is Shutting Down: Since Just Landed launched in 2012, the cost of running the service has steadily increased over time. While flight data remains expensive, the real source of the cost increases has been adapting to the demise or restructuring of supporting services such as StackMob, UrbanAirship, and Bing Maps that Just Landed previously relied on. Traffic and mapping data in particular, much of which used to be free, has become quite expensive, and is now tightly controlled by big companies under oppressive Terms of Service.

  • With Spotify moving to the Google Cloud Platform it looks like Google may have found a friendly marketing face to play the same role Netflix plays for AWS. Why make the move? nrh: Spotifier here. Frankly, price is not the biggest factor in a decision like this. If we were going for the lowest cost cloud option, it probably wouldn't be either AWS or Google - there are other providers who are hungrier for business that would be willing to do deep cuts at our scale. The way we think about this is that there are basically two classes of cloud services: commodities and differentiated services. Commodities are storage/network/compute, and the big players are going to compete on price and quality on these for the foreseeable future (as with most commodities). The differentiated services stuff is a bit more interesting. Different players have different strengths and weaknesses here - AWS has way, way better capabilities when it comes to administration and access control and identity management, for example (which is actually pretty important when trying to do this in a large org). The places where Google is strong (data platform) are the places that are most important for us as a business. Compelling: dataproc+gcs, bigquery, pubsub, dataflow. Made it safe: high-enough quality, cheap enough.



When Should Approximate Query Processing Be Used?

This is a guest repost by Barzan Mozafari, an assistant professor at the University of Michigan and an advisor to a new startup that recently launched an open source OLTP + OLAP database built on Spark.

The growing market for Big Data has created a lot of interest around approximate query processing (AQP) as a means of achieving interactive response times (e.g., sub-second latencies) when faced with terabytes and petabytes of data. At the same time, there is a lot of misinformation about this technology and what it can or cannot do.

Having been involved in building a few academic prototypes and industrial engines for approximate query processing, I have heard many interesting statements about AQP and/or sampling techniques (from both DB vendors and end-users):

Myth #1. Sampling is only useful when you know your queries in advance
Myth #2. Sampling misses out on rare events or outliers in the data
Myth #3. AQP systems cannot handle join queries
Myth #4. It is hard for end-users to use approximate answers
Myth #5. Sampling is just like indexing
Myth #6. Sampling will break the BI tools
Myth #7. There is no point approximating if your data fits in memory

Although there is a grain of truth behind some of these myths, none of them are actually accurate. There are many different forms of sampling, approximation, and error quantification, and their nuances are missed by these blanket statements. In other words, many of these impressions are simply based on wrong assumptions and/or misunderstanding of basic AQP terminology.
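To make "sampling with error quantification" a little more concrete, here is a minimal Python sketch of one of the simplest forms: estimating an average from a uniform random sample and attaching an approximate 95% confidence interval. The function name `approximate_mean`, its parameters, and the choice of a normal-approximation interval are illustrative assumptions, not the method of any particular AQP engine.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=None, z=1.96):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width). The interval estimate +/- half_width
    is an approximate 95% confidence interval when z=1.96. This is an
    illustrative sketch: it ignores the finite-population correction and
    relies on the normal approximation.
    """
    rng = random.Random(seed)
    # Uniform sampling without replacement.
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean.
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


if __name__ == "__main__":
    # Hypothetical data: 100,000 values, of which only 1,000 are scanned.
    data = [float(i % 100) for i in range(100_000)]
    estimate, half_width = approximate_mean(data, 1_000, seed=7)
    print(f"approximate mean: {estimate:.2f} +/- {half_width:.2f}")
    print(f"exact mean:       {statistics.fmean(data):.2f}")
```

The point of the sketch is the trade: the answer comes from 1% of the rows, and the half-width tells the user how much to trust it. Stratified or biased samples, which real systems often use for rare groups, need different estimators.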

Anyhow, instead of going over each of these statements and explaining why they are categorically wrong, in this post I’d like to answer the positive question: When can (and should) one use approximate answers? Note that by asking this question, I am implicitly giving away that I don’t think approximate answers are always useful. A perfect example where you don’t want to use approximation is in billing departments. (Although every time I look at my own Internet bill, I start to think that even this example has its own exceptions. I’m too afraid to mention my Internet provider’s name here but I am sure you can guess).

Anyhow, let’s discuss the key reasons and use-cases for approximate answers.

1. Use AQP when you care about interactive response times



Google's Transition from Single Datacenter, to Failover, to a Native Multihomed Architecture


Making a system work in one datacenter is hard. Now imagine you move to two datacenters. Now imagine you need to support multiple geographically distributed datacenters. That’s the journey described in another excellent and thought provoking paper from Google: High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads.

The main idea of the paper is that the typical failover architecture used when moving from a single datacenter to multiple datacenters doesn’t work well in practice. What does work, where work means using fewer resources while providing high availability and consistency, is a natively multihomed architecture:

Our current approach is to build natively multihomed systems. Such systems run hot in multiple datacenters all the time, and adaptively move load between datacenters, with the ability to handle outages of any scale completely transparently. Additionally, planned datacenter outages and maintenance events are completely transparent, causing minimal disruption to the operational systems. In the past, such events required labor-intensive efforts to move operational systems from one datacenter to another

The use of “multihoming” in this context may be confusing because multihoming usually refers to a computer connected to more than one network. At Google scale perhaps it’s just as natural to talk about connecting to multiple datacenters.

Google has built several multi-homed systems to guarantee high availability (4 to 5 nines) and consistency in the presence of datacenter level outages: F1 / Spanner: Relational Database; Photon: Joining Continuous Data Streams; Mesa: Data Warehousing. The approach taken by each of these systems is discussed in the paper, as are the many challenges in building a multi-homed system: Synchronous Global State; What to Checkpoint; Repeatable Input; Exactly Once Output.

The huge constraint here is having availability and consistency. This highlights the refreshing and continued emphasis Google puts on making even these complex systems easy for programmers to use:

The simplicity of a multi-homed system is particularly valuable for users. Without multi-homing, failover, recovery, and dealing with inconsistency are all application problems. With multi-homing, these hard problems are solved by the infrastructure, so the application developer gets high availability and consistency for free and can focus instead on building their application.

The biggest surprise in the paper was the idea that a multihomed system can actually take far fewer resources than a failover system:

In a multi-homed system deployed in three datacenters with 20% total catchup capacity, the total resource footprint is 170% of steady state. This is dramatically less than the 300% required in the failover design above
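One way to arrive at those two figures, sketched in Python. The provisioning rules here are assumptions chosen to reproduce the quoted numbers, not a derivation taken from the quote itself: in the failover case, two datacenters are each sized for the full steady-state load plus 50% catch-up headroom; in the multihomed case, three datacenters are sized so that any two can carry the full load, plus a shared 20% catch-up pool. The function names are hypothetical.

```python
def failover_footprint(datacenters=2, catchup_pct=50):
    """Total capacity (% of steady state) when every site must be able
    to run the whole workload alone, plus its own catch-up headroom."""
    return datacenters * (100 + catchup_pct)


def multihomed_footprint(datacenters=3, total_catchup_pct=20):
    """Total capacity (% of steady state) when load is shared across
    sites and the system must survive the loss of any one site."""
    # The surviving (datacenters - 1) sites must carry 100% between them.
    per_site = 100 / (datacenters - 1)
    return datacenters * per_site + total_catchup_pct


print(failover_footprint())    # 300
print(multihomed_footprint())  # 170.0
```

Under these assumptions the saving comes from two places: no site sits idle waiting for a failure, and catch-up capacity is pooled rather than duplicated per site.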

What’s Wrong With Failover?



Building nginx and Tarantool based services

Are you familiar with this architecture? A bunch of daemons are dancing between a web-server, cache and storage.

What are the cons of such an architecture? While working with it we come across a number of questions: Which language(s) should we use? Which I/O framework should we choose? How do we synchronize cache and storage? Lots of infrastructure issues. And why should we solve the infrastructure issues when we need to solve a task? Sure, we can say that we like some X and Y technologies and treat these cons as ideological. But we can’t ignore the fact that the data is located some distance away from the code (see the picture above), which adds latency that could decrease RPS.

The main idea of this article is to describe an alternative, built on nginx as web server and load balancer, and Tarantool as app server, cache, and storage.

Improving cache and storage



Sponsored Post: Swrve, Netflix, Macmillan Learning, Aerospike, TrueSight Pulse, LaunchDarkly, Robinhood, Redis Labs, InMemory.Net, VividCortex, MemSQL, Scalyr, AiScaler, AppDynamics, ManageEngine, Site24x7

Who's Hiring?

  • Swrve -- In November we closed a $30m funding round, and we’re now expanding our engineering team based in Dublin (Ireland). Our mobile marketing platform is powered by 8bn+ events a day, processed in real time. We’re hiring intermediate and senior backend software developers to join the existing team of thirty engineers. Sound like fun? Come join us.

  • Macmillan Learning, a premier e-learning institute, is looking for a VP of DevOps to manage the DevOps teams based in New York and Austin. This is a very exciting team as the company is committed to fully transitioning to the Cloud, using a DevOps approach, with focus on CI/CD, and using technologies like Chef/Puppet/Docker, etc. Please apply here.

  • DevOps Engineer at Robinhood. We are looking for an Operations Engineer to take responsibility for our development and production environments deployed across multiple AWS regions. Top candidates will have several years experience as a Systems Administrator, Ops Engineer, or SRE at a massive scale. Please apply here.

  • Senior Service Reliability Engineer (SRE): Drive improvements to help reduce both time-to-detect and time-to-resolve while concurrently improving availability through service team engagement.  Ability to analyze and triage production issues on a web-scale system a plus. Find details on the position here:

  • Manager - Performance Engineering: Lead the world-class performance team in charge of both optimizing the Netflix cloud stack and developing the performance observability capabilities which 3rd party vendors fail to provide.  Expert on both systems and web-scale application stack performance optimization. Find details on the position here

  • Software Engineer (DevOps). You are one of those rare engineers who loves to tinker with distributed systems at high scale. You know how to build these from scratch, and how to take a system that has reached a scalability limit and break through that barrier to new heights. You are a hands on doer, a code doctor, who loves to get something done the right way. You love designing clean APIs, data models, code structures and system architectures, but retain the humility to learn from others who see things differently. Apply to AppDynamics

  • Software Engineer (C++). You will be responsible for building everything from proof-of-concepts and usability prototypes to deployment-quality code. You should have at least 1+ years of experience developing C++ libraries and APIs, and be comfortable with daily code submissions, delivering projects in short time frames, multi-tasking, handling interrupts, and collaborating with team members. Apply to AppDynamics

Fun and Informative Events

  • Your event could be here. How cool is that?

Cool Products and Services

  • Powering the Internet of Moving Things with Aerospike. Geospatial data has become dynamic and rich in content. While high-performance NoSQL database Aerospike has shown the need for Speed at Scale for regular data, it’s needed for geospatial data too. Aerospike’s 3.7 release adds geospatial capabilities that enable developers to build high-throughput and low-latency applications around geospatial data. See how easy it is to build a Drone Delivery System – and other rich, complex applications that need Speed at Scale – using Aerospike’s geospatial capabilities and APIs, and watch the video here.

  • Dev teams are using LaunchDarkly’s Feature Flags as a Service to get unprecedented control over feature launches. LaunchDarkly allows you to cleanly separate code deployment from rollout. We make it super easy to enable functionality for whoever you want, whenever you want. See how it works.

  • TrueSight Pulse is SaaS IT performance monitoring with one-second resolution, visualization and alerting. Monitor on-prem, cloud, VMs and containers with custom dashboards and alert on any metric. Start your free trial with no code or credit card.

  • Turn chaotic logs and metrics into actionable data. Scalyr is a tool your entire team will love. Get visibility into your production issues without juggling multiple tools and tabs. Loved and used by teams at Codecademy, ReturnPath, and InsideSales. Learn more today or see why Scalyr is a great alternative to Splunk.

  • InMemory.Net provides a Dot Net native in memory database for analysing large amounts of data. It runs natively on .Net, and provides native .Net, COM & ODBC APIs for integration. It also has an easy to use language for importing data, and supports standard SQL for querying data. http://InMemory.Net

  • VividCortex measures your database servers’ work (queries), not just global counters. If you’re not monitoring query performance at a deep level, you’re missing opportunities to boost availability, turbocharge performance, ship better code faster, and ultimately delight more customers. VividCortex is a next-generation SaaS platform that helps you find and eliminate database performance problems at scale.

  • MemSQL provides a distributed in-memory database for high value data. It's designed to handle extreme data ingest and store the data for real-time, streaming and historical analysis using SQL. MemSQL also cost effectively supports both application and ad-hoc queries concurrently across all data. Start a free 30 day trial here:

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Also available on Amazon Web Services. Free instant trial, 2 hours of FREE deployment support, no sign-up required.

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • : Monitor End User Experience from a global monitoring network.



Egnyte Architecture: Lessons Learned in Building and Scaling a Multi Petabyte Distributed System

This is a guest post by Kalpesh Patel, an Architect, who works from home. He and his colleagues spend their productive hours scaling one of the largest distributed file systems out there. He works at Egnyte, an Enterprise File Synchronization Sharing and Analytics startup and you can reach him at @kpatelwork.

Your laptop has a filesystem used by hundreds of processes. It is limited by disk space, it can’t expand storage elastically, and it chokes if you run a few I/O-intensive processes or try sharing it with 100 other users. Now take this problem and magnify it to a filesystem used by millions of paid users spread across the world, and you get a roller coaster ride scaling the system to meet monthly growth needs and SLA requirements.

Egnyte is an Enterprise File Synchronization and Sharing startup founded in 2007, when Google Drive wasn't born and AWS S3 was cost prohibitive. Our only option was to roll up our sleeves and build an object store ourselves. Over time, costs for S3 and GCS became reasonable, and because our storage layer is based on a plugin architecture, we can now plug in any storage backend that is cheaper. We have re-architected many of the core components multiple times, and in this article I will try to share the current architecture, the lessons we learned scaling it, and the things we can still improve upon.

The Platform



Stuff The Internet Says On Scalability For February 12th, 2016

Maybe this year's Mavericks can ride some gravitational waves?  


If you like this sort of Stuff then please consider offering your support on Patreon.

  • 3.96 Million: viewers streaming the Super Bowl; 1000 kilometers: roads made of solar panels in France; 500mg: amount of chlorophyll absorbing photons in a tree;

  • Quotable Quotes:
    • cmyr: soundcloud actually represents a very important cultural document; there is tons of music that has been created in the past 5+ years that exists exclusively there, and it would be a tremendous cultural loss if it were to disappear. 
    • aback: we will continue to endure substantial cultural losses for so long as people continue to believe that content can & should be distributed and consumed for free.
    • @dotemacs: - We’re moving to Java + Spring. - Why? - Cos of threads… and scaling… - OK … what exactly…? - I’m just going by what I was told.
    • @asolove: Chaos Monkey People: every week, one randomly-selected person must take the whole week off regardless of their current work. Org must adapt.
    • @waynejwerner: we almost started using them, but couldn't find examples of not Google / Netflix using containers in production
    • Oisin Hanrahan: Scaling is a never ending process of revisiting the same challenges at different volume points and learning from each and every one so that we get better and better. 
    • @nickcalyx: Q: What's the difference between USA and USB? A: One connects to all your devices & accesses your data, and the other is a hardware standard
    • @mathewi: In an alternate universe, Twitter never went public and is a profitable real-time news utility with an open API and multiple revenue streams
    • @etherealmind: “QOS in the Internet is like equipping fish with tricycles” - Geoff Huston - I laughed so hard. (podcast out later today)
    • Jef Akst: Plants may trick bacteria into attacking before the microbial population reaches a critical size, allowing the plants to successfully defend the weak invasion.
    • @CapgeminiIndia: Every uber car replaces 9 personal cars, digital innovation has a wider impact than you think. - Noshir Kaka #Nasscom_ILF
    • @EmperiorEric: CloudKit is fantastic, and its scaling tiers are not only amazing when free, but extremely cheap when not. But daily bandwidth scales slow.
    • Palmer Luckey: If a year from now we can have the people who buy into VR [Oculus] using their headsets every day, every week, coming back to it regularly, that’s what makes VR a huge success
    • Fister: all of the archives in the world could be stored in one box of seeds.
    • Annisa Cinderakasih: I really like the idea walking through the park/botanical garden which actually a ‘data library’, while I can listen to the narration of its data inside just by touching it.
    • @swardley: The time of war is upon us in gaming and many components of the gaming value chain are ripe for shifting from product to commodity (+utility) forms. 
    • @joyent: Q: "Are we at peak confusion in containers?" A: "5 more schedulers and we’re there." - @frazelledazzell from @docker at #ContainerSummit
    • @capotej: scala career timeline: year 1: dope, gonna write some terse code year 2: hmm, maybe i shouldnt use every feature year 3: java 8 looks nice
    • @skamille: OH: "I thought cloud native referred to engineers born after AWS launched?"
    • AaronLasseigne: It's [Snapchat] a video game for her. She doesn't look at the photos in the morning, she just responds to keep the chain going. It's like grinding. She has a score, tries to improve it, get trophies, new equipment (I mean filters), etc.
    • @kentbye: Google will put out a Project Tango phone in 2016 with 3D depth sensors. It'll bootstrap consumer AR at mobile scale 
    • @pas256: Google Cloud Functions are completely different to AWS Lambda: exports.fn = function(context, data) vs exports.fn = function(event, context)
    • The Next Miracle Drug is an Algorithm.
    • @viktorklang: "The secret to very responsive systems is to keep the utilization down." - @mjpt777 (from Queuing Theory)
    • Mark Anderson: The cloud and the Internet of Things will likely provide plenty of snooping opportunities for the agency and others like it.
    • @WhatTheFFacts: It would take you 10 years to view all the photos shared on Snapchat in the last hour.
    • @kevinmarks: government should not be an app, but a protocol. Unauthenticated read, POST with nuance
    • @noggin143: the cern openstack cloud at 155K cores has no proprietary extensions IMHO.
    • @beaucronin: The VR strategy chess game is getting very serious 
    • @HoustonTexasVR: Blows my mind @Amazon can buy a game dev engine & release it free, no royalties just to promote uptake of its web services / cloud platform.
    • Kieren McCarthy: We're going to use your toothbrush to snoop on you, says US spy boss
    • Steve Ranger: In a surveillance economy, privacy represents an opportunity for profit forgone.
    • sudovoodoo: Should say that we horizontally scale this [] thing pretty heavily using the sc-redis module. Elasticache + ELB + 4 EC2 Instances = support for 5000+ person conferences :D
    • There are more Quotable Quotes in the full article.

  • Netflix has completed its seven-year odyssey of moving all its operations to the cloud: Completing the Netflix Cloud Migration. It's really a love story. Netflix has grown its streaming membership 8x since 2008 and overall viewing has grown by three orders of magnitude. To support this growth they could not have racked servers fast enough in their own datacenter. Nor could they have grown worldwide to support 130 new countries. Reliability is up, approaching four nines. Costs are down; the cost per streaming start ended up being a fraction of that in the datacenter. Cloud elasticity is the driver for reduced costs. It's possible to continuously optimize instance type mix and to grow and shrink their footprint near-instantaneously without the need to maintain large capacity buffers. And on their voyage they've taken us all with them, teaching everyone what it means to operate at scale in the cloud. Netflix Open Source

  • 1 billion Apple devices are in active use around the world. Interesting metric, to talk about the number of devices instead of the number of users. Apple is saying we may not be able to grow users as fast as we used to, let's count devices instead, that we can grow, and BTW, we plan on making a lot more devices to sell to a loyal customer base. Also Wall Street, that's a pretty big market to sell accessories and services into. We're not dead yet.

  • Meta fatigue, there's no pill for it, but JavaScript fatigue fatigue offers tips against feeling overwhelmed: Don’t try to know everything; Wait for the critical mass; Stick to things you understand: don’t use more than 1–2 new technologies per project; Do exploratory toy projects.



How to build your Property Management System integration using Microservices

This is a guest post by Rafael Neves, Head of Enterprise Architecture at ALICE, a NY-based hospitality technology startup. While the domain is Property Management, it's also a good microservices intro.

In a fragmented world of hospitality systems, integration is a necessity. Your system will need to interact with different systems from different providers, each providing its own Application Program Interface (API). Not only that, but the more hotel customers you integrate with, the more instances you will need to connect to and manage. A Property Management System (PMS) is the core system of any hotel and integration is paramount as the industry moves to become more connected.


To provide software solutions in the hospitality industry, you will certainly need to establish a 2-way integration with the PMS providers. The challenge is building and managing these connections at scale, with multiple PMS instances across multiple hotels. There are several approaches you can leverage to implement these integrations. Here, I present one simple architectural design for building an integration foundation that will increase ROI as you grow. This approach is the use of microservices.

What are microservices? 



A Smallish List of Parse Migration Guides

Since Parse's big announcement it looks like the release of migration guides from various alternative services has died down. 

The biggest surprise is the rise of Parse's own open source Parse Server. Check out its commit velocity on GitHub. It seems to be on its way to becoming a vibrant and viable platform.

The immediate release of Parse Server with the announcement of the closing of Parse was surprising. How could it be out so soon? That's a lot of work. Some options came to mind. Maybe it's a version of an on-premise system they already had in the works? Maybe it's a version of the simulation software they use for internal testing? Or maybe they had enough advance notice that they could make an open source version of Parse?

The winner is...

Charity Majors, formerly of Parse/Facebook, tells all in How to Survive an Acquisition:

Massive props to Kevin Lacker and those who saw the writing on the wall and did an amazing job preparing to open up the ecosystem.

That's impressive. It seems clear the folks at Parse weren't on board with Facebook's decision, but they certainly did everything possible to make the best out of a bad situation. It's even possible this closure could be a good thing for Parse in the long run if open source support continues to flourish.

Here's a list of different ways of getting from here to there...

Migration Guides



What's Next? The NFL's Magic Yellow Line Shows the Way to Augmented Reality


Update: Amazon just released Lumberyard, a free AAA game engine deeply integrated with AWS & Twitch.

What’s next? Mobile is entering its comforting middle age period of development. Conversational commerce is a thing, a good thing, but is it really a great thing?

What’s next may be what has been next for decades: Augmented reality (AR) (and VR). AR systems will be here sooner than you might think. A matter of years, not decades. Robert Scoble, for example, thinks Meta, an early startup in the AR industry, will be bigger than the Macintosh. More on that in a later post. Magic Leap has no product and $1.3 billion in funding. Facebook has Oculus. Microsoft has HoloLens. Google may be releasing a VR system later this year. Apple is working on VR. Becoming the next iPhone is up for grabs.

AR is a Huge Opportunity for Programmers and Startups 

This is a technological revolution that will be bigger than mobile. Opportunities in mobile for developers have largely played out. Experience shows the earlier you get in on a revolution the better the opportunity will be. Do you want to be writing free iOS apps forever?

It’s so early we don’t really have an idea what AR is or what the market will be or what it means from a developer perspective. But if you watched the Super Bowl you saw an early example of the power of AR. It’s the benign looking, yet technically impressive, computer generated yellow first down line marker.

Augmented Reality is Already a Sports Reality
