A big part of engineering for a quality experience is bringing in the long tail. An improbable severe failure can ruin your experience of a site, even if your average experience is quite good. That's where building for resilience comes in. Resiliency used to be outside the realm of possibility for the common system. It was simply too complex and too expensive.
An evolution has been underway, making 2013 possibly the first time resiliency is truly on the table as a standard part of system architectures. We are getting the clouds, we are getting the tools, and prices are almost low enough.
Even Netflix, real leaders in the resiliency architecture game, took some heat for relying completely on Amazon's ELB and not having a backup load balancing system, leading to a prolonged Christmas Eve failure. Adrian Cockcroft, Cloud Architect at Netflix, said they've investigated creating their own load balancing service, but that "we try not to invest in undifferentiated heavy lifting."
So resiliency is still not part of the standard package. There's an ROI calculation that has to be made. The path Netflix would have to take to create a hybrid architecture is fairly clear, yet Netflix prefers to concentrate on features rather than long tail events. That's a big difference. At one time designing for resiliency would have been unthinkable; now it's becoming a choice.
A good New Year's resolution might be to learn more about resilience. It's a new way of thinking compared to straightforward high availability. It's a full stack, full team, full system, environment centric mode of thought.
Fortunately, Dr. Richard Cook, Professor of Healthcare Systems Safety and Chairman of the Department of Patient Safety at the Kungliga Tekniska Högskolan, has been thinking about resilience for a long time. He gave a fascinating talk on resilience, How Complex Systems Fail, that is just detailed enough to be practical and high level enough to inspire new directions.
@hackofalltrades: When positive change is only viewed through its scalability, bad things happen.
@faizanj: Is it time for #Netflix to move to a hybrid cloud architecture similar to Zynga zCloud?
@adrianco: we try not to invest in undifferentiated heavy lifting
@qui_oui: "scalability": a word that makes me think of how likely you are to have the ability to grow scales.
@Ninad_M: The question is, is #antifragile conceptually opposite of #bigdata
@pbailis: Batch your disk/network IO, kernel interrupts, customer package shipments -> delay arrival but increase efficiency
@Carnage4Life: One lesson that is hard for people to learn. Knowing that something occurred is different from knowing why it occurred
The best tech documentation both informs about the technology and teaches the wider context in which it plays a part. That fits the 400+ page Akka Documentation perfectly. In it you'll find excellent information on actors and the various architectures that can be created with them. Much to learn here.
Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...
This topic has been ripped directly from Lambda the Ultimate's What will programming look like in 2020? post. They are having a lively discussion and if you are interested in flexing your holiday thought muscles we might have a good discussion too.
Eight years is a difficult prediction horizon. It's too short to simply project out current trends and it's too long to discount potential technological breakthroughs coming to market. There's the challenge.
Some of my lousy predictions:
Programmers Will Form Guilds Around New Gamified Training Hubs
The Web Will Become More Closed Before it Becomes More Open
Not Everyone Will Become a Programmer
Focus Will Shift to Creating Bigger People Instead of Chasing Big Ideas
Flurry has built large-scale app measurement and advertising services that are used by more than 80,000 media companies and independent developers to monetize mobile and related platforms. If you're interested in joining a thriving, growing team, please check us out.
Rumble Games is looking for a Senior Platform Engineer to build massively scalable and shared services for the next generation of online games. We have the best team this industry has seen, and we will transform the way people play together. Join us.
Duolingo, a fast-growing (>11% per week), free (no ads, no fees, no subscriptions) language learning site, is looking for an infrastructure engineer to scale Duolingo to millions of users. Please apply here.
We need awesome people @ Booking.com - We want YOU! Come design next generation interfaces, solve critical scalability problems, and hack on one of the largest Perl codebases. Apply: http://www.booking.com/jobs.en-us.html
Hadapt is looking for software engineers. Come shape a cutting-edge technology while working in the fun, collaborative environment of a fast-paced start-up.
The New York Times is seeking a developer focused on infrastructure to join its newsroom development team. Read the full description here and send resumes to chadas@nytimes.com.
New Relic is looking for a Java Scalability Engineer in Portland, OR. Ready to scale a web service with more incoming bits/second than Twitter? http://newrelic.com/about/jobs
aiCache creates a better user experience by increasing the speed, scale, and stability of your web-site. Test aiCache acceleration for free. No sign-up required. http://aicache.com/deploy
Aerospike: Two Trillion Transactions per month...100 million stored user profiles...25% of all video ads processed on the internet - mere realities of success for Aerospike customers. Industry leaders reveal their secrets!
Follow the Cloudify blog to learn more about our open source PaaS stack – latest integration recipes, builds, features, and other cool stuff. Visit the GigaSpaces blog to learn how to take your application to the next level of scalability and performance.
NetDNA, a Tier-1 Global Content Delivery Network, offers a Dual-CDN strategy which allows companies to utilize a redundant infrastructure while leveraging the advantages of multiple CDNs to reduce costs.
LogicMonitor - Hosted monitoring of your entire technology stack. Dashboards, trending graphs, alerting. Try it free and be up and running in just 15 minutes.
AppDynamics is the very first free product designed for troubleshooting Java performance while getting full visibility in production environments. Visit http://www.appdynamics.com/free.
@shipilev: I've settled on saying that if performance is the scalar field in state space, then scalability is just it's gradient.
@AndiMann: "Only 1% of #Amazon users should care about #cloud scalability, elasticity". Brilliant!
@Guerrero_FJ: Always remember: 'scalability problems should be solved when there are scalability problems.' #leanstartup
Santa's Architecture: It's a little-known fact that Santa Claus was an early queue innovator. Faced with the problem of delivering a planet full of presents in one night, Santa, in his hacker's workshop, created a Present Distribution System using thousands of region based priority present queues for continuous delivery by the Rudolphs. Rudolphs? You didn't think there was only one Rudolph, did you? Presents are delivered in parallel by a cluster of sleighs, each with redundant reindeer in a master-master configuration. Each Rudolph is a cluster leader and they coordinate work using an early and more magical version of the ZooKeeper protocol.
...
In MDCC: Multi-Data Center Consistency Murat discusses a paper that says synchronous wide-area replication can be feasible. There's a quick and clear explanation of Paxos and various optimizations that is worth the price of admission. We find that strong consistency doesn't have to be lost across a WAN:
The good thing about using Paxos over the WAN is you /almost/ get the full CAP (all three properties: consistency, availability, and partition-freedom). As we discussed earlier (Paxos taught), Paxos is CP, that is, in the presence of a partition, Paxos keeps consistency over availability. But, Paxos can still provide availability if there is a majority partition. Now, over a WAN, what are the chances of having a partition that does not leave a majority? WAN has a lot of redundancy. While it is possible to have a data center partitioned off the Internet due to a calamity, what are the chances of several knocked off at the same time. So, availability is also looking good for MDCC protocol using Paxos over WAN.
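The majority argument in that quote can be sketched in a few lines. This is only an illustration of the quorum arithmetic, not the MDCC protocol or a Paxos implementation; the five data center names and the one-replica-per-data-center layout are assumptions made up for the example.

```python
def majority_side(partition_groups):
    """Return the group holding a strict majority of all replicas, or None.

    Each group is a set of replicas that can still talk to each other.
    A Paxos-style protocol can keep making progress only inside a group
    that contains more than half of all replicas.
    """
    total = sum(len(group) for group in partition_groups)
    for group in partition_groups:
        if len(group) > total // 2:
            return group
    return None


# Hypothetical deployment: five data centers, one replica in each.
one_dc_cut_off = [{"us-west", "us-east", "eu", "asia"}, {"sa"}]
three_way_split = [{"us-west", "us-east"}, {"eu", "asia"}, {"sa"}]

# Losing one data center still leaves a majority of four out of five.
still_available = majority_side(one_dc_cut_off)

# A three-way split leaves no group with three or more replicas,
# so no side can safely accept writes.
nobody_available = majority_side(three_way_split)
```

The second case is the calamity the quote considers unlikely over a WAN: several data centers have to be knocked off at the same time before the majority disappears.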
To alleviate this latency versus consistency tension, this paper proposes RedBlue consistency, which enables blue operations to be fast/asynchronous (and eventually consistent) while the remaining red operations are strongly-consistent/synchronous (and slow). So a program is partitioned into red and blue operations, which run with different consistency levels. While red operations must be executed in the same order at all sites (which make them slow), the order of execution of blue operations can vary from site to site (allowing them to be executed without requiring coordination across sites). "In systems where every operation is labeled red, RedBlue consistency is equivalent to serializability; in systems where every operation is labeled blue, RedBlue consistency allows the same set of behaviors as eventual consistency."
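A toy way to see why some operations can be blue and others have to be red is to check whether they commute. The bank account below is a hypothetical example chosen for illustration, and the sketch only demonstrates the ordering property; it does not model the replication machinery the paper describes.

```python
import itertools


def deposit(amount):
    """Blue candidate: deposits commute, so sites may apply them in any order."""
    return lambda balance: balance + amount


def withdraw(amount):
    """Red candidate: the outcome depends on the balance it happens to see."""
    def op(balance):
        if balance < amount:
            return balance  # rejected, the balance may not go negative
        return balance - amount
    return op


def apply_all(balance, ops):
    for op in ops:
        balance = op(balance)
    return balance


# Every site can apply the blue operations in its own order and still converge.
blue_ops = [deposit(10), deposit(25), deposit(5)]
blue_outcomes = {
    apply_all(100, order) for order in itertools.permutations(blue_ops)
}

# Mixing in a withdrawal without agreeing on an order lets sites diverge.
site_a = apply_all(100, [withdraw(120), deposit(50)])  # withdrawal rejected
site_b = apply_all(100, [deposit(50), withdraw(120)])  # withdrawal accepted
```

Here `blue_outcomes` holds a single balance, while `site_a` and `site_b` end up with different balances. That divergence is what labeling the withdrawal red is meant to prevent, at the cost of coordinating its order across sites.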
Just a little fun holiday reading :-)
Murat also has a number of excellent posts that are a great boon for understanding the innards of distributed systems:
It's a little-known fact that Santa Claus was an early queue innovator. Faced with the problem of delivering a planet full of presents in one night, Santa, in his hacker's workshop, created a Present Distribution System using thousands of region based priority present queues for continuous delivery by the Rudolphs. Rudolphs? You didn't think there was only one Rudolph, did you? Presents are delivered in parallel by a cluster of sleighs, each with redundant reindeer in a master-master configuration. Each Rudolph is a cluster leader and they coordinate work using an early and more magical version of the ZooKeeper protocol.
Programmers have followed Santa's lead and you can find a message queue in nearly every major architecture profile on HighScalability. Historically they may have been introduced after a first generation architecture needed to scale up from their two tier system into something a little more capable (asynchronicity, work dispatch, load buffering, database offloading, etc). If there's anything like a standard structural component, like an arch or beam in architecture for software, it's the message queue.
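For a feel of what the queue buys you (asynchronicity, work dispatch, load buffering), here is a minimal in-process sketch using Python's standard `queue` and `threading` modules. A production architecture would use a real message broker; the `run_workers` helper, the buffer size of 100, and the doubling "work" are all stand-ins invented for the example.

```python
import queue
import threading


def run_workers(jobs, num_workers=3):
    """Dispatch jobs to a pool of workers through a bounded queue."""
    work = queue.Queue(maxsize=100)  # bounded buffer absorbs bursts of load
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = work.get()
            if job is None:  # sentinel: no more work for this worker
                work.task_done()
                break
            with results_lock:
                results.append(job * 2)  # stand-in for the real work
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    # The producer only enqueues; it blocks if the buffer is full.
    for job in jobs:
        work.put(job)
    for _ in threads:
        work.put(None)

    work.join()  # wait until every queued item has been processed
    for thread in threads:
        thread.join()
    return sorted(results)


doubled = run_workers(range(10))
```

The producer never calls a worker directly, which is the point: either side can be scaled, slowed down, or replaced without the other knowing.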
In a hole in the Internet there lived HighScalability:
$140 Billion: trivial cost of Google fiber everywhere; 5,200 GB: data for every person on Earth; 6 hours: time it takes for a 25-GPU cluster to crack all the passwords;
Quotable Quotes:
hnriot: Good architecture eliminates the need for prayer.
@adrianco: we break AWS, they fix it. Stuff that's breaking now is mostly stuff other clouds haven't got to yet.
Scalability Rules: Design for 20x capacity. • Implement for 3x capacity. • Deploy for ~1.5x capacity.
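As a quick worked example of the 20x / 3x / 1.5x rule, assuming the multipliers apply to your current peak load (the quote doesn't say what the baseline is) and using a made-up figure of 2,000 requests per second:

```python
def capacity_targets(current_peak):
    """Apply the 20x / 3x / ~1.5x multipliers to a current peak load.

    Treating the current peak as the baseline is an assumption;
    the quoted rule does not name one.
    """
    return {
        "design": 20 * current_peak,
        "implement": 3 * current_peak,
        "deploy": 1.5 * current_peak,
    }


# Hypothetical service peaking at 2,000 requests per second.
targets = capacity_targets(2000)
```

Under those assumptions the architecture should be able to stretch to 40,000 requests per second, the code that's actually written should handle 6,000, and the hardware running today should cover roughly 3,000.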
Fast typing Aaron Delp with his AWS re:Invent Werner Vogels Keynote Live Blog. Some key points: Decompose into small loosely coupled, stateless building blocks; Automate your application and processes; Let Business levers control the system; Architect with cost in mind; Protecting your customer is the first priority; In production, deploy to at least two availability zones; Integrate security into your application from the ground up; Build, test, integrate and deploy continuously; Don't think in single failures; Assume Nothing.
We've long known that one of the virtues of the cloud is that, through the magic of services and automation, systems can be shut down or tuned down when not in use. What may be surprising is how much money can be saved.