7 Secrets to Successfully Scaling with Scalr (on Amazon) by Sebastian Stadil

This is part interview, part guest post, with Sebastian Stadil, founder of Scalr, a cheaper open-source alternative to RightScale. Scalr takes care of all the website infrastructure bits on Amazon (and other clouds) so you don't have to.

I first met Sebastian at one of the original Silicon Valley Cloud Computing Group meetups, a group he founded. The meetings started in the tiny offices of Intalio, where Sebastian was working with this newfangled Amazon thing to create an auto-scaling server farm on EC2. I remember distinctly how Sebastian met everyone at the door with a handshake and a smile, making us all feel welcome. Later I took one of the classes he created on how to use AWS. I guess he figured all this cloud stuff was going somewhere and decided to start Scalr.

My only regret about this post is that the name Amazon does not begin with the letter 'S'; that would have made for an epic title.

Getting to Know You

In this section are the responses Sebastian gave to a few background questions about Scalr.

What is Scalr?

Scalr is that guy you turn to for help with server stuff: Scalr helps you create and manage scalable infrastructure.

When did you start?

Scalr was released to the community under the GPL in April 2008, so next month we're celebrating our two-year anniversary with Scalr 2.0!

What made you leave your other job and start Scalr?

Because infrastructure is so hard to do right! And scaling up is so expensive!

How has the product evolved over time?

It started out EC2 only, with workarounds for the lack of EBS. It's now a fully functional system that uses the latest Amazon features like RDS and ELB to scale and manage everything for you.

How do you compare with RightScale and Amazon's offerings?

Think of Scalr as an open source RightScale that focuses on one thing: scaling web applications. Amazon doesn't do much automation, and doesn't help you create a scalable architecture; if there's the slightest chance you might get big, check us out.

Where do you see all this cloud stuff going?

I see some burgeoning traction in hybrid cloud management, but really hope that within 3 years all new applications can be pushed to the intertubes in a single click, auto-managed and ready to scale. We're adding support for every other cloud platform.

7 Secrets to Successfully Scaling with Scalr

I figured that over the years Sebastian has had a lot of experience working with AWS and must have a pocket full of tricks and traps. So I asked him about the major issues to keep in mind when developing on Amazon, and how Scalr helps solve them. Here is Sebastian's response:

Don't hard code addresses

Things move fast in the cloud, especially IP addresses. The address you have now for your database, you might not have tomorrow or next week. This happens either involuntarily, from stopping and starting instances on EC2, or voluntarily, when moving compute and data around different resource pools, such as from EC2 to Rackspace Cloud. Elastic IPs - Amazon's service for binding an IP address to any single server - solve this problem as long as you stay inside Amazon's services. But you can't assign one to a non-EC2 server, so they're no help outside of AWS.

The problem here is that when you hard code your database address as an IP address in your application, and that IP address changes, your application will be sending queries and data to never-never land - not a good thing, you'll agree. To work around this, it is good practice to look up the destination address prior to sending information to it - something DNS was made for.
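As a quick illustration (a minimal sketch, not Scalr's code; the hostname, credentials, and schema are placeholders), connect by hostname and let DNS do the lookup:

    import MySQLdb  # assumption: the MySQL-python client library is installed

    # Brittle: a hardcoded IP breaks as soon as the instance is replaced.
    # conn = MySQLdb.connect(host="10.251.27.14", user="app", passwd="secret", db="app")

    # Robust: connect by hostname; the client resolves it through DNS at
    # connect time, so when the database moves only the DNS record changes.
    conn = MySQLdb.connect(host="db.example.com", user="app", passwd="secret", db="app")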

We've made this easy to do in Scalr. Scalr keeps an updated list of IP addresses for every instance and hostname, in the form of DNS records on a cluster of nameservers. When you send database requests to db.example.com, the client looks up the IP address for that hostname in the DNS entries for that zone, and caches it until the TTL expires.

Moving compute to Rackspace Cloud? Scalr will add entries for your new database at Rackspace, and remove those pointing to EC2 IP addresses.

Read-write scaling goes a long way

With all the talk about sharding and noSQL these days, it's easy to underestimate how far you can go with a master-slave replicated cluster.

A master-slave replicated cluster is a set of multiple databases that sync data in a single direction. The master database is the custodian of all data, and is the one you write to: inserts, deletes, and updates. The slave database replicates data from the master and holds a copy of it. This is the one you read from: select statements. This separation frees up resources on the master, which is often CPU-bound, and lets you make joins again without killing overall performance, since the slave handles the operation, not the master.
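In application code, read/write splitting can be as simple as the following sketch (plain Python with placeholder hostnames and credentials; Scalr's own routing is more involved):

    import MySQLdb  # assumption: the MySQL-python client library is installed

    # Writes go to the master; reads go to a slave. With several slaves you
    # would pick one per query (round-robin or at random) to spread the load.
    master = MySQLdb.connect(host="master.db.example.com", user="app", passwd="secret", db="app")
    slave = MySQLdb.connect(host="slave.db.example.com", user="app", passwd="secret", db="app")

    def execute_write(query, params=()):
        """INSERT / UPDATE / DELETE statements hit the master only."""
        cur = master.cursor()
        cur.execute(query, params)
        master.commit()

    def execute_read(query, params=()):
        """SELECT statements hit a slave, freeing up the master."""
        cur = slave.cursor()
        cur.execute(query, params)
        return cur.fetchall()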

For further scalability and fault tolerance, there's also the option of master-master replication, in which the servers act as both masters and slaves to the others. A nifty trick here for high database availability is to put a master in each of two resource pools, so that if one disappears, the other takes over all traffic.

Scalr auto-configures replication between your database instances in case you don't know how to or don't want to, and adds a layer of protection from failure by automatically detecting failed database servers. In the case of a master database failure, it automatically promotes a slave to master, and reconfigures the other slaves to work with it.

Cron jobs are hard to distribute

Watch out when scaling out instances with cron jobs on them. Cron jobs aren't designed for the cloud. If the machine image holding your cron job scales out to 20 instances, your cron job will be executed 20 times as often.

This is fine if the scope of your cron job is limited to the instance itself, but if the scope is larger, the above becomes a serious problem. And if you single out a machine to run those cron jobs, you run the risk of not having it executed if that machine goes down.

You can work around this using SQS or any distributed queue service, but it's really quite bulky and time-consuming to set up, with no guarantee that the job will be executed on time.

The Apache Software Foundation has a neat tool for distributed lock services, called ZooKeeper. Scalr based its distributed cron jobs on it, so that users can set up scripts to be executed periodically, like cron jobs, without running the risk of multiple executions or failure to execute.
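The underlying idea looks roughly like this (a minimal sketch using the kazoo Python client for ZooKeeper, not Scalr's implementation; the hostname, lock path, and job are made up): every instance attempts the same job, but a distributed lock ensures only one actually runs it.

    from kazoo.client import KazooClient

    def run_nightly_report():
        print("running the job exactly once across the cluster")

    zk = KazooClient(hosts="zk1.example.com:2181")
    zk.start()

    # Every instance tries for the same lock; ZooKeeper grants it to one.
    lock = zk.Lock("/locks/nightly-report", "worker-1")

    if lock.acquire(blocking=False):
        try:
            run_nightly_report()
        finally:
            lock.release()
    # otherwise another instance holds the lock, so this one skips the run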

Dev and Prod should be twins

Differences between development and production environments can cause failures when promoting one to the other. Something as small as a different instance size can cause issues, especially on EC2, where the small instances are 32-bit and the large ones 64-bit.

Traditionally, companies run development and QA on dedicated servers located on premises. Since these are only used occasionally, for example a couple of weeks before a release, companies will usually save on costs by getting only one server per type - especially when the architecture requires non-permissive software licenses; buying extra Oracle licenses for a few weeks a year can be wasteful.

In the cloud, things are not so. Since you pay as you go, you can get servers for just the period you use them. If you are using Scalr, this is even easier: just clone your production server farm and name it QA. Use it for 8 hours Monday, then shut it down until Tuesday morning. You have a perfect copy of your production environment (same architecture, systems, and data), at a daily cost often under the price of lunch.

Better yet, make a snapshot of your database, and run Dev and QA with the same data as production. Use Apache Bench, and you can even have the same load!

Learn from the community

It's hard enough to stay up to date on the latest infrastructure tricks, from Cassandra and memcached to Varnish and nginx; I can't imagine the amount of time it would take to learn it all from scratch.

While there are many frameworks (like the Scalr project we build) and platforms (Heroku and Google App Engine) that either make it easier to scale or abstract everything away, it's healthy to have some knowledge of the underlying mechanics. I have found that reading up on the OSI model of computer networking, tradeoffs in caching, Eric Brewer's CAP theorem, and the concept of eventual consistency has been time well spent.

I/O will be your bottleneck

Disk I/O will most likely bottleneck your growing application, and RAID over EBS quickly saturates network I/O (the data has to go in, then out, consuming twice the bandwidth). Use an in-memory database where it makes sense, and cache everything! We have found memcached to be easy enough to use, with dramatic effects on database load.
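For instance, a cache-aside lookup with memcached looks roughly like this (a minimal sketch using the python-memcached client; hostnames, credentials, and the users table are made up):

    import memcache  # assumption: the python-memcached client library
    import MySQLdb   # hostnames, credentials, and schema are placeholders

    mc = memcache.Client(["cache1.example.com:11211"])
    db = MySQLdb.connect(host="db.example.com", user="app", passwd="secret", db="app")

    def get_user(user_id):
        key = "user:%d" % user_id
        row = mc.get(key)
        if row is None:                 # cache miss: query the database once...
            cur = db.cursor()
            cur.execute("SELECT name, email FROM users WHERE id = %s", (user_id,))
            row = cur.fetchone()
            mc.set(key, row, time=300)  # ...then cache the row for 5 minutes
        return row

Every cache hit is a query the master never sees, which is where the dramatic reduction in database load comes from.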

It is hard to work around ephemeral storage

If you are using local disk for data storage to get better I/O, you risk losing data to an instance malfunction. The loss might be limited to the replication lag, or to the time since the last snapshot or database dump, but it still makes recovery tedious and menial.

We've addressed this in projects like Scalr, where slave servers are automatically promoted to master in the event of a failure. There is, however, a tradeoff to be made between downtime and data loss, and we default to accepting some data loss in exchange for less downtime.

Related Articles

  1. Open sourcers fortify Ubuntu's Koala food - Scalr adds vitamins to Eucalyptus by Cade Metz
  2. Scalr on Wikipedia
  3. Scalr Blog
  4. ZooKeeper - A Reliable, Scalable Distributed Coordination System
  5. Scalr Source Code on Google Code