Strategy: Cache Application Start State to Reduce Spin-up Times

Using this strategy, Valyala, a commenter on Are Long VM Instance Spin-Up Times In The Cloud Costing You Money?, was able to reduce their GAE application start-up time from 15 seconds to 1.5 seconds:

Spin-up time for newly added Google AppEngine instances can be reduced using initial state caching. Usually the majority of spin-up time for the newly created GAE instance is spent in the pre-populating of the initial state, which is created from many data pieces loaded from slow data sources such as GAE's datastore. If the initial state is identical among GAE instances, then the entire state can be serialized and stored in a shared memory (either in the memcache or in the datastore) by the first created instance, so newly created instances could load and quickly unserialize the state from a single blob loaded from shared memory instead of spending a lot of time for creation of the state from multiple data pieces loaded from the datastore.
I reduced spin-up time for new instances of my GAE application from 15 seconds to 1.5 seconds using this technique.
Theoretically the same approach could be used for VM-powered clouds such as Amazon EC2, if the cloud were able to fork() new VMs from a given initial state. Application developers could then boot and pre-configure the required services in a 'golden' VM, which would be stored as a snapshot somewhere in shared memory. The snapshot would be used for fast fork()'ing of new VMs. A VM fork() can be much faster than a cold boot of a new VM with the required services.

As another commenter noted, GAE now has an Always On feature, which keeps three instances of your app running, but the rub is that you have to pay for the resources those instances use. The state-caching approach minimizes costs and works across different types of infrastructure.
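
The caching technique described in the comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Valyala's actual code: SharedCache is a hypothetical in-process stand-in for memcache or the datastore, and STATE_KEY, build_state_slowly, and load_initial_state are names invented for this example.

```python
import pickle


class SharedCache:
    """Hypothetical stand-in for memcache or the datastore: a key -> bytes store."""

    def __init__(self):
        self._blobs = {}

    def get(self, key):
        return self._blobs.get(key)

    def set(self, key, blob):
        self._blobs[key] = blob


# Hypothetical key; changing it when the state format changes avoids
# loading a stale blob written by an older version of the application.
STATE_KEY = "app-initial-state-v1"


def build_state_slowly(load_piece, piece_ids):
    """Slow path: assemble the state from many individual loads."""
    return {piece_id: load_piece(piece_id) for piece_id in piece_ids}


def load_initial_state(cache, load_piece, piece_ids):
    """Return the initial state, preferring the single cached blob."""
    blob = cache.get(STATE_KEY)
    if blob is not None:
        # Fast path: one fetch and one deserialization.
        return pickle.loads(blob)
    # Slow path: the first instance (or any instance after a cache
    # eviction) builds the state and publishes it for later instances.
    state = build_state_slowly(load_piece, piece_ids)
    cache.set(STATE_KEY, pickle.dumps(state))
    return state
```

The slow path stays in place because a cache such as memcache can evict the blob at any time, so every instance must still be able to rebuild the state on its own. Note that pickle should only be used to deserialize blobs from a store you trust.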

I've successfully used similar approaches for automatically starting, configuring, and initializing in-memory objects across a cluster. In this architecture:

  • Each object has an ID that is mapped to a bag of attributes. Some of those attributes are configuration attributes, some are events, alarms, and dynamic attributes for holding current state.
  • On each node a software system is in charge of figuring out which objects are assigned to which nodes, creating all those objects, and running each object through a startup state machine which includes the object retrieving its state from the database and performing any other required initialization.
  • When all objects have moved to a ready state, the node itself is considered ready for service. The node's status is sent to all other nodes, which then know they can use that node for service.
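
The bring-up sequence in the list above might look roughly like the following Python sketch. It is a simplified, single-process illustration rather than the original system: the State values, ManagedObject, Node, and the dict used as a database are all assumptions made for the example, and the peer notification is a direct method call where a real cluster would send a message.

```python
from enum import Enum


class State(Enum):
    CREATED = 1
    LOADING = 2
    INITIALIZING = 3
    READY = 4


class ManagedObject:
    """An object ID mapped to a bag of attributes."""

    def __init__(self, object_id):
        self.object_id = object_id
        self.attributes = {}
        self.state = State.CREATED

    def start(self, database):
        """Run the startup state machine for this object."""
        self.state = State.LOADING
        # Retrieve this object's state from the (hypothetical) database.
        self.attributes = dict(database.get(self.object_id, {}))
        self.state = State.INITIALIZING
        # Any other required initialization would happen here.
        self.state = State.READY


class Node:
    def __init__(self, name, assignments, database):
        self.name = name
        self.database = database
        # Create only the objects assigned to this node.
        self.objects = [
            ManagedObject(object_id)
            for object_id, owner in assignments.items()
            if owner == name
        ]
        self.ready_peers = set()

    def is_ready(self):
        return all(obj.state is State.READY for obj in self.objects)

    def bring_up(self, peers):
        for obj in self.objects:
            obj.start(self.database)
        if self.is_ready():
            # Tell every other node that this node can take traffic.
            for peer in peers:
                peer.on_node_ready(self.name)

    def on_node_ready(self, node_name):
        self.ready_peers.add(node_name)
```

For example, with assignments = {"obj-1": "node-a", "obj-2": "node-b"}, calling bring_up on node-a starts only obj-1 and then records node-a in node-b's ready_peers set.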

This works great. It minimizes the burden on the application programmer, makes node bring-up fast and easy, and feeds directly into an automatic replication and fail-over system.