Urbandive is an immersive view service launched by the French YellowPages that lets you travel through French cities in 360° views. Urbandive focuses on providing high-definition pictures and accurate professional and social content. One of the biggest jobs was to build an architecture that scales quickly, because it was really difficult to forecast the traffic load the service would face in production: traffic can jump whenever advertising draws users' attention to the service.
Workflow & XTR-Lucid
Our scalability combo is a home-made Ruby scheduler (XTR-Lucid) that deals with the AWS APIs, plus the Puppet Master, which installs services, configures EC2 instances and keeps them up to date for as long as they are in production. Together they give us full automation.
Here is the workflow of our automation tool for the creation step (there are other workflows for stop, reboot, health check, …). In the « create » workflow, the dashboard lets you select a template (which contains the AMI id, instance type, availability zone, key, list of security groups, list of EBS volumes, created from snapshots or not, …) and set a name for the instance.
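As a rough illustration, a template of this kind could be modelled as a small Ruby structure. The field names, the validation rule and the sample values below are assumptions made for the example, not XTR-Lucid's actual format.

```ruby
# Hypothetical sketch of an instance template. Field names and values are
# illustrative only and are not taken from XTR-Lucid.
InstanceTemplate = Struct.new(
  :ami_id, :instance_type, :availability_zone,
  :key_name, :security_groups, :ebs_volumes,
  keyword_init: true
) do
  # Returns the mandatory fields that are missing (empty array when valid).
  def missing_fields
    %i[ami_id instance_type availability_zone key_name].select do |field|
      value = self[field]
      value.nil? || value.to_s.strip.empty?
    end
  end

  # Builds the data for a "create" action on a named instance.
  def create_request(name)
    raise ArgumentError, "instance name is required" if name.to_s.strip.empty?

    missing = missing_fields
    raise ArgumentError, "template is missing #{missing.join(', ')}" unless missing.empty?

    to_h.merge(name: name, action: "create")
  end
end

web_front = InstanceTemplate.new(
  ami_id: "ami-00000000",              # placeholder id
  instance_type: "m1.large",
  availability_zone: "eu-west-1a",
  key_name: "ops-key",
  security_groups: ["web", "ssh-admin"],
  ebs_volumes: [
    { size_gb: 50, snapshot_id: nil },              # empty volume
    { size_gb: 100, snapshot_id: "snap-00000000" }  # restored from a snapshot
  ]
)

request = web_front.create_request("web-front-07")
puts request[:name]
```

Keeping the template as plain data means the dashboard only has to supply a name; everything else comes from the template.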
This diagram gives the general idea of how it works.
In a bit more detail, the workflow runs as follows:
- The scheduler can be outside or inside of AWS.
- First, an action file is put into the TODO directory, either by the web application (a simple rhtml dashboard) or directly by the monitoring tool. For files posted by the monitoring tool, we will have to define thresholds after a few weeks of use. For now we do not have enough data, as we have only just gone into production.
- The action file is then processed by the scheduler :
- Connection to AWS and request to start an EC2 instance from the AMI.
- The scheduler checks that the instance is running, that the EBS (Elastic Block Store) volumes are available and then in use, and finally that the EC2 TCP stack is up and SSH is OK. Then it connects to the brand-new instance.
- Puppet client is installed and started, so it sends a certificate request to the master.
- Puppet client is stopped to avoid any interaction.
- Connection to the PuppetMaster : update the main files (nodes files of PuppetMaster, « /etc/hosts » file, roles file of Capistrano) and then accept the certificate request of the client.
- Capistrano updates the « /etc/hosts » file on all instances of our infrastructure (see technical note 1).
- Connection to the new instance.
- Puppet client is started.
- Puppet client connects to the master and authenticates. Then it gets what it needs to install and configure the new instance.
- The installation runs on the new instance, which will soon be available.
- The scheduler doesn’t wait until the instance is fully configured to end the task : installation by Puppet is asynchronous. The scheduler just ends by updating its repository and the monitoring tool.
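The readiness checks in the steps above (instance running, EBS volumes in use, SSH reachable) boil down to polling with a timeout. Here is a minimal sketch of that idea in Ruby. The helper names are hypothetical, and the lambdas stand in for the AWS API calls, which are not shown.

```ruby
require "socket"

# Calls the block every `interval` seconds until it returns a truthy value.
# Returns true on success, false once `timeout` seconds have elapsed.
def wait_until(timeout:, interval: 5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# True when a TCP connection to host:port can be opened (for example SSH on 22).
def port_open?(host, port, connect_timeout: 2)
  Socket.tcp(host, port, connect_timeout: connect_timeout) { true }
rescue SystemCallError, SocketError
  false
end

# Runs the checks in order; the timeout applies to each check separately.
# Returns nil when all checks pass, or the label of the first one that times out.
# `instance_state` and `volume_states` are assumed to wrap AWS API calls.
# Only the final "in-use" volume state is checked in this sketch.
def wait_for_instance(instance_state:, volume_states:, ssh_ready:, timeout: 300, interval: 5)
  checks = [
    ["instance running",   -> { instance_state.call == "running" }],
    ["EBS volumes in use", -> { volume_states.call.all? { |state| state == "in-use" } }],
    ["SSH reachable",      ssh_ready]
  ]
  checks.each do |label, check|
    return label unless wait_until(timeout: timeout, interval: interval, &check)
  end
  nil
end

# Example call (the host and the lambdas are placeholders):
# failed = wait_for_instance(
#   instance_state: -> { "running" },
#   volume_states:  -> { ["in-use"] },
#   ssh_ready:      -> { port_open?("10.0.0.11", 22) }
# )
```

Returning the label of the failed check lets the caller log which step blocked the creation before giving up.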
Technical note 1: we use Capistrano mainly for application deployment. In the case above (create instance), we use it to update the « /etc/hosts » file on all instances of our infrastructure as soon as an EC2 instance comes up (or goes down, since stop instance uses it the same way), without having to trigger (with puppetrun) a simultaneous « pull configuration » on all the Puppet clients (bad idea ;ob). That’s just a little tip for this workflow.
Technical note 2: the Ruby scheduler works asynchronously, both for security reasons and so that other tools (such as a monitoring tool) can call it. The web application creates action files in the TODO directory, which the monitoring tool can also fill for auto-scaling. A cron job checks this directory every minute and calls the heart of the automation tool.
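To make the « /etc/hosts » tip more concrete, here is a small sketch of how the file content could be generated from the scheduler's repository before Capistrano pushes it out. The shape of the repository entries (`:name` and `:ip`) is an assumption, and the Capistrano task itself is not shown.

```ruby
# Builds the content of an /etc/hosts file from repository entries.
# Each entry is a hash with :ip and :name keys (a hypothetical shape).
def render_hosts(instances)
  lines = ["127.0.0.1\tlocalhost"]
  instances.sort_by { |instance| instance[:name] }.each do |instance|
    lines << "#{instance[:ip]}\t#{instance[:name]}"
  end
  lines.join("\n") + "\n"
end

# Sample data; names and addresses are placeholders.
repository = [
  { name: "web-02", ip: "10.0.0.12" },
  { name: "web-01", ip: "10.0.0.11" }
]
puts render_hosts(repository)
```

Sorting by name keeps the generated file stable, so pushing it twice with the same repository produces no difference on the instances.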
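The TODO directory pattern from technical note 2 can be sketched as follows. This is only an illustration: the JSON file format, the DONE and FAILED directories and the script path in the cron line are assumptions, not a description of XTR-Lucid itself.

```ruby
require "fileutils"
require "json"
require "tmpdir"

# A cron entry such as the following would run the processing every minute
# (the script path is hypothetical):
#   * * * * * ruby /opt/scheduler/process_todo.rb

# Processes every action file found in todo_dir, in file name order.
# The block receives the parsed action; a file is moved to done_dir when the
# block succeeds and to failed_dir when parsing or the block raises.
def process_todo(todo_dir, done_dir, failed_dir)
  results = []
  Dir.glob(File.join(todo_dir, "*.json")).sort.each do |path|
    begin
      action = JSON.parse(File.read(path))
      yield action
      target = done_dir
    rescue StandardError => e
      warn "action #{File.basename(path)} failed: #{e.message}"
      target = failed_dir
    end
    FileUtils.mkdir_p(target)
    FileUtils.mv(path, File.join(target, File.basename(path)))
    results << [File.basename(path), target == done_dir ? :done : :failed]
  end
  results
end

# Small demonstration in a temporary directory.
Dir.mktmpdir do |root|
  todo = File.join(root, "TODO")
  FileUtils.mkdir_p(todo)
  File.write(File.join(todo, "001-create.json"),
             JSON.generate("action" => "create", "name" => "web-front-07"))

  process_todo(todo, File.join(root, "DONE"), File.join(root, "FAILED")) do |action|
    puts "would run #{action['action']} for #{action['name']}"
  end
end
```

Because the dashboard and the monitoring tool only write files, neither needs AWS credentials; only the scheduler does.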
- We currently have 40 to 50 servers. That’s our base, but we will probably go higher when marketing campaigns bring in traffic.
- We use Ubuntu Lucid Lynx (from a standard AMI provided by Canonical).
- We run the PuppetMaster on Apache with Phusion Passenger to improve performance.
- We took Puppet (standard version) right from the Ubuntu repositories packaged in the AMI.
Why use Puppet (especially on AWS)?
Consistency & Flexibility
We have multiple services, so we need to be sure that each group of servers has a consistent configuration. If we need to push a quick change to all nodes of a given type at a given time, we can even push the new setting from the PuppetMaster (to be more accurate, trigger a « pull » with puppetrun) without waiting for the Puppet clients' next pull (they pull every 30 minutes), and we are sure that it reaches every node that needs it.
AWS offers AMIs to « snapshot » an instance and clone it as many times as we want. This is very good for development, load tests, and so on. But once we are in production, we cannot build another AMI and redeploy EC2 instances from it each time we change a setting. We need something flexible to patch our configurations quickly while keeping them consistent at all times.
Scalability & Ease
This is a new Internet service that can see a load peak at any time, depending on marketing campaigns. But we have to keep costs under control too. So we keep a base of servers and must be able to start fully configured, up-to-date instances within a few minutes. We achieve this with a home-made scheduler based on Ruby scripts (XTR-Lucid) that deals with the AWS APIs to launch/stop EC2 instances, add EBS volumes (empty or from a snapshot), add/remove instances to/from the ELB, add/delete the new/old host to/from our monitoring tool, … EC2 instances are tracked in a repository on our Ruby scheduler. Our tool then lets Puppet take over, which ensures that every brand-new instance started from a standard Lucid Lynx AMI has all its services, configurations and security rules installed and applied. So with the feedback of our monitoring tool, the infrastructure can grow and shrink on its own (depending on the thresholds set), or we can deploy and stop instances in just one click on our web dashboard.
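The threshold-driven growing and shrinking described here can be pictured with a small decision function. This is a sketch under assumptions: the metric, the threshold values and the instance limits are invented for the example, since the real thresholds had not yet been defined.

```ruby
# Hypothetical scaling policy: compares a load metric against two thresholds
# and never goes below the base or above a ceiling.
ScalingPolicy = Struct.new(
  :scale_up_above, :scale_down_below, :min_instances, :max_instances,
  keyword_init: true
) do
  # Returns :create, :stop or :none for the current metric and instance count.
  def decide(metric:, running:)
    return :create if metric > scale_up_above && running < max_instances
    return :stop if metric < scale_down_below && running > min_instances

    :none
  end
end

# Sample values only; 40 reflects the base mentioned in this post, the rest
# are placeholders.
policy = ScalingPolicy.new(
  scale_up_above: 0.75, scale_down_below: 0.25,
  min_instances: 40, max_instances: 80
)
puts policy.decide(metric: 0.9, running: 45)
```

Leaving a gap between the two thresholds avoids starting and stopping instances repeatedly when the load hovers around a single value.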
In such a project, there are a lot of different services for the Ops team to monitor (And… Yes! An Ops team is always essential, even with an automated infrastructure! :o)). So it is important to capture knowledge in the Puppet descriptors and in the XTR-Lucid repository and templates, so that by simply reading them the whole team learns everything it needs to know at a glance (no need to look everywhere). Things become a lot easier.
Because of the diversity of services run in the project, maybe not all of them will fit AWS in the long term (over one year). Once we have real-life feedback on traffic, some services will stay on AWS and others may migrate to our physical datacenter. That migration will be easier because all the configuration is concentrated in Puppet descriptors. This lets us easily redeploy a brand-new infrastructure on another platform (even if it is not in the Cloud).
I hope you enjoyed this. You can find more information on how the Ops job is evolving, based on this experience (among others), here: Solutions Linux / Open Source 2011 – Le métier de l’Administration Système avec le Cloud Computing (FR). These slides are from a talk I gave at Solutions Linux 2011 (France).