A man had a dream. His dream was to blend a bunch of RSS/Atom/RDF feeds into a single feed. The man is Beau Lebens of Feedville and, like most dreamers, he was a little short on coin. So he took refuge in the home of a cheap hosting provider and realized his dream, creating FEEDblendr. But FEEDblendr chewed up so much CPU creating blended feeds that the cheap hosting provider ordered Beau to find another home. Where was Beau to go? He eventually found a new home in the virtual machine room of Amazon's EC2. This is the story of how Beau was finally able to create his one feed, safe within the cradle of affordable CPU cycles.
Site: http://feedblendr.com/
The Platform
EC2 (Fedora Core 6 Lite distro)
S3
Apache
PHP
MySQL
DynDNS (for round robin DNS)
The Stats
Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr.
FEEDblendr uses 2 EC2 instances. The same Amazon Machine Image (AMI) is used for both instances.
Over 10,000 blends have been created, containing over 45,000 source feeds.
Approximately 30 blends are created per day. Processors on the 2 instances are pegged pretty high (load averages of ~10-20 most of the time).
The Architecture
Round robin DNS is used to load balance between instances.
- The DNS is updated by hand, and only after an instance has been validated to work correctly.
- Instances seem to be more stable now than they were in the past, but you must still assume they can be lost at any time and no data will be persisted between reboots.
The database is still hosted on an external service because EC2 does not have a decent persistent storage system.
The AMI is kept as minimal as possible. It is a clean instance with some auto-deployment code to load the application off of S3. This means you don't have to create new instances for every software release.
The deployment process is:
- Software is developed on a laptop and stored in subversion.
- A makefile is used to get a revision, fix permissions etc, package and push to S3.
- When the AMI launches it runs a script to grab the software package from S3.
- The package is unpacked and a specific script inside is executed to continue the installation process.
- Configuration files for Apache, PHP, etc are updated.
- Server-specific permissions, symlinks etc are fixed up.
- Apache is restarted and email is sent with the IP of that machine.
- Then the DNS is updated by hand with the new IP address.
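The boot-time half of that process (grab the package, unpack it, hand off to the script inside) might look roughly like the sketch below. It is not FEEDblendr's actual script: the `fetch` callable stands in for whatever S3 download is used, and the `install.sh` name is an assumption made for illustration.

```python
import subprocess
import tarfile
import tempfile
from pathlib import Path
from typing import Callable


def deploy_package(fetch: Callable[[Path], None],
                   install_script: str = "install.sh") -> Path:
    """Fetch a release tarball, unpack it, and run the script shipped inside.

    `fetch` is a hypothetical callable that downloads the package from S3
    to the given path. `install.sh` is an assumed script name.
    """
    workdir = Path(tempfile.mkdtemp(prefix="deploy-"))
    archive = workdir / "package.tar.gz"
    fetch(archive)  # e.g. an S3 GET, injected so the sketch runs offline

    release_dir = workdir / "release"
    release_dir.mkdir()
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(release_dir)

    # The packaged script continues the installation: in the article's
    # setup it would update Apache/PHP config, fix permissions and
    # symlinks, restart Apache, and send the notification email.
    subprocess.run(["sh", str(release_dir / install_script)],
                   cwd=release_dir, check=True)
    return release_dir
```

Keeping the install logic inside the package rather than in the AMI is what lets the image stay minimal: a new release only changes what sits in S3.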
Feeds are intelligently cached independently on each instance. This is to reduce the costly polling for feeds as much as possible. S3 was tried as a common feed cache for both instances, but it was too slow. Perhaps feeds could be written to each instance so they would be cached on each machine?
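The article does not say how the per-instance cache decides what to keep, so the following is only a minimal sketch of one common approach: a local time-to-live cache that re-polls a feed only once its cached copy has expired. The `FeedCache` class, the TTL policy, and the `fetch` callable are all assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class FeedCache:
    """In-memory, per-instance feed cache with a fixed time-to-live."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str, fetch: Callable[[str], bytes]) -> bytes:
        """Return the cached body for `url`, polling only when stale."""
        now = self.clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: skip the costly poll
        body = fetch(url)
        self._entries[url] = (now, body)
        return body
```

Because each instance holds its own copy, the same feed may be polled once per instance, which is the trade-off the article accepts in exchange for avoiding slow S3 round trips.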
Lessons Learned
A low budget startup can effectively bootstrap using EC2 and S3.
For the budget conscious the free ZoneEdit service might work just as well as the $50/year DynDNS service (which works fine).
Round robin load balancing is slow and unreliable. Even with a short TTL for the DNS, some systems hold on to the IP addresses for a long time, so traffic is not balanced onto new machines.
Many problems exist with RSS implementations that keep feeds from being effectively blended. A lot of CPU is spent reading and blending feeds unnecessarily because there's no reliable cross-implementation way to tell whether a feed has really changed.
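A typical mitigation, sketched here and not taken from FEEDblendr, is to send HTTP conditional-request headers (`If-None-Match`, `If-Modified-Since`) when the server supplied validators, and fall back to hashing the body when it did not. The `FeedState` record and both function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class FeedState:
    """What we remember about a feed between polls."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    body_hash: Optional[str] = None


def conditional_headers(state: FeedState) -> Dict[str, str]:
    """Build request headers that let the server answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers


def feed_changed(state: FeedState, status: int,
                 headers: Mapping[str, str], body: bytes) -> bool:
    """Decide whether a poll returned new content, updating `state`.

    Assumes `headers` uses the canonical "ETag" / "Last-Modified" keys;
    a real HTTP client would normally give a case-insensitive mapping.
    """
    if status == 304:
        return False  # server confirmed nothing changed
    new_hash = hashlib.sha256(body).hexdigest()
    changed = new_hash != state.body_hash
    state.etag = headers.get("ETag")
    state.last_modified = headers.get("Last-Modified")
    state.body_hash = new_hash
    return changed
```

The hash fallback is imperfect in exactly the way the article complains about: a feed that embeds a fresh timestamp on every request will hash differently each time even though no item has changed.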
It's really a big mindset change to consider that your instances can go away at any time. You have to change your architecture and design to live with this fact. But once you internalize this model, most problems can be solved.
EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be.
Use the AMI's ability to be passed a parameter to select which configuration to load from S3. This allows you to test different configurations without moving/deleting the current active one.
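As a rough illustration, the parameter could be a line of text passed to the instance at launch and read by the boot script before it fetches anything. The `config=` format, the `production` default, and the `releases/<name>/package.tar.gz` key layout below are all assumptions, not details from the article.

```python
def select_package_key(user_data: str,
                       default_config: str = "production") -> str:
    """Pick the S3 key of the package to load from launch-time text.

    Looks for a line such as "config=staging"; anything else is ignored.
    The key layout is hypothetical.
    """
    config = default_config
    for line in user_data.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "config" and value.strip():
            config = value.strip()
    return f"releases/{config}/package.tar.gz"
```

Launching one instance with `config=staging` would then exercise a new package while every instance started without a parameter keeps loading the active one.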
Create an automated test system to validate an instance as it boots. Then automatically update the DNS if the tests pass. This makes it easy to create new instances and takes the slow human out of the loop.
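One way to structure such a system, sketched under the assumption that the checks and the DNS update are supplied as plain callables (both hypothetical here), is to run every check and publish the instance's IP only when none of them fail.

```python
from typing import Callable, Iterable, List, Tuple

# A named check: returns True when the instance looks healthy.
Check = Tuple[str, Callable[[], bool]]


def validate_and_publish(checks: Iterable[Check],
                         publish_ip: Callable[[str], None],
                         ip: str) -> List[str]:
    """Run all boot checks; update DNS only if every one passes.

    `publish_ip` stands in for whatever call updates the DNS provider.
    Returns the names of the failed checks (empty list on success).
    """
    failures: List[str] = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(name)
    if not failures:
        publish_ip(ip)
    return failures
```

Returning the list of failed checks gives the boot script something concrete to put in the notification email when an instance is not added to the rotation.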
Always load software from S3. The last thing you want happening is your instance loading, and for some reason not being able to contact your SVN server, and thus failing to load properly. Putting it in S3 virtually eliminates the chances of this occurring, because it's on the same network.