Feedblendr Architecture - Using EC2 to Scale

A man had a dream. His dream was to blend a bunch of RSS/Atom/RDF feeds into a single feed. The man is Beau Lebens of Feedville and, like most dreamers, he was a little short on coin. So he took refuge in the home of a cheap hosting provider, where he realized his dream by creating FEEDblendr. But FEEDblendr chewed up so much CPU creating blended feeds that the cheap hosting provider ordered Beau to find another home. Where was Beau to go? He eventually found a new home in the virtual machine room of Amazon's EC2. This is the story of how Beau was finally able to create his one feed, safe within the cradle of affordable CPU cycles.

The Platform

  • EC2 (Fedora Core 6 Lite distro)
  • S3
  • Apache
  • PHP
  • MySQL
  • DynDNS (for round robin DNS)

    The Stats

  • Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr.
  • FEEDblendr uses 2 EC2 instances. The same Amazon Machine Image (AMI) is used for both instances.
  • Over 10,000 blends have been created, containing over 45,000 source feeds.
  • Approximately 30 blends are created per day. Processors on the 2 instances are pegged pretty high (load averages of ~10-20 most of the time).

    The Architecture

  • Round robin DNS is used to load balance between instances. The DNS is updated by hand, because an instance is validated to work correctly before it is added to the DNS. Instances seem to be more stable now than they were in the past, but you must still assume they can be lost at any time and that no data will be persisted between reboots.
  • The database is still hosted on an external service because EC2 does not have a decent persistent storage system.
  • The AMI is kept as minimal as possible. It is a clean instance with some auto-deployment code to load the application off of S3. This means you don't have to create new instances for every software release.
  • The deployment process is:
      - Software is developed on a laptop and stored in subversion.
      - A makefile is used to get a revision, fix permissions etc, package and push to S3.
      - When the AMI launches it runs a script to grab the software package from S3.
      - The package is unpacked and a specific script inside is executed to continue the installation process.
      - Configuration files for Apache, PHP, etc are updated.
      - Server-specific permissions, symlinks etc are fixed up.
      - Apache is restarted and email is sent with the IP of that machine. Then the DNS is updated by hand with the new IP address.
  • Feeds are intelligently cached independently on each instance. This reduces the costly polling for feeds as much as possible. S3 was tried as a common feed cache for both instances, but it was too slow. Perhaps feeds could be written to each instance so they would be cached on each machine?
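The per-instance feed caching described above can be sketched roughly as follows. This is a minimal illustration in Python, not FEEDblendr's actual PHP code: the class name `LocalFeedCache`, the `fetch` callable, and the TTL-based freshness rule are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class LocalFeedCache:
    """Per-instance cache: a feed is re-polled only after its TTL expires."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # hypothetical function that downloads a feed
        self._ttl = ttl_seconds    # assumed freshness window, not from the article
        self._clock = clock        # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # still fresh: no network poll
        body = self._fetch(url)    # stale or missing: poll the source feed
        self._entries[url] = (now, body)
        return body
```

Because each instance keeps its own dictionary, two instances may poll the same feed once each, which is the trade-off the article accepts after finding S3 too slow as a shared cache.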

    Lessons Learned

  • A low budget startup can effectively bootstrap using EC2 and S3.
  • For the budget conscious the free ZoneEdit service might work just as well as the $50/year DynDNS service (which works fine).
  • Round robin load balancing is slow and unreliable. Even with a short TTL for the DNS, some systems hold on to the IP address for a long time, so new machines are not load balanced to.
  • Many problems exist with RSS implementations that keep feeds from being effectively blended. A lot of CPU is spent reading and blending feeds unnecessarily because there's no reliable cross-implementation way to tell whether a feed has really changed.
  • It's really a big mindset change to consider that your instances can go away at any time. You have to change your architecture and design to live with this fact. But once you internalize this model, most problems can be solved.
  • EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be.
  • Use the AMI's ability to be passed a parameter to select which configuration to load from S3. This allows you to test different configurations without moving/deleting the current active one.
  • Create an automated test system to validate an instance as it boots. Then automatically update the DNS if the tests pass. This makes it easy to create new instances and takes the slow human out of the loop.
  • Always load software from S3. The last thing you want happening is your instance loading, and for some reason not being able to contact your SVN server, and thus failing to load properly. Putting it in S3 virtually eliminates the chances of this occurring, because it's on the same network.
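The "validate on boot, then update DNS" lesson above could look something like the following Python sketch. It is only an illustration under stated assumptions: the individual checks and the `update_dns` hook are hypothetical callables, and no real DNS provider API is shown.

```python
from typing import Callable, Iterable, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def validate_instance(checks: Iterable[Check]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False             # a crashing check counts as a failure
        if not ok:
            failed.append(name)
    return failed


def register_if_healthy(ip: str, checks: Iterable[Check],
                        update_dns: Callable[[str], None]) -> bool:
    """Add the instance to DNS only when all checks pass."""
    if validate_instance(checks):
        return False               # at least one check failed: stay out of rotation
    update_dns(ip)                 # hypothetical hook into the DNS provider
    return True
```

A boot script might pass checks such as "Apache answers on port 80" or "a test blend renders", so that only a working instance ever receives traffic.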

    Related Articles

  • What is a 'River of News' style aggregator?
  • Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services

  • Tuesday

    Database parallelism choices greatly impact scalability

    Sam Madden in The Database Column blog covers some database architectures. Quick summary:

  • Shared-memory systems don't scale well as the shared bus becomes the bottleneck
  • Shared-disk systems don't scale well either
  • Shared-nothing scales the best
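As a rough illustration of the shared-nothing idea, here is a minimal Python sketch (not taken from Sam Madden's post) in which each node owns a disjoint slice of the keys, so no bus or disk is shared between nodes. The names `node_for_key` and `ShardedStore` are hypothetical, and the in-memory dictionaries merely stand in for separate machines.

```python
import hashlib
from typing import Dict, List, Optional


def node_for_key(key: str, node_count: int) -> int:
    """Map a key to one node using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


class ShardedStore:
    def __init__(self, node_count: int) -> None:
        # Each dict stands in for one node with its own private memory and disk.
        self.nodes: List[Dict[str, str]] = [{} for _ in range(node_count)]

    def put(self, key: str, value: str) -> None:
        self.nodes[node_for_key(key, len(self.nodes))][key] = value

    def get(self, key: str) -> Optional[str]:
        return self.nodes[node_for_key(key, len(self.nodes))].get(key)
```

Every read or write touches exactly one node, which is why adding nodes adds capacity. Simple modulo placement like this does force most keys to move when the node count changes, so real systems often use other placement schemes.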

  • Sunday

    Scaling Early Stage Startups

    Mark Maunder of No VC Required--who advocates not taking VC money lest you be turned into a frog instead of the prince (or princess) you were dreaming of--has an excellent slide deck on how to scale an early stage startup. His blog also has some good SEO tips and a very spooky widget showing the geographical location of his readers. Perfect for Halloween! What are Mark's otherworldly scaling strategies for startups?

    Information Sources

  • Slides from Seattle Tech Startup Talk.
  • Scaling Early Stage Startups blog post by Mark Maunder.

    The Platform

  • Linux
  • An ISAM type data store.
  • Perl
  • Httperf is used for benchmarking.
  • is used for perf monitoring.

    The Architecture

  • Performance matters because being slow could cost you 20% of your revenue. The UIE guys disagree saying this ain't necessarily so. They explain their reasoning in Usability Tools Podcast: The Truth About Page Download Time. The idea is: "There was still another surprising finding from our study: a strong correlation between perceived download time and whether users successfully completed their tasks on a site. There was, however, no correlation between actual download time and task success, causing us to discard our original hypothesis. It seems that, when people accomplish what they set out to do on a site, they perceive that site to be fast." So it might be a better use of time to improve the front-end rather than the back-end.
  • MySQL was dumped because of performance problems: it didn't handle a high number of writes and deletes on large tables, writes blow away the query cache, large numbers of small tables (over 10,000) are not well supported, it uses a lot of memory to cache indexes, and it maxed out at 200 concurrent read/write queries per second with over 1 million records.
  • For data storage they evolved to a fixed-length, ISAM-like record scheme that allows seeking directly to the data. It still uses file-level locking and it's benchmarked at 20,000+ concurrent reads/writes/deletes. They are considering moving to BerkeleyDB, which performs very well and is used by many large websites, especially when you primarily need key-value type lookups. I think it might be interesting to store JSON if a lot of this data ends up being displayed on the web page.
  • Moved to httpd.prefork for Perl. That, with keepalive turned off on the application servers, uses less RAM and works well.
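The fixed-length record scheme mentioned above works because a record's byte offset is simply its slot number times the record size, so a lookup is one seek and one read. Here is a minimal Python sketch of that idea. The record layout (id, score, 20-byte name) is invented for illustration, the original system was written in Perl, and the file-level locking it used is left out.

```python
import struct
from typing import BinaryIO, Tuple

# Hypothetical layout: 4-byte unsigned id, 8-byte float score, 20-byte name.
# "<" means little-endian with no padding, so every record is exactly 32 bytes.
RECORD = struct.Struct("<Id20s")


def write_record(f: BinaryIO, slot: int, rec_id: int, score: float, name: str) -> None:
    """Overwrite the record stored in the given slot."""
    f.seek(slot * RECORD.size)                 # offset = slot number * record size
    # Names shorter than 20 bytes are NUL-padded; longer ones are truncated.
    f.write(RECORD.pack(rec_id, score, name.encode("utf-8")))


def read_record(f: BinaryIO, slot: int) -> Tuple[int, float, str]:
    """Seek directly to a slot and decode its record."""
    f.seek(slot * RECORD.size)
    rec_id, score, raw = RECORD.unpack(f.read(RECORD.size))
    return rec_id, score, raw.rstrip(b"\x00").decode("utf-8")
```

Any binary file object opened for reading and writing will do. No index needs to be held in memory, which fits the article's complaint about MySQL's index caching.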

    Lessons Learned

  • Configure your DB and web server correctly. MySQL and Apache's memory usage can easily spiral out of control, which leads to grindingly slow performance as swapping increases.
  • Serve only the users you care about. Block content thieves that crawl your site using a lot of valuable resources for nothing. Monitor the number of content pages they fetch per minute. If a threshold is exceeded, do a reverse lookup on their IP address and configure your firewall to block them.
  • Cache as much DB data and static content as possible. Perl's Cache::FileCache was used to cache DB data and rendered HTML on disk.
  • Use two different host names in URLs to enable browser clients to load images in parallel.
  • Make content as static as possible. Create a separate image and CSS server to serve the static content. Use keepalives on static content, as static content uses little memory per thread/process.
  • Leave plenty of spare memory. Spare memory allows Linux to use more memory for file system caching, which increased performance about 20 percent.
  • Turn Keepalive off on your dynamic content. Increasing http requests can exhaust the thread and memory resources needed to serve them.
  • You may not need a complex RDBMS for accessing data. Consider a lighter weight database such as BerkeleyDB.
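The "monitor pages fetched per minute" lesson above can be sketched as a sliding-window counter per client IP. This Python example is only an assumption about how such monitoring might be done. The class name `FetchRateMonitor` and its limits are hypothetical, and the reverse lookup and firewall steps are left out.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FetchRateMonitor:
    """Flag clients that fetch more than `limit` pages within `window` seconds."""

    def __init__(self, limit: int, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, ip: str, now: float) -> bool:
        """Record one page fetch; return True when the client exceeds the limit."""
        hits = self._hits[ip]
        hits.append(now)
        while hits and now - hits[0] >= self.window:
            hits.popleft()         # drop fetches that fell out of the window
        return len(hits) > self.limit
```

A request handler would call `record(client_ip, time.time())` on each content page and hand any flagged address to whatever blocking mechanism is in place.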

  • Saturday

    .Net2 and AJAX scalability?

    Am I mad to consider using .Net2 and AJAX for a high-scalability application? In case you wonder why, it's the legacy of a website built on IIS and .Net 1.1, and we're looking for ways to make the content more attractive and interactive. In this case, it's a medical image library being shared by a few Wikis and online coursework for medical students (< 15K users) and doctors (< 150K users). But I'm worried about the performance overhead. We already have a performance problem because of personalising the content for users according to their type (student or doctor) and, for doctors, their grade and speciality.


    How Gravatar scales on hardware

    Automattic recently purchased Gravatar and has switched the server onto its hosting platform. They host over 1.7 million blogs, with well over 60,000 new posts submitted each day, generating 10 - 12 million page views per day. Barry has a great post on the changes they've introduced to help Gravatar scale.


    Paper: Wikipedia's Site Internals, Configuration, Code Examples and Management Issues

    Wikipedia and Wikimedia have some of the best, most complete real-world documentation on how to build highly scalable systems. This paper by Domas Mituzas covers a lot of details about how Wikipedia works, including: an overview of the different packages used (Linux, PowerDNS, LVS, Squid, lighttpd, Apache, PHP5, Lucene, Mono, Memcached), how they use their CDN, how caching works, how they profile their code, how they store their media, how they structure their database access, how they handle search, how they handle load balancing and administration. All with real code examples and examples of configuration files. This is a really useful resource.

    Related Articles

  • Wikimedia Architecture
  • Domas Mituzas' Blog

  • Thursday

    Should JSPs be avoided for high scalability?

    I just heard about some web sites where Velocity templates are used to render HTML instead of using JSPs, and all the processing is performed in servlets. Can JSPs cause issues with scalability? Thanks, Unmesh


    Who can answer or analyze the image store and visit solution about


    Scaling Operations Saves Money and Scales Faster

    Jesse Robbins at O'Reilly Radar has a nice post on how spending a little time up front figuring out how to scale your operations process saves money on ops people and saves time when adding and upgrading servers. Adding, monitoring, and upgrading servers can get so incredibly screwed up that a herd of squirrels has to work overtime just to put out a release. Or it can be one-button simple from your automated build system out to your servers. This is one area where "do the simplest thing that could possibly work" is a dumb idea, and Jesse does a good job capturing the advantages of doing it right.


    Hire Facebook, Ning, and Salesforce to Scale for You

    One of the premier scaling strategies is always: get someone else to do the work for you. But unlike Tom Sawyer, you won't have to trick anyone into whitewashing a fence for you. Times have changed. Companies like Ning, Facebook, and Salesforce are more than happy to help. Their price: lock-in.

    Previously you had few options when building a "real" website. You needed to do everything yourself. Infrastructure and application were all yours. Then companies stepped in by commoditizing parts of the infrastructure, but the application was still yours. The next step is full-on Borg take-no-prisoners assimilation where the infrastructure and application are built as one collective. What you have to decide as someone faced with building a scalable website is whether these new options are worth the price.

    Feeding this explosion of choice is one of the new strategy games on the intertubes: the Internet Platform Game. Ning's Marc Andreessen defines a platform as: "a system that can be programmed and therefore customized by outside developers -- users -- and in that way, adapted to countless needs and niches that the platform's original developers could not have possibly contemplated, much less had time to accommodate."

    The idea is you'll win great rewards in exchange for coding to someone else's internet platform. From Ning you'll win a featureful and customizable social networking platform that they are completely responsible for scaling. The cost ranges from free to very reasonable. From Facebook you'll win prime space on the profile page of over 40 million virally infected customers. It's free, but you must make your application scalable enough to handle all those millions. By coding to the Salesforce platform you'll win the same infrastructure that executes 100 million Salesforce transactions a day. The cost of their service is unknown at this time.

    The Three Levels of Internet Platforms

    Mr. Andreessen then went a step further and defined a three level platform categorization scheme:
  • Level 1: Access API. A platform provided in the form of a REST/SOAP web services API. Examples: eBay, Paypal, Flickr, Digg. Your application lives outside the service and their API is your only access point to the system. Scalability is completely up to you. You are basically building a mashup from distributed parts in your own data center.
  • Level 2: Plug-In API. A platform provided in the form of a system for embedding your application inside another application. Examples: Facebook, Eclipse, Firefox. You still use an API, but the user sees an integrated application because your application is using their screen real estate, log in, user accounts and so on. For internet plug-ins scalability is still up to you. The millions of Facebook users running your application must run completely on your servers.
  • Level 3: Deep hosting. A platform provided in the form of an API, Plug-in, and fully hosted runtime environment. Examples: Ning, Salesforce, and Second Life. Your application is completely integrated with a host application framework and runs completely on the host servers. They are responsible for adding machines, maintenance, and management. You are free to just write your application.

    Amazon is on his original list, but I don't put it there. If Amazon exposed their Dynamo service I would, but since with EC2 you are stuck worrying about database storage they really don't belong here.

    Like the typical depiction of human ascent from amoeba to weapon-wielding, art-appreciating primate, the levels are meant to indicate progress. In reality, though, evolution isn't about progress at all. It's all about survival through adaptation to local ecological niches. And that's how I look at the levels. At each level you gain something and you lose something. You need to select your niche by looking at your talents and needs.

    Why Use an Access API?

    Using open APIs to access services is what has made the internet great. APIs provide the most flexibility at the greatest cost. You get access to a huge number of wonderful services for virtually nothing. The linkage between websites is a relatively simple API and a data definition. You can do anything you want, but you have to build the infrastructure to do it. Yet that's a lot better than building your own map service, your own SMS service, or your own photo sharing service. Still, there's so much work to do. Grid services make the job easier, but the level of expertise it takes to create a scalable site is still very high.

    Why Use a Plug-In?

    Since Facebook is the only internet company in this category the answer is clear why you want to be a Facebook plug-in: to get access to a lot of users, connected by an exploitable social graph, for the purpose of exponentially propagating your application along the graph. Most would be ecstatic to get to hundreds of thousands of regular users on their own standalone site. With Facebook that's very possible. The reward is great, but the costs are great too. Your application must be something that can be deconstructed onto Facebook. I don't see gmail making it as a Facebook app. You must subject yourself to a lot of restrictions to use the Facebook infrastructure. You must trust yourself to a poorly documented system in which it is hard to get anything done. And to top it off: Facebook does not host your application.

    This really blew me away when I first heard about it. When someone says they are offering a platform my immediate assumption is they are hosting your application. That's what a platform is, isn't it? But your application must run on your own hardware. Imagine going from 0 to millions of users in the space of a few days. How would you handle that? Well, that's exactly the problem iLike (a popular music sharing site) had when they released their Facebook app. Mr. Andreessen gives a wonderful if somewhat self-serving account of iLike's troubles with viral growth. After launching they posted this on their blog:

    "In our first 20 hours of opening doors we had 50,000 users sign up, and it is only accelerating. (10,000 users joined in the first 12 hrs. 10,000 more users in the next 3 hrs. 30,000 more users in the next 5 hrs!!) We started the system not knowing what to expect, with only 2 servers, but ready with backup. Facebook's rabid userbase chewed up our 2 servers almost instantly. We doubled our capacity to catch up. And then we doubled it again. And again. And again. Oh crap - we ran out of servers!! Although [iLike] has a very healthy level of Web traffic, and even though about half of all the servers in our datacenter were sitting unused, idle, as backup capacity, we are now completely maxed out. We just emailed everybody we know across over a dozen Bay Area startups, corporations, and venture firms in a desperate plea to find spare servers so we can triple our capacity for the continued onslaught. Tomorrow we are picking up over 100 servers from different companies to have them installed just to handle the weekend's traffic. (For those who responded to our late night pleas, thank you!)"

    iLike says they now have over 3 million Facebook users and are growing at an astonishing rate of 300,000 users per day. That number of users and growth rate will make almost anyone salivate. Yet how many can afford the hundreds and hundreds of servers it would take to handle all those users, especially if you have an unclear monetization strategy? Which brings us to Deep Hosting and Mr. Andreessen's end game for the internet's evolution.
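To make the acceleration in the quoted signup figures concrete, here is a small back-of-the-envelope calculation in Python. The numbers are the ones quoted from the blog post above; the helper name `signup_rates` is just for illustration.

```python
from typing import List, Tuple

# (new users, hours) for each phase quoted in the post
phases: List[Tuple[int, int]] = [(10_000, 12), (10_000, 3), (30_000, 5)]


def signup_rates(phases: List[Tuple[int, int]]) -> List[int]:
    """Average signups per hour in each phase, rounded to whole users."""
    return [round(users / hours) for users, hours in phases]


total_users = sum(users for users, _ in phases)   # 50,000 users
total_hours = sum(hours for _, hours in phases)   # 20 hours
print(total_users, total_hours)                   # prints: 50000 20
print(signup_rates(phases))                       # prints: [833, 3333, 6000]
print(round(300_000 / 24))                        # later growth, per hour: 12500
```

The hourly rate grows roughly sevenfold within the first 20 hours, and the later figure of 300,000 users per day works out to about 12,500 per hour, which is why doubling capacity a few times was not enough.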

    Why Use Deep Hosting?

    The trouble with handling application growth under Facebook's large user base has an obvious solution: host your application on their infrastructure. This is exactly what Mr. Andreessen has done with Ning. Out of the Ning box you get an exceptionally functional social networking package. So functional, in fact, that it makes almost anyone think "do I really need to reinvent all this stuff when they've already done it? Can't I just tweak a few things and make it my own?" And that's exactly what Ning wants to hear. They've made it so you can completely rebrand their software, add your own features using normal programming tools, yet still host your application on their platform, on their servers, in their datacenter. So you don't have to worry about scaling. It's Ning's job to scale the database, back it up, manage the infrastructure, add servers, and do all the other nasty bits that keep so many people away from deploying successful websites. So the temptation is clear. Go with Ning and you immediately get a cool system that will scale and that you can still program if you feel the need. But with all that power comes a price, as usual. You are locked inside a gilded cage. If your application slows down there's not much you can do about it. I found their documentation better than Facebook's, but not very useful for someone looking to get going quickly, and that makes me very nervous when adopting a platform. Yet when they add features, as they frequently do, your app gets them for free. You see some of the same effects here that all Google apps get when the Google stack is improved. And not having to worry about scalability is very attractive, especially at such a reasonable cost.

    Problems with Deep Hosting

    Mr. Andreessen thinks that "in the long run, all credible large-scale Internet companies will provide Level 3 platforms." There are three problems with this argument.
  • One: Ning has the same problem as Salesforce: only their part of the application infrastructure is scalable. What if I want to add a new service that is specific to my application? Let's say I want to send mass emailings for an invitation feature, for example. How do I make my infrastructure for this run inside their platform? I don't. Which means I have to be able to build a scalable infrastructure anyway. Which means I might as well do the whole thing. But Ning might say their functionality is so compelling that it's worth the trade-off. You can always make those external services. Which brings us back to: if I have to do one part I might as well do it all. And it also brings us to the second problem with the L3 platform model.
  • Two: How compelling will each L3 domain be? You have to be very, very attractive to even get someone to consider assimilating into a platform. Ning has done an excellent job at this. But how many other companies in how many other domains will do as good a job? Precious few, I would think.
  • Three: Mr. Andreessen maintains it is "really easy to learn how to program -- in fact, it's never been easier." So centering the L3 platform definition around programmability is not seen as a concern. But programming is not easy. It's very hard. Especially with such poorly documented systems. The more code you have to write the further you are away from your goal and the further you are away from adoption. This is why we see systems like Drupal with well defined plug-in architectures being very popular. Most people can't and won't ever program, so building things from pre-existing parts (like how our bodies evolved) allows people to get a lot of core functionality with the chance for specialization and expandability.

    What does this mean for you?

    I've found it difficult to reconcile all the different pros and cons of each approach. There is a definite value in all these alternatives. If you have a vision for an application then building it yourself is the only way you'll achieve that vision. So do it yourself. But what good is a vision without users? So go Facebook. But I could get something going very quickly in Ning and then expand over time with much less hassle, even if it's not exactly what I want. So go Ning. What to do? The point of this post isn't to come to a conclusion. The point has been to cover some new and different approaches to scalability so you can spend a few sleepless nights pondering your options too :-)

    Related Articles

  • The three kinds of platforms you meet on the Internet by Marc Andreessen
  • Analyzing the Facebook Platform, three weeks in by Marc Andreessen
  • Q&A with iLike’s Ali Partovi, on Facebook By Eric Eldon
  • I want to understand Ning's architecture and how it works
  • Response to Three Platforms You Meet by Joshua
  • Ning's Developer Documentation
  • Facebook's Application Architecture
  • Salesforce's On-Demand Computing Platform
  • Building a Business on Virtual Infrastructure, Using Google and
