
Strategy: Don't Use Polling for Real-time Feeds

Ivan Zuzak wrote a fascinating article on Real-time feed processing and filtering using Google App Engine to build Feed-buster, a service that inserts MediaRSS tags into feeds that don't have them. He talks about using both polling and PubSubHubbub (real-time) to process FriendFeed feeds. Ivan is trying to devise a separate filtering service where:

  1. Filtering services should be applied as close to the publisher as possible, so that notifications nobody wants don't waste network resources.
  2. Processing services should be applied as close to the subscriber as possible, so that the original update can travel through the network as a single notification for as long as possible.
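
A toy sketch of those two placement rules, assuming a made-up in-memory model: the subscriber dictionaries and the function names `filter_at_publisher`, `process_at_subscriber`, and `deliver` are illustrative only and are not taken from Ivan's article or from any feed API.

```python
def filter_at_publisher(update, subscribers):
    """Publisher-side step: keep only the subscribers that want this update."""
    return [s for s in subscribers if s["wants"](update)]


def process_at_subscriber(update, subscriber):
    """Subscriber-side step: transform the original update on arrival."""
    return subscriber["process"](update)


def deliver(update, subscribers):
    """Send the unmodified update to interested subscribers only.

    Returns a dict mapping subscriber name to the processed result.
    """
    interested = filter_at_publisher(update, subscribers)
    return {s["name"]: process_at_subscriber(update, s) for s in interested}
```

Unwanted updates are dropped before anything is sent, and each subscriber's transformation runs only at the last hop, so the same original update can be shared by every interested subscriber until then.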

Besides being generally interesting, the article makes an insightful observation about using polling in combination with metered Infrastructure/Platform services:

Polling is bad because AppEngine applications have a fixed free daily quota for consumed resources, when the number of feeds the service processed increased - the daily quota was exhausted before the end of the day because FF polls the service for each feed every 45 minutes.
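
To see how quickly this adds up, here is a back-of-the-envelope sketch. The 45-minute interval comes from the quote; the quota figure in the usage note is a made-up number for illustration, not App Engine's actual free quota, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def polls_per_day(interval_minutes):
    """Number of polls one feed receives per day at a fixed polling interval."""
    return MINUTES_PER_DAY // interval_minutes


def max_feeds_within_quota(daily_request_quota, interval_minutes):
    """Largest number of feeds that can be served before the quota runs out."""
    return daily_request_quota // polls_per_day(interval_minutes)
```

At a 45-minute interval each feed costs 32 requests a day, so an assumed quota of 10,000 requests covers 312 feeds. Tightening the interval to 5 minutes to get closer to real-time raises the cost to 288 requests per feed and drops the ceiling to 34 feeds.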

This fits directly in with the ideas in Cloud Programming Directly Feeds Cost Allocation Back into Software Design. My general preference is to poll a distributed queue for work items. It's robust and allows your system to control its own resource usage by determining when to poll. Otherwise you can easily be overwhelmed by fast pushers. Here the overwhelming is going the other way: your budget is being overwhelmed by the polling requests. And the more you try to approximate real-time with frequent polling requests, the more your budget is busted.
It's a cool example of how costs, algorithm, and platform choices all feed into and shape product architectures.
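
A minimal sketch of the queue-polling preference described above, assuming Python's in-process `queue.Queue` as a stand-in for a real distributed queue. The name `drain_with_budget` and the `budget` parameter are hypothetical.

```python
import queue


def drain_with_budget(work_queue, handle, budget):
    """Pull at most `budget` items from the queue and return how many were handled.

    The consumer decides when to call this and how large `budget` is,
    so a fast producer cannot force it to do more work than it chose to.
    """
    handled = 0
    while handled < budget:
        try:
            item = work_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to do; stop early
        handle(item)
        handled += 1
    return handled
```

Calling `drain_with_budget(q, process, 3)` on a queue holding five items handles three and leaves two for the next call. The worker, not the producer, sets the pace.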


Reader Comments (2)

Thanks Todd! I've been following your blog for some time now and it's really great to be mentioned here.

I agree with you on the points that algorithms and system design will be changed and affected by the new billing models of cloud computing platforms. The Big $ notation you imagined can be tied in with cloud interoperability (http://cloudforum.org/): if I develop an application with a $(A) for a cloud infrastructure X, and then wish to move to infrastructure Y (e.g. moving from GAE to Amazon), then not only will the program probably need to be rewritten to a new programming language for Y but a design with $(B) != $(A) will possibly be needed. And possibly, and really going into SciFi territory here, this would be done automatically. A cross-cloud compiler maybe? :)

Your idea of decoupling the service from the consumer with a request queue and then polling the queue for work items on the service side is indeed great for controlling resource consumption. I have a feeling that these kinds of infrastructure mechanisms *must* be a part of every cloud computing platform, together with reflective APIs which provide detailed insight into how the application is consuming resources. When combined, these two enable the application to programmatically and dynamically scale itself. I believe this is the motivation behind Task Queues in GAE, for example, but it covers only a part of the needed functionality (since, AFAIK, there is no way to determine from within the application how many resources it has consumed).

January 11, 2010 | Unregistered Commenter Ivan Zuzak

http://rsscloud.org/ + http://realtimerss.org/

January 12, 2010 | Unregistered Commenter Author
