Saturday
May 3, 2008

Product: nginx

Update 6: nginx_http_push_module. Turn nginx into a long-polling message queuing HTTP push server.

Update 5: In Load Balancer Update Barry describes how WordPress.com moved from Pound to Nginx and are now "regularly serving about 8-9k requests/second and about 1.2Gbit/sec through a few Nginx instances and have plenty of room to grow!".
Update 4: Nginx better than Pound for load balancing. Pound spikes at 80% CPU, Nginx uses 3% and is easier to understand and better documented.
Update 3: igvita.com combines two cool tools, Nginx and Memcached, for better performance - a 400% boost!
Update 2: Software Project on Installing Nginx Web Server w/ PHP and SSL. Breaking away from mother Apache can be a scary proposition, and this kind of getting-started article really helps ease the separation.
Update: Slicehost has some nice tutorials on setting up Nginx.

From their website:
Nginx ("engine x") is a high-performance HTTP server and reverse proxy, as well as an IMAP/POP3/SMTP proxy server. Nginx was written by Igor Sysoev for Rambler.ru, Russia's second-most visited website, where it has been running in production for over two and a half years. Igor has released the source code under a BSD-like license. Although still in beta, Nginx is known for its stability, rich feature set, simple configuration, and low resource consumption.



Bob Ippolito says of Nginx:

The only solution I know of that's extremely high performance that offers all of the features that you want is Nginx... I currently have Nginx doing reverse proxy of over tens of millions of HTTP requests per day (that's a few hundred per second) on a single server. At peak load it uses about 15MB RAM and 10% CPU on my particular configuration (FreeBSD 6).

Under the same kind of load, Apache falls over (after using 1000 or so processes and god knows how much RAM), Pound falls over (too many threads, and using 400MB+ of RAM for all the thread stacks), and Lighty leaks more than 20MB per hour (and uses more CPU, but not significantly more).

See Also

 

  • Nginx vs. Lighttpd for a small VPS
  • nginx: high performance smtp/pop/imap proxy
  • Light Weight Web Server
  • Nginx and Mirror on Demand
  • Running Drupal with Clean URL on Nginx or Lighttpd
  • Goodbye Pound, Hello Nginx
  • Using Nginx, SSI and Memcache to Make Your Web Applications Faster
  • Ruby on Rails hosting with Nginx
  • NGINX Tutorial: Developing Modules
    Friday
    May 2, 2008

    Friends for Sale Architecture - A 300 Million Page View/Month Facebook RoR App

    Update: Jake in Does Django really scale better than Rails? thinks apps like FFS shouldn't need so much hardware to scale.

    In a short three months Friends for Sale (think Hot-or-Not with a market economy) grew to become a top 10 Facebook application handling 200 gorgeous requests per second and a stunning 300 million page views a month. They did all this using Ruby on Rails, two part time developers, a cluster of a dozen machines, and a fairly standard architecture. How did Friends for Sale scale to sell all those beautiful people? And how much do you think your friends are worth on the open market? 

    Site: http://www.facebook.com/apps/application.php?id=7019261521

    Information Sources

  • Siqi Chen and Alexander Le, co-creators of Friends for Sale, answering my standard questionnaire.
  • Virality on Facebook

    The Platform

  • Ruby on Rails
  • CentOS 5 (64 bit)
  • Capistrano - update and restart application servers.
  • Memcached
  • MySQL
  • Nginx
  • Starling - distributed queue server
  • Softlayer - hosting service
  • Pingdom - for website monitoring
  • LVM - logical volume manager
  • Dr. Nic's Magic Multi-Connections gem - splits database reads and writes across servers

    The Stats

  • 10th most popular application on Facebook.
  • Nearly 600,000 active users.
  • Half a million unique visitors a day and growing fast.
  • 300 million page views a month.
  • 300% monthly growth rate, but that is plateauing.
  • 2.1 million unique visitors in the past month
  • 200 requests per second.
  • 5TB of bandwidth per month.
  • 2 part-time developers (now full time), and 1 remote DBA contractor.

  • 4 DB servers, 6 application servers, 1 staging server, and 1 front end server.
    - 6 application servers, each with 4 cores and 8 GB of RAM.
    - Each application server runs 16 mongrels, for a total of 96 mongrels.
    - A 4 GB memcache instance on each application server.
    - 2 database servers with 32 GB of RAM, 4 cores, and 4x 15K SCSI disks in RAID 10, in a master-slave setup.

    Getting to Know You

  • What is your system for?

    Our system is designed for our Facebook application, Friends for Sale.
    It's basically Hot-or-Not with a market economy. At the time of this
    writing it's the 10th most popular application on Facebook.

    Their Facebook description reads: Buy and sell your friends as pets! You can make your pets poke, send gifts, or just show off for you.
    Make money as a shrewd pets investor or as a hot commodity! Friends for Sale is the bee's knees!


  • Why did you decide to build this system?

    We designed this as more of an experiment to see if we understood virality concepts and metrics on Facebook. I guess we do. =)

  • What particular design/architecture/implementation challenges does your system have?

    As a Facebook application, every request is dynamic so no page caching is possible. Also, it is a very interactive, write heavy application so scaling the database was a challenge.

  • What did you do to meet these challenges?

    We memcached extensively early on - every page reload results in 0 SQL calls. We mostly use Rails' fragment caching with custom expiration logic.

  • How big is your system?

    We had more than half a million unique visitors yesterday and growing fast. We're on track to do more than 300 million page views this month.

  • What is your in/out bandwidth usage?

    We used around 3 terabytes of bandwidth last month. This month should be at least 5TB or so. This number is just for a few icons and XHTML/CSS.

  • How many documents do you serve? How many images? How much data?

    We don't really have unique documents ... we do have around 10 million user profiles though.

    The only images we store are a few static image icons.

  • How fast are you growing?

    We went from around 3M page views per day a month ago to more than 10M page views a day. A month before that we were doing 1M page views per day. So that's around a 300% monthly growth rate but that is plateauing. On a request per second basis, we get around 200 requests per second.
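
    As a rough sanity check on those numbers, converting page views per day into an average request rate is a one-liner; the 200 requests/second figure reflects peak rather than average load. A minimal sketch:

    # Rough sanity check: convert daily page views into an average request rate.
    # Figures come from the answer above; peak traffic (about 200 req/s) sits
    # well above this average, as expected for a site with daily usage peaks.
    SECONDS_PER_DAY = 24 * 60 * 60

    def average_requests_per_second(page_views_per_day):
        return page_views_per_day / SECONDS_PER_DAY

    print(round(average_requests_per_second(10000000)))  # ~116 req/s average for 10M views/day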

  • What is your ratio of free to paying users?

    It's all free.

  • What is your user churn?

    It's around 1% per day, with a growth rate of 3% or so per day in terms of installed users.

  • How many accounts have been active in the past month?

    We had roughly 2.1 million unique visitors in the past month, according to Google.

  • What is the architecture of your system?

    It's a relatively standard Rails cluster. We have a dedicated front end proxy balancer / static web server running nginx, which proxies directly to six application servers, each with 4 cores and 8 GB of RAM. Each application server runs 16 mongrels, for a total of 96 mongrels. The front end load balancer proxies directly to the mongrel ports. In addition, we run a 4 GB memcache instance on each application server, along with a local Starling distributed queue server and miscellaneous background processes.
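
    For illustration only (the interview doesn't include the actual configuration), an nginx front end proxying straight to a pool of mongrel ports is typically wired up with an upstream block along these lines; the hostnames and ports are placeholders:

    upstream mongrels {
        # One entry per mongrel port on each app server; addresses here are
        # placeholders, not the site's actual configuration.
        server app1.internal:8000;
        server app1.internal:8001;
        # ... 16 ports per server, across all 6 application servers ...
        server app6.internal:8015;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://mongrels;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }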

    We use god to monitor our processes.

    On the DB layer, we have 2 servers with 32 GB of RAM, 4 cores, and 4x 15K SCSI disks in RAID 10, in a master-slave setup. We use Dr. Nic's magic multi-connections gem in production to split reads and writes between the boxes.

    We are adding more slaves right now so we can distribute the read load better and have better redundancy and backup policies. We also get help from Percona (the mysqlperformanceblog guys) for remote DBA work.

    We're hosted on Softlayer - they're a fantastic host. The only problem was that their hardware load balancing server doesn't really work very well ... we had lots of problems with hanging connections and latency. Switching to a dedicated box running just nginx fixed everything.

  • How is your system architected to scale?

    It really isn't. On the application layer we are shared-nothing so it's pretty trivial. On the database side we're still with a monolithic master and we're trying to push off sharding for as long as we can. We're still vertically scaled on the database side and I think we can get away with it for quite some time.

  • What do you do that is unique and different that people could best learn from?

    The three things that are unique are:

    1. Neither of the two developers involved had previous experience with large-scale Rails deployment.
    2. Our growth trajectory is relatively rare in the history of Rails deployments.
    3. We had very little opportunity for static page caching - each request hits the full Rails stack.

  • What lessons have you learned? Why have you succeeded? What do you wish you would have done differently? What wouldn't you change?

    We learned that a good host, good hardware, and a good DBA are very important. We used to be hosted on Railsmachine, which to be fair is an excellent shared hosting company, and they did go out of their way to support us. In the end though, we were barely responsive for a good month due to hardware problems, and it only took two hours to get up and running on Softlayer without a hitch. Choose a good host if you plan on scaling, because migrating isn't fun.

    The most important thing we learned is that your scalability problem is pretty much always, always, always the database. Check it first, and if you don't find anything, check again. Then check again. Without exception, every performance problem we had could be traced to the database server, the database configuration, the query, or the use and non-use of indices.
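
    One concrete way to act on that advice is to run EXPLAIN on suspect queries and confirm an index is actually used. A minimal sketch in Python with the MySQLdb driver; the connection settings, table, and query are invented for illustration:

    import MySQLdb  # assumes the MySQL-python driver is installed

    # Hypothetical connection settings and query, purely for illustration.
    conn = MySQLdb.connect(host="db-master", user="app", passwd="secret", db="ffs")
    cursor = conn.cursor()

    # EXPLAIN shows whether MySQL can use an index for this query;
    # a join type of ALL (full table scan) on a hot query is usually the culprit.
    cursor.execute("EXPLAIN SELECT * FROM pets WHERE owner_id = %s", (42,))
    for row in cursor.fetchall():
        print(row)

    cursor.close()
    conn.close()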

    We definitely should have gotten onto a better host earlier in the game so we would have stayed up.

    We definitely wouldn't change our choice of framework - Rails was invaluable for rapid application development, and I think we've pretty much proven that two guys without a lot of scaling experience can scale a Rails app up. The whole 'but does Rails scale?' discussion sounds like a bunch of masturbation - the point is moot.

  • How is your team setup?

    We have two Rails developers, inclusive of me. We very recently retained the services of a remote DBA for help on the database end.

  • How many people do you have?

    On the technical side, 2 part-time developers (now full time) and 1 remote DBA contractor.

  • Where are they located?

    The full-time employees are located in the SOMA area of San Francisco.

  • Who performs what roles?

    The two developers serve as co-founders. I (Siqi) was responsible for front end design and development early on, but since I had some experience with deployment I also ended up handling network operations and deployment as well. My co-founder Alex is responsible for the bulk of the Rails code - basically all the application logic is from him. Now I find myself doing more deep back end network operations tasks like MySQL optimization and replication - it's hard to find time to get back to the front end, which is what I love. But it's been a real fun learning experience, so I've been eating up all I can from this.

  • Do you have a particular management philosophy?

    Yes - basically find the smartest people you can, give them the best deal possible, and get out of their way. The best managers GET OUT OF THE WAY, so I try to run the company as much as I can with that in mind. I think I usually fail at it.

  • If you have a distributed team how do you make that work?

    We'd have to have some really good communication tools in the cloud - somebody would have to be a Basecamp nazi. I think remote work / outsourcing is really difficult - I prefer to stay away from it for core development. For something like MySQL DBA or even sysadmin work, it might make more sense.

  • What do you use?

    We use Rails with a bunch of plugins, most notably cache-fu from Chris Wanstrath and magic multi connections from Dr. Nic. I use VIM as my editor with the rails.vim plugin.

  • Which languages do you use to develop your system?

    Ruby / Rails

  • How many servers do you have?

    We now have 12 servers in the cluster.

  • How are they allocated?

    4 DB servers, 6 application servers, 1 staging server, and 1 front end server.

  • How are they provisioned?

    We order them from Softlayer - there's a less than 4 hour turnaround for most boxes, which is awesome.

  • What operating systems do you use?

    CentOS 5 (64 bit)

  • Which web server do you use?

    nginx

  • Which database do you use?

    MySQL 5.1

  • Do you use a reverse proxy?

    We just use nginx's built in proxy balancer.

  • How is your system deployed in data centers?

    We use a dedicated hosting service, Softlayer.

  • What is your storage strategy?

    We use NAS for backups but internal SCSI drives for our production boxes.

  • How much capacity do you have?

    Across all of our boxes we probably have around ... 5 TB of storage or
    thereabouts.

  • How do you grow capacity?

    Ad-hoc. We haven't done a proper capacity planning study, to our detriment.

  • Do you use a storage service?

    Nope.

  • Do you use storage virtualization?

    Nope.

  • How do you handle session management?

    Right now we just persist it to the database - it would be fairly easy to use memcache directly for this purpose though.

  • How is your database architected? Master/slave? Shard? Other?

    Master/slave right now. We're moving towards a Master/Multi-slave with a read only load balancing proxy to the slave cluster.

  • How do you handle load balancing?

    We do it in software via nginx.

  • Which web framework/AJAX Library do you use?

    Rails.

  • Which real-time messaging frameworks do you use?

    None.

  • Which distributed job management system do you use?

    Starling

  • How do you handle ad serving?

    We run network ads. We also weight our various ad networks by eCPM on our application layer.

  • Do you have a standard API to your website?

    Nope.

  • How many people are in your team?

    2 developers.

  • What skill sets does your team possess?

    Me: Front end design, development, limited Rails. Obviously, recently proficient in MySQL optimization and large scale Rails deployment.
    Alex: application logic development, front end design, general software engineering.

  • What is your development environment?

    Alex develops on OSX while I develop on Ubuntu. We use SVN for version control. I use VIM for editing and Alex uses TextMate.

  • What is your development process?

    On the logic layer, it's very test driven - we test extensively. On the application layer, it's all about quick iterations and testing.

  • What is your object and content caching strategy?

    We cache both in memcache with no TTL, and we just manually expire.
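
    Outside of Rails, the same pattern - cache with no TTL and expire by hand when the underlying data changes - looks roughly like this. A minimal Python sketch using the python-memcached client; the key scheme and helper functions are made up for illustration:

    import memcache  # python-memcached client

    mc = memcache.Client(["127.0.0.1:11211"])

    def get_profile(user_id, load_from_db):
        """Read-through cache: store with time=0 (no expiry), invalidate manually."""
        key = "profile:%d" % user_id          # hypothetical key scheme
        profile = mc.get(key)
        if profile is None:
            profile = load_from_db(user_id)
            mc.set(key, profile, time=0)      # 0 = never expire; we expire by hand instead
        return profile

    def update_profile(user_id, new_data, save_to_db):
        save_to_db(user_id, new_data)
        mc.delete("profile:%d" % user_id)     # manual expiration on write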

  • What is your client side caching strategy?

    None.

    How do you manage your system?

  • How do you check global availability and simulate end-user performance?

    We use Pingdom for external website monitoring - they're really good.

  • How do you health check your server and networks?

    Right now we're just relying on our external monitoring and Softlayer's ping monitoring. We're investigating FiveRuns as a possible solution for server monitoring.

  • How do you graph network and server statistics and trends?

    We don't.

  • How do you test your system?

    We deploy to staging and run some sanity tests, then we do a deploy to all application servers.

  • How do you analyze performance?

    We trace back every SQL query in development to make sure we're not doing any unnecessary calls or model instantiations. Other than that, we haven't done any real benchmarking.

  • How do you handle security?

    Carefully.

  • How do you decide what features to add/keep?

    User feedback and critical thinking. We are big believers in simplicity so we are pretty careful to consider before we add any major features.

  • How do you implement web analytics?

    We use a home grown metrics tracking system for virality optimization,
    and we also use Google Analytics.

  • Do you do A/B testing?

    Yes, from time to time we will tweak aspects of our design to optimize for virality.

    How is your data center setup?

  • Which firewall product do you use?
  • Which DNS service do you use?
  • Which routers do you use?
  • Which switches do you use?
  • Which email system do you use?
  • How do you handle spam?
  • How do you handle virus checking of email and uploads?

    Don't know to all of the above.

  • How do you backup and restore your system?

    We use LVM to do incrementals on a weekly and daily basis.

  • How are software and hardware upgrades rolled out?

    Right now they are done manually, except for new Rails application deployments. We use capistrano to update and restart our application servers.

  • How do you handle major changes in database schemas on upgrades?

    We usually migrate on a slave first and then just switch masters.

  • What is your fault tolerance and business continuity plan?

    Not very good.

  • Do you have a separate operations team managing your website?

    Oh we wish.

  • Do you use a content delivery network? If so, which one and what for?

    Nope

  • What is your revenue model?

    CPM - more page views, more money. We also have incentivized direct offers through our virtual currency.

  • How do you market your product?

    Word of mouth - the social graph. We just leverage viral design tactics to grow.

  • Do you use any particularly cool technologies or algorithms?

    I think Ruby is pretty particularly cool. But no, not really - we're not doing rocket science, we're just trying to get people laid.

  • Do you store images in your database?

    No, that wouldn't be very smart.

  • How much up front design should you do?

    Hm. I'd say none if you haven't scaled up anything before, and a lot if you have. It's hard to know what's actually going to be the problem until you've actually been through it and seen what real load problems look like. Once you've done that, you have enough domain knowledge to do some actual meaningful up front design on your next go-around.

  • Has anything surprised you, either for the good or the bad?

    How unreliable vendor hardware can be, and how different support can be from host to host. The number one most important thing you will need is a scaled up dedicated host who can support your needs. We use Softlayer and we can't recommend them highly enough.

    On the other hand, it's surprising how far just a master-multislave setup can take you on commodity hardware. You can easily do a billion page views per month on this setup.

  • How does your system evolve to meet new scaling challenges?

    It doesn't really; we just fix bottlenecks as they come, and we see them coming.

  • Who do you admire?

    Brad Fitzpatrick for inventing memcache, and anyone who has successfully horizontally scaled anything.

  • How are you thinking of changing your architecture in the future?

    We will have to start sharding by users soon as we hit database size and write limits.

    Their Thoughts on Facebook Virality

  • Facebook models the social graph in digital form as accurately and completely as possible.
  • The social graph is more important than features.
  • Facebook enables rapid social distribution of new applications through the social graph.
  • Your application idea should be: social, engaging, and universal.
  • The social aspect makes it viral.
  • Engaging makes it monetizable.
  • Universal gives it potential.
  • Friends for Sale is social because you are buying and selling your social graph.
  • It's engaging because it's a twist on an idea, low pressure, flirty, and a bit cynical.
  • It's universal because everyone is vain, has a price, and wants to flirt with hot people.
  • Every touch point in the application is an opportunity for recruiting new users.
  • Every user converts 1.4 other users, which is the basis for exponential growth (a minimal viral-coefficient sketch follows this list).
  • For every new user track the number of invites, notifications, minifeed items, profile clicks, and other channels.
  • For every channel, track the percentage clicked, converted, and uninstalled.
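
    To make that 1.4 figure concrete: the viral coefficient is just invites sent per user times the conversion rate per invite, summed over channels, and anything above 1.0 compounds. A minimal sketch with invented channel numbers:

    # Viral coefficient (K): new users recruited per existing user, summed over channels.
    # K > 1.0 means each generation of users recruits a larger one; the interview cites ~1.4.
    def viral_coefficient(channels):
        """channels: list of (invites_per_user, conversion_rate) pairs, one per channel."""
        return sum(invites * conversion for invites, conversion in channels)

    # Hypothetical channel mix: invites, notifications, minifeed items, profile clicks.
    channels = [(10, 0.06), (20, 0.02), (15, 0.02), (5, 0.02)]
    print(round(viral_coefficient(channels), 2))  # roughly 1.4, matching the figure above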

    Lessons Learned

  • Scaling from the start is a requirement on Facebook. They went to 1 million pages/day in 4 weeks.
  • Ruby on Rails can scale.
  • Anything scales on the right architecture. Focus on architecture and operations.
  • You need a good DBA, a good host, and good, well-configured hardware.
  • With caching and the heavy duty servers available today, you can go a long time without adopting more complicated database architectures.
  • The social graph is real. The number of users you can reach on Facebook with the right, well-implemented viral application is truly staggering.
  • Most performance problems are in the database. Look to the database server, the database configuration, the query, or the use and non-use of indexes.
  • People still use Vi!

    I'd really like to thank Siqi for taking the time to answer all my questions and provide this fascinating look into their system. It's amazing what you've done in so little time. Excellent job and thanks again.
    Wednesday
    April 30, 2008

    Rather small site architecture.

    Website stats:

    Webserver: Apache 2.2
    Database: MySQL 5.0
    APC cache for PHP
    CMS: Drupal 6.2 (bleeding-edge version)*
    *Aggressive caching is ON, Page Compression ON, Block Cache ON (can't use CCS), Optimize CSS/JS ON.
    2 servers: Apache/MySQL (low-tech servers - Celeron processors, 512 MB RAM, 7200 RPM HDD)
    Bandwidth: 10 Mb/s

    The benchmark:

    Used ab: ab -n 1000 -c 20 howwhatwho.com

    Server Software:        Apache/2.2.3
    Server Hostname:        howwhatwho.com
    Server Port:            80
    Document Path:          /
    Document Length:        41639 bytes
    Concurrency Level:      20
    Time taken for tests:   13.556796 seconds
    Complete requests:      1000
    Failed requests:        0
    Write errors:           0
    Total transferred:      42118000 bytes
    HTML transferred:       41639000 bytes
    Requests per second:    73.76 [#/sec] (mean)
    Time per request:       271.136 [ms] (mean)
    Time per request:       13.557 [ms] (mean, across all concurrent requests)
    Transfer rate:          3033.90 [Kbytes/sec] received

    The Apache server is also running postfix and bind, although they aren't resource-intensive applications. The cron job for Drupal runs every 50 minutes, and the aggregator module is enabled and fetches more than 30 RSS feeds each time. The site used to be hosted on a single Celeron machine, but at peak times the CPU went up to 80%.

    Question: Does anybody know a website hosted on an IBM Mainframe? :) Todd?


    Tuesday
    April 29, 2008

    Strategy: Sample to Reduce Data Set

    Update: Arjen links to the video Supporting Scalable Online Statistical Processing, which shows "rather than doing complete aggregates, use statistical sampling to provide a reasonable estimate (unbiased guess) of the result." When you have a lot of data, sampling allows you to draw conclusions from a much smaller amount of data. That's why sampling is a scalability solution. If you don't have to process all your data to get the information you need, then you've made the problem smaller: you'll need fewer resources and you'll get more timely results. Sampling is not useful when you need a complete list that matches specific criteria. If you need to know the exact set of people who bought a car in the last week, then sampling won't help. But if you want to know how many people bought a car, then you could take a sample and create an estimate for the full data set. The difference is you won't really know the exact car count. You'll have a confidence interval saying how confident you are in your estimate. We generally like exact numbers. But if running a report takes an entire day because the data set is so large, then taking a sample is an excellent way to scale.
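
    For instance, estimating a count from a random sample and attaching a confidence interval takes only a few lines. A minimal Python sketch using a simple normal approximation and invented numbers:

    import math
    import random

    def estimate_count(population_size, sample, predicate, z=1.96):
        """Estimate how many items in the full population match `predicate`
        from a random sample, with an approximate 95% confidence interval."""
        hits = sum(1 for item in sample if predicate(item))
        p = hits / len(sample)
        # Normal approximation to the binomial for the sampling error.
        stderr = math.sqrt(p * (1 - p) / len(sample))
        estimate = p * population_size
        margin = z * stderr * population_size
        return estimate, margin

    # Invented example: how many of 10 million records are "car buyers"?
    population = 10 * 1000 * 1000
    sample = [random.random() < 0.03 for _ in range(10000)]   # stand-in for real records
    est, margin = estimate_count(population, sample, lambda bought: bought)
    print("about %d, +/- %d" % (est, margin))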


    Tuesday
    April 29, 2008

    High performance file server

    We have a bunch of applications running on Debian servers that process a huge amount of data stored on a shared NFS drive. We have 3 applications working as a pipeline: the first application processes the data and stores the output in a folder on the NFS drive, the second app in the pipeline processes the data from the previous step, and so on. The data load to the pipeline is about 1 GByte per minute. I think the NFS drive is the bottleneck here. Would buying a specialized file server improve the performance of reading and writing data to disk?


    Wednesday
    April 23, 2008

    Behind The Scenes of Google Scalability

    The recent Data-Intensive Computing Symposium brought together experts in system design, programming, parallel algorithms, data management, scientific applications, and information-based applications to better understand existing capabilities in the development and application of large-scale computing systems, and to explore future opportunities. Google Fellow Jeff Dean gave a very interesting presentation on Handling Large Datasets at Google: Current Systems and Future Directions. He discussed:

  • Hardware infrastructure
  • Distributed systems infrastructure: scheduling system, GFS, BigTable, MapReduce
  • Challenges and future directions: infrastructure that spans all datacenters, more automation

    It is really a "How does Google work" presentation in about 60 slides. Check out the slides and the video!


    Tuesday
    April 22, 2008

    Simple NFS failover solution with symbolic link?

    I've been trying to find a high availability file storage solution without success. I tried GlusterFS, which looks very promising, but experienced problems with stability and don't want something I can't easily control and rely on. Other solutions are too complicated or have a SPOF. So I'm thinking of the following setup: two NFS servers, a primary and a warm backup. The primary server will be rsynced with the warm backup every minute or two. I can do it so frequently because a PHP script will know which directories have changed recently from a database and only rsync those. Both servers will be NFS mounted on a cluster of web servers as /mnt/nfs-primary (symlinked as /home/websites) and /mnt/nfs-backup. I'll then use Ucarp (http://www.ucarp.org/project/ucarp) to monitor both NFS servers' availability every couple of seconds, and when one goes down, the Ucarp up script will be set to change the symbolic link on all web servers for the /home/websites dir from /mnt/nfs-primary to /mnt/nfs-backup. The rsync script will then switch, and the backup NFS will become primary and back up to the previous primary when it gets back online. Can it really be this simple or am I missing something? Just setting up a trial system now but would be interested in feedback. :) Also, I can't figure out whether it's best to use NFS v3 or v4 these days.
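
    The key step in that plan - the Ucarp up script repointing /home/websites from one mount to the other - can be done atomically by creating a new symlink and renaming it over the old one. A minimal Python sketch of such a script; the paths match the setup described above, but the script itself is hypothetical:

    #!/usr/bin/env python
    # Hypothetical Ucarp "up" script: repoint /home/websites at the backup NFS mount.
    # os.rename() over an existing symlink is atomic on POSIX, so web servers never
    # see a missing /home/websites during the switch.
    import os

    LINK = "/home/websites"
    TARGET = "/mnt/nfs-backup"   # or /mnt/nfs-primary when failing back

    tmp_link = LINK + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(TARGET, tmp_link)
    os.rename(tmp_link, LINK)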


    Monday
    April 21, 2008

    Using Google AppEngine for a Little Micro-Scalability

    Over the years I've accumulated quite a rag tag collection of personal systems scattered wide across a galaxy of different servers. For the past month I've been on a quest to rationalize this conglomeration by moving everything to a managed service of one kind or another. The goal: lift a load of worry from my mind. I like to do my own stuff myself so I learn something and have control. Control always comes with headaches and it was time for a little aspirin. As part of the process GAE came in handy as a host for a few Twitter related scripts I couldn't manage to run anywhere else. I recoded my simple little scripts into Python/GAE and learned a lot in the process.

    In the move I exported HighScalability from a VPS and imported it into a shared hosting service. I could never quite configure Apache and MySQL well enough that they wouldn't spike memory periodically and crash the VPS. And since the VPS did not automatically restart after a memory crash, that was unacceptable. I also wrote a script to convert a few thousand pages of JSPWiki to MediaWiki format, moved from my own mail server, moved all my code to a hosted SVN server, and moved a few other blogs and static sites along the way. No, it wasn't very fun.

    One service I had a problem moving was http://innertwitter.com because of two scripts it used. In one script (Perl) I log in to Twitter and download the most recent tweets for an account and display them on a web page. In another script (Java) I periodically post messages to various Twitter accounts. Without my own server I had nowhere to run these programs. I could keep a VPS but that would cost a lot and I would still have to worry about failure. I could use AWS but the cost of a fault tolerant system would be too high for my meager needs. I could rewrite the functionality in PHP and use a shared hosting account, but I didn't want to go down the PHP road. What to do?

    Then Google AppEngine was announced and I saw an opportunity to kill two stones with one bird: learn something while doing something useful. With no Python skills I just couldn't get started, so I ordered Learning Python by Mark Lutz. It arrived a few days later and I read it over an afternoon. I knew just enough Python to get started and that was all I needed. Excellent book, BTW. My first impression of Python is that it is a huge language. It gives you a full plate of functional and object oriented dishes and it will clearly take a while to digest. I'm pretty language agnostic so I'm not much of a fan boy of any language. A lot of people are quite passionate about Python. I don't exactly understand why, but it looks like it does the job and that's all I really care about.

    Basic Python skills in hand, I ran through the GAE tutorial. Shockingly it all just worked. They kept it very basic, which is probably why it worked so well. With little ceremony I was able to create a site, access the database, register the application, upload the application, and then access it over the web. To get to the same point using AWS was *a lot* harder.

    Time to take off the training wheels. In the same way understanding a foreign language is a lot easier than speaking it, I found writing Python from scratch a lot harder than simply reading/editing it. I'm sure I'm committing all the classic noob mistakes. The indenting thing is a bit of a pain at first, but I like the resulting clean looking code. Not using semicolons at the end of a line takes getting used to. I found the error messages none too helpful. Everything was a syntax error. Sorry folks, statically typed languages are still far superior in this regard. But the warm fuzzy feeling you get from changing code and immediately running it never gets old.

    My first task was to get recent entries from my Twitter account. My original Perl code looks like:

    use strict;
    use warnings;
    use CGI;
    use LWP;
    eval 
    { 
       my $query = new CGI;
       print $query->header;
       my $callback= $query->param("callback");
       my $url= "http://twitter.com/statuses/replies.json";
       my $ua= new LWP::UserAgent;
       $ua->agent("InnerTwitter/0.1" . $ua->agent);
       my $header= new HTTP::Headers;
       $header->authorization_basic("user", "password");
       my $req= new HTTP::Request("GET", $url, $header); 
       my $res= $ua->request($req);
       if ($res->is_success) 
       { print "$callback(" . $res->content . ")"; } 
       else 
       {
          my $msg= $res->error_as_HTML();
          print $msg;
       }
    };
    
    My strategy was to try and do a pretty straightforward replacement of Perl with Python. From my reading, URL Fetch was what I needed to make the JSON call. Well, the documentation for URL Fetch is nearly useless. There's not a practical "help get stuff done" line in it. How do I perform authorization, for example? Eventually I hit on:
     # Imports for this handler (omitted in the original snippet).
     import base64

     from google.appengine.ext import webapp
     from google.appengine.ext.webapp.util import run_wsgi_app
     from google.appengine.api import urlfetch

     class InnerTwitter(webapp.RequestHandler):
        def get(self):
           self.response.headers['Content-Type'] = 'text/plain'
           callback = self.request.get("callback")
           base64string = base64.encodestring('%s:%s' % ("user", "password"))[:-1]
           headers = {'Authorization': "Basic %s" % base64string}
           url = "http://twitter.com/statuses/replies.json"
           result = urlfetch.fetch(url, method=urlfetch.GET, headers=headers)
           self.response.out.write(callback + "(" + result.content + ")")

     def main():
       application = webapp.WSGIApplication(
                                            [('/innertwitter', InnerTwitter)],
                                            debug=True)
       run_wsgi_app(application)

     if __name__ == '__main__':
       main()
    
    For me the Perl code was easier simply because there is example code everywhere. Perhaps Python programmers already know all this stuff so it's easier for them. I eventually figured out the WSGI stuff is standard and there was documentation available. Once I figured out what I needed to do, the code is simple and straightforward. The one thing I really dislike is passing self around - it just feels bolted-on to me - but other than that I like it. I also like the simple mapping of URL to handler. As an early CGI user I could never quite understand why moderns need a framework to "route" to URL handlers. This approach hits just the right level of abstraction for me.

    My next task was to write a string to a Twitter account. Here's my original Java code:
    private static void sendTwitter(String username)
        {
            username+= "@domain.com";
            String password = "password";
            
            try
            {
                String chime= getChimeForUser(username);
                String msg= "status=" + URLEncoder.encode(chime);
                msg+= "&source=innertwitter";
                URL url = new URL("http://twitter.com/statuses/update.xml");
                URLConnection conn = url.openConnection();
                conn.setDoOutput(true); // set POST
                conn.setUseCaches (false);
                conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
                conn.setRequestProperty("CONTENT_LENGTH", "" + msg.length()); 
                String credentials = new sun.misc.BASE64Encoder().encode((username
                        + ":" + password).getBytes());
                conn.setRequestProperty("Authorization", "Basic " + credentials);
                OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
                wr.write(msg);
                wr.flush();
                wr.close();
                BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                String line = "";
                while ((line = rd.readLine()) != null)
                {
                    System.out.println(line);
                }
    
            } catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    
        private static String getChimeForUser(String username)
        {
            Date date = new Date();
            Format formatter = new SimpleDateFormat("........hh:mm EEE, MMM d");
            String chime= "........*chime*                       " + formatter.format(date);
            return chime;
        }
    
    Here’s my Python translation:
     # Additional imports used below (omitted in the original snippet).
     import datetime
     import urllib

     class SendChime(webapp.RequestHandler):
       def get(self):
          self.response.headers['Content-Type'] = 'text/plain'
          username =  self.request.get("username")
    
          login =  username
          password = "password"
          chime = self.get_chime()
          payload= {'status' : chime,  'source' : "innertwitter"}
          payload= urllib.urlencode(payload)
    
          base64string = base64.encodestring('%s:%s' % (login, password))[:-1]
          headers = {'Authorization': "Basic %s" % base64string} 
    
          url = "http://twitter.com/statuses/update.xml"
          result = urlfetch.fetch(url, payload=payload, method=urlfetch.POST, headers=headers)
    
          self.response.out.write(result.content)
    
       def get_chime(self):
          now = datetime.datetime.now()
          chime = "........*chime*.............." + now.ctime()
          return chime
     
     def main():
       application = webapp.WSGIApplication(
                                            [('/innertwitter', InnerTwitter),
                                            ('/sendchime', SendChime)],
                                            debug=True)
       run_wsgi_app(application)  # not shown in the original snippet

     if __name__ == '__main__':
       main()
    
    I had to drive the timed execution of this URL from an external cron service, which points out that GAE is still a very limited environment. Start to finish the coding took me 4 hours and the scripts are now running in production. Certainly this is not a complex application in any sense, but I was happy it never degenerated into the all too familiar debug fest where you continually fight infrastructure problems and don’t get anything done. I developed code locally and it worked. I pushed code into the cloud and it worked. Nice. Most of my time was spent trying to wrap my head around how you code standard HTTP tasks in Python/GAE. The development process went smoothly. The local web server and the deployment environment seemed to be in harmony. And deploying the local site into Google’s cloud went without a hitch. The debugging environment is primitive, but I imagine that will improve over time. This wasn’t merely a programming exercise for an overly long and boring post. I got some real value out of this:
  • Hosting for my programs. I didn’t have any great alternatives to solve my hosting problem and GAE fit a nice niche for me.
  • Free. I wouldn’t really mind if it was low cost, but since most of my stuff never makes money I need to be frugal.
  • Scalable. I don’t have to worry about overloading the service.
  • Reliable. I don’t have to worry about the service going down and people not seeing their tweets or getting their chimes.
  • Simple. The process was very simple and developer friendly. AWS will be the way to go for “real” apps, but for simpler apps a lighter weight approach is refreshing. One can see the GUI layer in GAE and the service layer in AWS.

    GAE offers a kind of micro-scalability. All the little things you didn’t have a place to put before can now find a home. And as they grow up they might just find they like staying around for a little of momma’s home cooking.

    Related Articles

  • How SimpleDB Differs from a RDBMS
  • Google AppEngine – A Second Look
  • Is App Tone Enough? at Appistry.


    Monday
    April 21, 2008

    The Search for the Source of Data - How SimpleDB Differs from a RDBMS

    Update 2: Yurii responds with the Top 10 Reasons to Avoid Document Databases FUD. Update: Top 10 Reasons to Avoid the SimpleDB Hype by Ryan Park provides a well written counter take. Am I really that fawning? If so, doesn't that make me a dear?

    All your life you've used a relational database. At the tender age of five you banged out your first SQL query to track your allowance. Your RDBMS allegiance was just assumed, like your politics or religion would have been assumed 100 years ago. They now say--you know them--that relations won't scale and we have to do things differently. New databases like SimpleDB and BigTable are what's different. As a long time RDBMS user what can you expect of SimpleDB? That's what Alex Tolley of MyMeemz.com set out to discover. Like many brave explorers before him, Alex gave a report of his adventures to the Royal Society of the AWS Meetup. Alex told a wild, almost unbelievable tale of cultures and practices so different from our own you almost could not believe him. But Alex brought back proof.

    Using a relational database is a no-brainer when you have a big organization behind you. Someone else worries about the scaling, the indexing, backups, and so on. When you are out on your own there's no one to hear you scream when your site goes down. In these circumstances you just want a database that works and that you never have to worry about again. That's what attracted Alex to SimpleDB. It's trivial to set up and use, no schema is required, you can insert data on the fly with no upfront preparation, and it will scale with no work on your part. You become free from DIAS (Database Induced Anxiety Syndrome). You don't have to think about or babysit your database anymore. It will just work. And from a business perspective your database becomes a variable cost rather than a high fixed cost, which is excellent for the angel food funding. Those are very nice features in a database. But for those with a relational database background there are some major differences that take getting used to.

    No schema. You don't have to define a schema before you use the database. SimpleDB is an attribute-value store and you can use any attributes you like, any time you like. It doesn't care. Very different from the Victorian world of the RDBMS.

    No joins. In relational theory the goal is to minimize update and deletion anomalies by normalizing your data into separate tables related by keys. You then join those tables together when you need the data back. In SimpleDB there are no joins. For many-to-1 relationships this works out great: attributes can have multiple values, so there's no need to do a join to recover all the values. They are stored together. For many-to-many relationships life is not so simple. You must code them by hand in your program. This is a common theme in SimpleDB. What the RDBMS does for you automatically must generally be coded by hand with SimpleDB. The wages of scale are more work for the programmer. What a surprise.

    Two step query process. In a RDBMS you can select which columns are returned in a query. Not so in SimpleDB. In a query SimpleDB just returns back a record ID, not the values of the record. You need to make another trip to the database to get the record contents. So to minimize your latency you would need to spawn off multiple threads. See, more work for the programmer.

    No sorting. Records are not returned in a sorted order. Values for multi-value attribute fields are not returned in sorted order. That means if you want sorted results you must do the sorting, and it also means you must get all the results back before you can do the sorting. More work for the programmer.

    Broken cursor. SimpleDB only returns back 250 results at a time. When there are more results you cursor through the result set using a token mechanism. The kicker is you must iterate through the result set sequentially, so iterating through a large result set will take a while. And you can't use your secret EC2 weapon of massive cheap CPU to parallelize the process. More work for the programmer, because you have to move logic to the write part of the process instead of the read part - you'll never be able to read fast enough to perform your calculations in a low latency environment.

    The promise of scaling is fulfilled. Alex tested retrieving 10 record ids from 3 different database sizes. Using a 1K record database it took an average of 141 msecs to retrieve the 10 record ids. For a 100K record database it took 266 msecs on average. For a 1000K record database it took an average of 433 msecs. It's not fast, but it is relatively consistent. That seems to be a theme with these databases. BigTable isn't exactly a speed demon either. One could conclude that for certain needs at least, SimpleDB scales sufficiently well that you can feel comfortable your database won't bottleneck your system or cause it to crash under load.

    If you have a complex OLAP style database SimpleDB is not for you. But if you have a simple structure, you want ease of use, and you want it to scale without your ever lifting a finger again, then SimpleDB makes sense. The cost is that everything you currently know about using databases is useless, and all the cool things we take for granted that a database does, SimpleDB does not do. SimpleDB shifts work out of the database and onto programmers, which is why the SimpleDB programming model sucks: it requires a lot more programming to do simple things. I'll argue however that this is the kind of suckiness programmers like. Programmers like problems they can solve with more programming. We don't even care how twisted and inelegant the code is because we can make it work. And as long as we can make it work we are happy. What programmers can't do is make the database scalable through more programming. Making a database scalable is not a solvable problem through more programming. So for programmers the right trade off was made: a scalable database you don't have to worry about, in exchange for more programming work you already know how to do. How does that sound?
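
    The two step query process described above - query for record IDs, then fetch each record's attributes, spawning threads to hide the latency - looks roughly like this. The client functions sdb_query and sdb_get_attributes are hypothetical stand-ins for whatever SimpleDB library is in use:

    from concurrent.futures import ThreadPoolExecutor

    def fetch_items(domain, query, sdb_query, sdb_get_attributes, max_workers=10):
        """Two round trips: the query returns only item IDs, and the attributes
        need a second call per item. Fetch the attribute records in parallel
        threads to keep latency down, then sort client-side, since SimpleDB
        returns results unsorted."""
        item_ids = sdb_query(domain, query)                     # step 1: IDs only
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            records = list(pool.map(lambda i: sdb_get_attributes(domain, i), item_ids))
        return sorted(records, key=lambda r: r.get("created_at", ""))  # hypothetical sort key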

    Related Articles

  • The new attack on the RDBMS by techno.blog("Dion")
  • The End of an Architectural Era (It’s Time for a Complete Rewrite) - a really fascinating paper bolstering many of the anti-RDBMS threads that have popped up on the intertube.


    Monday
    April 21, 2008

    Google App Engine - what about existing applications?

    Recently, Google announced Google App Engine, another announcement in the rapidly growing world of cloud computing. This brings up some very serious questions:

    1. If we want to take advantage of one of the clouds, are we doomed to be locked in for life?
    2. Must we re-write our existing applications to use the cloud?
    3. Do we need to learn a brand new technology or language for the cloud?

    This post presents a pattern that will enable us to abstract our application code from the underlying cloud provider infrastructure. This will enable us to easily migrate our EXISTING applications to a cloud based environment, thus avoiding the need for a complete re-write.
