Strategy: Sample to Reduce Data Set

Update: Arjen links to the video Supporting Scalable Online Statistical Processing, which shows that "rather than doing complete aggregates, use statistical sampling to provide a reasonable estimate (unbiased guess) of the result." When you have a lot of data, sampling lets you draw conclusions from a much smaller amount of data. That's why sampling is a scalability solution. If you don't have to process all your data to get the information you need, you've made the problem smaller: you'll need fewer resources and you'll get more timely results. Sampling is not useful when you need a complete list that matches specific criteria. If you need to know the exact set of people who bought a car in the last week, sampling won't help. But if you want to know how many people bought a car, you could take a sample and create an estimate for the full data set. The difference is that you won't know the exact car count; instead you'll have a confidence interval saying how confident you are in your estimate. We generally like exact numbers. But if running a report takes an entire day because the data set is so large, taking a sample is an excellent way to scale.
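To make the idea concrete, here is a small sketch (mine, not from the linked video) of estimating a count from a sample, with a normal-approximation confidence interval. The 5% buy rate, the population size, and the sample size are made-up numbers:

```python
import random

def estimate_total(population, sample_size, predicate):
    """Estimate how many records match `predicate` by checking only a sample."""
    sample = random.sample(population, sample_size)
    hits = sum(1 for rec in sample if predicate(rec))
    p = hits / sample_size                 # sample proportion
    estimate = p * len(population)         # scale up to the full data set
    # 95% confidence interval for a proportion (normal approximation)
    stderr = (p * (1 - p) / sample_size) ** 0.5
    margin = 1.96 * stderr * len(population)
    return estimate, margin

random.seed(42)
# 1,000,000 "customers", roughly 5% of whom bought a car
population = [random.random() < 0.05 for _ in range(1_000_000)]
est, margin = estimate_total(population, 10_000, lambda bought: bought)
print(f"estimated buyers: {est:.0f} +/- {margin:.0f}")
```

Checking 10,000 records instead of a million gives an answer that is close, and the margin tells you how close, which is the trade the post describes.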



High performance file server

We have a bunch of applications running on Debian servers that process huge amounts of data stored on a shared NFS drive. Three applications work as a pipeline: the first processes the data and stores its output in a folder on the NFS drive, the second processes the output of the first, and so on. The data load on the pipeline is about 1 GB per minute. I think the NFS drive is the bottleneck here. Would buying a specialized file server improve read/write performance?



Behind The Scenes of Google Scalability

The recent Data-Intensive Computing Symposium brought together experts in system design, programming, parallel algorithms, data management, scientific applications, and information-based applications to better understand existing capabilities in the development and application of large-scale computing systems, and to explore future opportunities. Google Fellow Jeff Dean gave a very interesting presentation, Handling Large Datasets at Google: Current Systems and Future Directions. He discussed:

  • Hardware infrastructure
  • Distributed systems infrastructure: scheduling system, GFS, BigTable, MapReduce
  • Challenges and future directions: infrastructure that spans all datacenters, more automation

It is really a "How does Google work" presentation in about 60 slides. Check out the slides and the video!
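As a rough illustration of the MapReduce model Dean covers, here is a single-process word-count sketch (my own toy version, not Google's distributed implementation): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase folds each group:

```python
from collections import defaultdict

def map_phase(documents, mapper):
    # Run the user-supplied mapper over every input record,
    # collecting the (key, value) pairs it emits.
    pairs = []
    for doc in documents:
        pairs.extend(mapper(doc))
    return pairs

def reduce_phase(pairs, reducer):
    # Group intermediate values by key (the "shuffle"),
    # then apply the reducer to each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count
docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = map_phase(docs, lambda doc: [(w, 1) for w in doc.split()])
counts = reduce_phase(pairs, lambda key, values: sum(values))
print(counts["the"])  # 3
```

The real system's contribution is that the map and reduce steps run across thousands of machines with the shuffle, scheduling, and fault tolerance handled by the framework; the programming model itself is this small.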



Simple NFS failover solution with symbolic link?

I've been trying to find a high availability file storage solution without success. I tried GlusterFS, which looks very promising, but I experienced problems with stability and don't want something I can't easily control and rely on. Other solutions are too complicated or have a SPOF. So I'm thinking of the following setup: two NFS servers, a primary and a warm backup. The primary server will be rsynced to the warm backup every minute or two. I can do it that frequently because a PHP script will know from a database which directories have changed recently and only rsync those. Both servers will be NFS mounted on a cluster of web servers as /mnt/nfs-primary (symlinked as /home/websites) and /mnt/nfs-backup. I'll then use Ucarp to monitor both NFS servers' availability every couple of seconds, and when one goes down, the Ucarp up script will change the symbolic link for /home/websites on all web servers from /mnt/nfs-primary to /mnt/nfs-backup. The rsync script will then switch direction: the backup NFS server becomes primary and syncs back to the previous primary when it comes back online. Can it really be this simple, or am I missing something? I'm just setting up a trial system now but would be interested in feedback. :) Also, I can't find out whether it's best to use NFS v3 or v4 these days.
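For what it's worth, the symlink switch the up script would perform can be done atomically, so web servers never observe a missing /home/websites mid-switch. A sketch of the idea (the paths below are scratch-directory stand-ins for the mounts described above, not the real ones):

```python
import os
import tempfile

def repoint_symlink(link_path, new_target):
    # Build the new symlink under a temporary name, then rename() it over
    # the old one. rename is atomic on POSIX, so readers always see either
    # the old target or the new one, never a missing link.
    tmp = link_path + ".swap"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_target, tmp)
    os.rename(tmp, link_path)

# Demo in a scratch directory; these stand in for /home/websites,
# /mnt/nfs-primary and /mnt/nfs-backup.
demo = tempfile.mkdtemp()
primary = os.path.join(demo, "nfs-primary")
backup = os.path.join(demo, "nfs-backup")
os.mkdir(primary)
os.mkdir(backup)
link = os.path.join(demo, "websites")
os.symlink(primary, link)
repoint_symlink(link, backup)   # what the Ucarp "up" script would do
print(os.readlink(link))
```

Replacing the link in place (delete then recreate) would leave a window where the path doesn't exist; the rename trick avoids that.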



Using Google AppEngine for a Little Micro-Scalability

Over the years I've accumulated quite a rag tag collection of personal systems scattered wide across a galaxy of different servers. For the past month I've been on a quest to rationalize this conglomeration by moving everything to a managed service of one kind or another. The goal: lift a load of worry from my mind. I like to do my own stuff myself so I learn something and have control. Control always comes with headaches and it was time for a little aspirin. As part of the process GAE came in handy as a host for a few Twitter related scripts I couldn't manage to run anywhere else. I recoded my simple little scripts into Python/GAE and learned a lot in the process. In the move I exported HighScalability from a VPS and imported it into a shared hosting service. I could never quite configure Apache and MySQL well enough that they wouldn't spike memory periodically and crash the VPS. And since a memory crash did not automatically restart the server, that was unacceptable. I also wrote a script to convert a few thousand pages of JSPWiki to MediaWiki format, moved from my own mail server, moved all my code to a hosted SVN server, and moved a few other blogs and static sites along the way. No, it wasn't very fun. One service was a problem to move because of two scripts it used. In one script (Perl) I log in to Twitter, download the most recent tweets for an account, and display them on a web page. In another script (Java) I periodically post messages to various Twitter accounts. Without my own server I had nowhere to run these programs. I could keep a VPS, but that would cost a lot and I would still have to worry about failure. I could use AWS, but the cost of a fault tolerant system would be too high for my meager needs. I could rewrite the functionality in PHP and use a shared hosting account, but I didn't want to go down the PHP road. What to do?
Then Google AppEngine was announced and I saw an opportunity to kill two birds with one stone: learn something while doing something useful. With no Python skills I just couldn't get started, so I ordered Learning Python by Mark Lutz. It arrived a few days later and I read it over an afternoon. I knew just enough Python to get started and that was all I needed. Excellent book, BTW. My first impression of Python is that it is a huge language. It gives you a full plate of functional and object oriented dishes and it will clearly take a while to digest. I'm pretty language agnostic so I'm not much of a fan boy of any language. A lot of people are quite passionate about Python. I don't exactly understand why, but it looks like it does the job and that's all I really care about. Basic Python skills in hand, I ran through the GAE tutorial. Shockingly, it all just worked. They kept it very basic, which is probably why it worked so well. With little ceremony I was able to create a site, access the database, register the application, upload the application, and then access it over the web. Getting to the same point using AWS was *a lot* harder. Time to take off the training wheels. In the same way that understanding a foreign language is a lot easier than speaking it, I found writing Python from scratch a lot harder than simply reading or editing it. I'm sure I'm committing all the classic noob mistakes. The indenting thing is a bit of a pain at first, but I like the resulting clean looking code. Not using semi-colons at the end of a line takes getting used to. I found the error messages none too helpful. Everything was a syntax error. Sorry folks, statically typed languages are still far superior in this regard. But the warm fuzzy feeling you get from changing code and immediately running it never gets old. My first task was to get recent entries from my Twitter account. My original Perl code looks like:

use strict;
use warnings;
use CGI;
use LWP;   # pulls in LWP::UserAgent, HTTP::Request, HTTP::Headers

my $query = new CGI;
print $query->header;
my $callback = $query->param("callback");
my $url = "";
my $ua = new LWP::UserAgent;
$ua->agent("InnerTwitter/0.1 " . $ua->agent);
my $header = new HTTP::Headers;
$header->authorization_basic("user", "password");
my $req = new HTTP::Request("GET", $url, $header);
my $res = $ua->request($req);
if ($res->is_success) {
    print "$callback(" . $res->content . ")";
}
else {
    my $msg = $res->error_as_HTML();
    print $msg;
}
My strategy was to do a pretty straightforward replacement of Perl with Python. From my reading, URL fetch was what I needed to make the JSON call. Well, the documentation for URL fetch is nearly useless. There's not a practical "help get stuff done" line in it. How do I perform authorization, for example? Eventually I hit on:
import base64
import wsgiref.handlers

from google.appengine.api import urlfetch
from google.appengine.ext import webapp

class InnerTwitter(webapp.RequestHandler):
   def get(self):
      self.response.headers['Content-Type'] = 'text/plain'
      callback = self.request.get("callback")
      base64string = base64.encodestring('%s:%s' % ("user", "password"))[:-1]
      headers = {'Authorization': "Basic %s" % base64string}
      url = ""
      result = urlfetch.fetch(url, method=urlfetch.GET, headers=headers)
      self.response.out.write(callback + "(" + result.content + ")")

def main():
   application = webapp.WSGIApplication(
      [('/innertwitter', InnerTwitter)],
      debug=True)
   wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
   main()
For me the Perl code was easier simply because there is example code everywhere. Perhaps Python programmers already know all this stuff, so it's easier for them. I eventually figured out that all the WSGI stuff is standard and there was documentation available. Once I figured out what I needed to do, the code is simple and straightforward. The one thing I really dislike is passing self around. It just says bolt-on to me, but other than that I like it. I also like the simple mapping of URLs to handlers. As an early CGI user I could never quite understand why modern developers need a framework to "route" to URL handlers. This approach hits just the right level of abstraction for me. My next task was to write a string to a Twitter account. Here's my original Java code:
// imports needed:,*,, java.util.Date, java.text.*
private static void sendTwitter(String username) {
    username += "";   // account suffix elided in the original
    String password = "password";
    try {
        String chime = getChimeForUser(username);
        String msg = "status=" + URLEncoder.encode(chime);
        msg += "&source=innertwitter";
        URL url = new URL("");
        URLConnection conn = url.openConnection();
        conn.setDoOutput(true); // makes this a POST
        conn.setUseCaches(false);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setRequestProperty("Content-Length", "" + msg.length());
        String credentials = new sun.misc.BASE64Encoder().encode((username
                + ":" + password).getBytes());
        conn.setRequestProperty("Authorization", "Basic " + credentials);
        OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
        wr.write(msg);
        wr.flush();
        BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = rd.readLine()) != null) {
            // drain the response
        }
        wr.close();
        rd.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

private static String getChimeForUser(String username) {
    Date date = new Date();
    Format formatter = new SimpleDateFormat("........hh:mm EEE, MMM d");
    String chime = "........*chime*                       " + formatter.format(date);
    return chime;
}
Here’s my Python translation:
import datetime
import urllib

class SendChime(webapp.RequestHandler):
   def get(self):
      self.response.headers['Content-Type'] = 'text/plain'
      username = self.request.get("username")

      login = username
      password = "password"
      chime = self.get_chime()
      payload = {'status': chime, 'source': "innertwitter"}
      payload = urllib.urlencode(payload)

      base64string = base64.encodestring('%s:%s' % (login, password))[:-1]
      headers = {'Authorization': "Basic %s" % base64string}

      url = ""
      result = urlfetch.fetch(url, payload=payload, method=urlfetch.POST, headers=headers)
      self.response.out.write(result.content)

   def get_chime(self):
      now =
      chime = "........*chime*.............." + now.ctime()
      return chime

def main():
   application = webapp.WSGIApplication(
      [('/innertwitter', InnerTwitter),
       ('/sendchime', SendChime)],
      debug=True)
   wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
   main()
I had to drive the timed execution of this URL from an external cron service, which points out that GAE is still a very limited environment. Start to finish the coding took me 4 hours and the scripts are now running in production. Certainly this is not a complex application in any sense, but I was happy it never degenerated into the all too familiar debug fest where you continually fight infrastructure problems and don’t get anything done. I developed code locally and it worked. I pushed code into the cloud and it worked. Nice. Most of my time was spent trying to wrap my head around how you code standard HTTP tasks in Python/GAE. The development process went smoothly. The local web server and the deployment environment seemed to be in harmony. And deploying the local site into Google’s cloud went without a hitch. The debugging environment is primitive, but I imagine that will improve over time. This wasn’t merely a programming exercise for an overly long and boring post. I got some real value out of this:
  • Hosting for my programs. I didn’t have any great alternatives to solve my hosting problem and GAE fit a nice niche for me.
  • Free. I wouldn’t really mind if it was low cost, but since most of my stuff never makes money I need to be frugal.
  • Scalable. I don’t have to worry about overloading the service.
  • Reliable. I don’t have to worry about the service going down and people not seeing their tweets or getting their chimes.
  • Simple. The process was very simple and developer friendly. AWS will be the way to go for “real” apps, but for simpler apps a lighter weight approach is refreshing. One can see the GUI layer in GAE and the service layer in AWS. GAE offers a kind of micro-scalability. All the little things you didn’t have a place to put before can now find a home. And as they grow up they might just find they like staying around for a little of momma’s home cooking.

    Related Articles

  • How SimpleDB Differs from a RDBMS
  • Google AppEngine – A Second Look
  • Is App Tone Enough? at Appistry.



    The Search for the Source of Data - How SimpleDB Differs from a RDBMS

    Update 2: Yurii responds with the Top 10 Reasons to Avoid Document Databases FUD. Update: Top 10 Reasons to Avoid the SimpleDB Hype by Ryan Park provides a well written counter take. Am I really that fawning? If so, doesn't that make me a dear? All your life you've used a relational database. At the tender age of five you banged out your first SQL query to track your allowance. Your RDBMS allegiance was just assumed, like your politics or religion would have been assumed 100 years ago. They now say--you know them--that relations won't scale and we have to do things differently. New databases like SimpleDB and BigTable are what's different. As a long time RDBMS user what can you expect of SimpleDB? That's what Alex Tolley of set out to discover. Like many brave explorers before him, Alex gave a report of his adventures to the Royal Society of the AWS Meetup. Alex told a wild almost unbelievable tale of cultures and practices so different from our own you almost could not believe him. But Alex brought back proof. Using a relational database is a no-brainer when you have a big organization behind you. Someone else worries about the scaling, the indexing, backups, and so on. When you are out on your own there's no one to hear you scream when your site goes down. In these circumstances you just want a database that works and that you never have to worry about again. That's what attracted Alex to SimpleDB. It's trivial to setup and use, no schema required, insert data on the fly with no upfront preparation, and it will scale with no work on your part. You become free from DIAS (Database Induced Anxiety Syndrome). You don't have to think about or babysit your database anymore. It will just work. And from a business perspective your database becomes a variable cost rather than a high fixed cost, which is excellent for the angel food funding. Those are very nice features in a database. 
    But for those with a relational database background there are some major differences that take getting used to. No schema. You don't have to define a schema before you use the database. SimpleDB is an attribute-value store and you can use any attributes you like any time you like. It doesn't care. Very different from the Victorian world of the RDBMS. No joins. In relational theory the goal is to minimize update and deletion anomalies by normalizing your data into separate tables related by keys. You then join those tables back together when you need the data. In SimpleDB there are no joins. For many-to-1 relationships this works out great. In SimpleDB attributes can have multiple values, so there's no need to do a join to recover all the values. They are stored together. For many-to-many relationships life is not so simple. You must code them by hand in your program. This is a common theme in SimpleDB. What the RDBMS does for you automatically must generally be coded by hand with SimpleDB. The wages of scale are more work for the programmer. What a surprise. Two step query process. In a RDBMS you can select which columns are returned in a query. Not so in SimpleDB. In a query SimpleDB just returns a record ID, not the values of the record. You need to make another trip to the database to get the record contents. So to minimize your latency you would need to spawn off multiple threads. See, more work for the programmer. No sorting. Records are not returned in sorted order. Values for multi-value attribute fields are not returned in sorted order. That means if you want sorted results you must do the sorting, and it also means you must get all the results back before you can sort. More work for the programmer. Broken cursor. SimpleDB only returns 250 results at a time. When there are more results you cursor through the result set using a token mechanism. The kicker is you must iterate through the result set sequentially.
So iterating through a large result set will take a while. And you can't use your secret EC2 weapon of massive cheap CPU to parallelize the process. More work for the programmer because you have to move logic to the write part of the process instead of the read part because you'll never be able to read fast enough to perform your calculations in a low latency environment. The promise of scaling is fulfilled. Alex tested retrieving 10 record ids from 3 different database sizes. Using a 1K record database it took an average of 141 msecs to retrieve the 10 record ids. For a 100K record database it took 266 msecs on average. For a 1000K record database it took an average of 433 msecs to retrieve the 10 record ids. It's not fast, but it is relatively consistent. That seems to be a theme with these databases. BigTable isn't exactly a speed demon either. One could conclude that for certain needs at least, SimpleDB scales sufficiently well that you can feel comfortable that your database won't bottleneck your system or cause it to crash under load. If you have a complex OLAP style database SimpleDB is not for you. But, if you have a simple structure, you want ease of use, and you want it to scale without your ever lifting a finger ever again, then SimpleDB makes sense. The cost is everything you currently know about using databases is useless and all the cool things we take for granted that a database does, SimpleDB does not do. SimpleDB shifts work out of the database and onto programmers which is why the SimpleDB programming model sucks: it requires a lot more programming to do simple things. I'll argue however that this is the kind of suckiness programmers like. Programmers like problems they can solve with more programming. We don't even care how twisted and inelegant the code is because we can make it work. And as long as we can make it work we are happy. What programmers can't do is make the database scalable through more programming. 
Making a database scalable is not a solvable problem through more programming. So for programmers the right trade off was made. A scalable database you don't have to worry about for more programming work you already know how to do. How does that sound?
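To illustrate the two-step query pattern and the client-side sorting described above, here is a sketch against a stand-in in-memory "domain" (not the real SimpleDB API; the records and attribute names are invented). The thread pool mirrors the suggestion to spawn multiple requests to hide the per-record fetch latency:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a SimpleDB domain: record ids mapped to attribute dicts.
FAKE_DOMAIN = {
    "id1": {"name": "carol", "age": "31"},
    "id2": {"name": "alice", "age": "28"},
    "id3": {"name": "bob",   "age": "45"},
}

def query_ids(predicate):
    # Step 1: the query returns only record ids, not their contents.
    return [rid for rid, attrs in FAKE_DOMAIN.items() if predicate(attrs)]

def get_attributes(rid):
    # Step 2: one more round trip per record to fetch the attributes.
    return FAKE_DOMAIN[rid]

ids = query_ids(lambda attrs: int(attrs["age"]) < 40)
# Fetch records in parallel threads to hide per-record latency...
with ThreadPoolExecutor(max_workers=8) as pool:
    records = list(pool.map(get_attributes, ids))
# ...and sort client-side, since results come back unsorted.
records.sort(key=lambda r: r["name"])
print([r["name"] for r in records])  # ['alice', 'carol']
```

Every line after the query is work the RDBMS would have done for you with `SELECT name, age ... ORDER BY name`, which is exactly the trade the post describes.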

    Related Articles

  • The new attack on the RDBMS by "Dion"
  • The End of an Architectural Era (It’s Time for a Complete Rewrite) - A really fascinating paper bolstering many of the anti-RDBMS threads that have popped up on the intertube.



    Google App Engine - what about existing applications?

    Recently, Google announced Google App Engine, another announcement in the rapidly growing world of cloud computing. This brings up some very serious questions: 1. If we want to take advantage of one of the clouds, are we doomed to be locked in for life? 2. Must we rewrite our existing applications to use the cloud? 3. Do we need to learn a brand new technology or language for the cloud? This post presents a pattern that abstracts our application code from the underlying cloud provider's infrastructure. This will enable us to easily migrate our EXISTING applications to a cloud based environment, avoiding the need for a complete rewrite.



    How to build a real-time analytics system?

    Hello everybody! I am a developer of a website with a lot of traffic. Right now we manage the whole website using perl + postgresql + fastcgi + memcached + mogileFS + lighttpd + round-robin DNS distributed over 5 servers, and I must say it works like a charm: load is stable, everything works very fast, and we record about 8 million pageviews per day. The only problem is with the postgres database, since we have it installed on only one server, and if this server goes down, the whole "cluster" goes down. That's why we have master-to-slave replication, so we still have a backup database, except that when the master goes down all inserts/updates are disabled, so the whole website is read only. But this is not a problem, since this configuration is working for us and we don't have any problems with it. Right now we are planning to build our own analytics service customized for our needs. We tried various software packages but were not satisfied with any of them. We want to build something like Google Analytics that would allow us to create reports in real time, with "drill-down" capability to make interactive reports. We don't need real-time data to be included in the reports - we just need the ability to make different reports very fast. Data can be pre-processed. For example, right now we log requests into plain text log files in the following format: date | hour | user_id | site_id | action_id | some_other_attributes.. There are about 8-9 million requests per day and we want to make real-time reports, for example: - number of hits per day (the simplest) - number of hits by unique users per day - number of hits by unique users on a specific site per day - number of distinct actions by users on a specific site during a defined period (e.g. one month, a period of X months...) etc. You can display any type of report by combining different columns, counting all or only distinct occurrences of certain attributes.
    I know how to parse these log files and calculate any type of report I want, but it takes time. There are about 9 million rows in each daily log file, and if I want to calculate monthly reports I need to parse all the daily log files for one month - meaning I have to parse almost 300 million lines, count what I want, and then display the summary. This can take hours, and sometimes it has to be done in more than one step (e.g. calculating the number of users that have been on site_id=1 but not on site_id=2 - in this case I have to export users on site 1, export users on site 2, and then compare the results and count the differences). If you look at Google Analytics, it calculates any similar report in real time. How do they do it? How can someone build a database that could do something like that? If I put 300 million rows (one month of requests) into a Postgres/MySQL table, selects are even slower than parsing plain text log files with Perl... I am aware that they have a huge number of servers, but I am also aware that they have an even bigger number of hits per day. I have the ability to store and process this kind of analytics on multiple servers at the same time, but I don't have enough knowledge of how to construct software and a database that would be able to do a job like this. Does somebody have any suggestions? A simple example would be great! We already managed to make some sort of a database for the site_id+action_id drilldown, but the problem is with "unique users", which is THE information we need all the time. To calculate unique users during a certain period you have to count all the distinct user_ids during that time period. E.g.: select count(distinct user_id) from ... where date>='2008-04-10' and date<='2008-04-18' - with 9 million rows per day this statement takes about two minutes to complete, and we are not satisfied with that. Thank you for any hint!
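One common approach (my suggestion, not part of the original post) is to pre-aggregate a per-day set of user_ids as the logs arrive, so a distinct-user count over a date range becomes a union of small per-day sets instead of a 300-million-line scan. A minimal sketch of the idea:

```python
from collections import defaultdict

# Pre-processing step: as log lines arrive, bucket user_ids by day.
daily_users = defaultdict(set)

def record_hit(date, user_id):
    daily_users[date].add(user_id)

def unique_users(dates):
    # Union the per-day sets instead of re-scanning the raw logs;
    # a user seen on several days is still counted once.
    seen = set()
    for d in dates:
        seen |= daily_users[d]
    return len(seen)

record_hit("2008-04-10", 1)
record_hit("2008-04-10", 2)
record_hit("2008-04-11", 2)
record_hit("2008-04-11", 3)
print(unique_users(["2008-04-10", "2008-04-11"]))  # 3, not 4
```

The same pattern extends to (site_id, date) or (site_id, action_id, date) buckets, and the per-day sets can live on different servers and be unioned at query time. With millions of users per day the exact sets get large, so in practice approximate structures are often substituted for the sets, trading a small error for fixed memory.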



    Scaling Mania at MySQL Conference 2008

    The 2008 MySQL Conference & Expo has now closed, but what is still open for viewing is all the MySQL scaling knowledge that was shared. Planet MySQL is a great source of the goings on:

  • Scaling out MySQL: Hardware today and tomorrow by Jeremy Cole and Eric Bergen of Proven Scaling. In it are answered all the big questions of life: What about 64-bit? How many cores? How much memory? Shared storage? Finally we learn the secrets of true happiness.
  • Panel Video: Scaling MySQL? Up or Out?. Don't have time? Take a look at the Diamond Note excellent game day summary. Companies like MySQL, Sun, Flickr, Fotolog, Wikipedia, Facebook and YouTube share intel on how many web servers they have, how they handle failure, and how they scale.
  • Kevin Burton in Scaling MySQL and Java in High Write Throughput Environments - How we built Spinn3r shows how they crawl and index 500k posts per hour using MySQL and 40 servers.
  • Venu Anuganti channels Dathan Pattishall's talk on scaling heavy concurrent writes in real time.
  • This time Venu channels Helping InnoDB scale on servers with many cores by Mark Callaghan from Google.
  • Exploring Amazon EC2 for Scale-out Applications by Morgan Tocker, MySQL Canada, and Carl Mercier, Defensio. A RoR based spam filtering service that runs completely on EC2. Shows the evolution from a simple configuration to a sharded architecture.
  • Applied Partitioning and Scaling Your (OLTP) Database System by Phil Hilderbrand.
  • Real World Web: Performance & Scalability by Ask Bjorn Hansen. (189 slides!). He promises you haven't seen this talk before. The secret: Think Horizontal.
  • Too many to list here. All the presentations are available on scribd.



    Mysql scalability and failover...

    Hi, I am the owner of a large community website and currently we are having problems with our database architecture. We use 2 database servers and spread tables across them to divide reads/writes. We have about 90% reads and 10% writes. We use Memcached on all our webservers to cache as much as we can, and traffic is load balanced between webservers. We have 2 extra servers ready to put to use! We have looked into a couple of solutions so far: Continuent uni/cluster aka Sequoia -> the commercial version is way too expensive, and Java isn't as fast as it's supposed to be. MySQL Proxy -> we couldn't find any good example of how to create a master-master setup with failover. MySQL Cluster -> seems not mature enough; we had a lot of performance issues when we tried to go live with it. MySQL DRBD HA -> only good for failover, cannot be scaled! MySQL Replication -> well, don't get me started ;) So now I turn to you guys to help me out. I am tearing my hair out watching the site still growing and performance slowly reaching its limit. Really need your help!! HELP!
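As one small building block for the 90/10 read/write split described above, here is a sketch of an application-side router that sends writes to the master and spreads reads across healthy replicas (the connection names are placeholder strings, not a drop-in replacement for MySQL Proxy or any real failover system):

```python
import random

class SplitRouter:
    """Route writes to the master and reads across replicas,
    skipping replicas that have been marked down."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        self.down = set()

    def for_write(self):
        # All writes go to the single master.
        return self.master

    def for_read(self):
        healthy = [r for r in self.replicas if r not in self.down]
        # Fall back to the master if every replica is down.
        return random.choice(healthy) if healthy else self.master

    def mark_down(self, replica):
        self.down.add(replica)

router = SplitRouter(master="db-master", replicas=["db-slave1", "db-slave2"])
router.mark_down("db-slave1")   # e.g. a health check failed
print(router.for_write(), router.for_read())
```

This only covers read scaling; it does nothing about master failover or replication lag (a read may see slightly stale data right after a write), which is where the solutions the poster evaluated come in.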
