General Discussion

How to build a real-time analytics system?

Hello everybody!

I am a developer of a website with a lot of traffic. Right now we are managing the whole website using perl + postgresql + fastcgi + memcached + mogileFS + lighttpd + roundrobin DNS distributed over 5 servers and I must say it works like a charm, load is stable and everything works very fast and we are recording about 8 million pageviews per day.

The only problem is with postgres database since we have it installed only on one server and if this server goes down, the whole "cluster" goes down. That's why we have a master2slave replication so we still have a backup database except that when the master goes down, all inserts/updates are disabled so the whole website is just read only.

But this is not a problem since this configuration is working for us and we don't have any problems with it.

Right now we are planning to build our own analytics service that would be customized for our needs. We tried various different software packages but were not satisfied with any of them.

We want to build something like Google Analytics so it would allow us to create reports in real-time with "drill-down" possibility to make interactive reports. We don't need real-time data to be included in report - we just need a possibility to make different reports very fast. Data can be pre-processed.

For example right now we are logging requests into plain text log files in the following format:
date | hour | user_id | site_id | action_id | some_other_attributes..

There are about 8 - 9 million requests per day and we want to make real-time reports for example:
- number of hits per day (the simplest)
- number of hits by unique users per day
- number of hits by unique users on specific site per day
- number of distinct actions by users on specific site during defined period (e.g. one month, period of X months...)
etc.

You can display any type of report by combining different columns as well as counting all or only distinct occurrences of certain attributes.

I know how to parse these log files and calculate any type of report I want, but it takes time. There are about 9 million rows in each daily log file and if I want to calculate monthly reports I need to parse all daily log files for one month - meaning I have to parse almost 300 million of lines, count what I want and then display the summary. This can take for hours and sometimes it has to be done in more than one step (e.g. calculating a number of users that have been on site_id=1 but not on site_id=2 - in this case I have to export users on site 1, export users on site 2 and then compare results and count the differences).

If you take a look at Google Analytics it calculates any type of similar report in real-time. How do they do it? How can someone form a database that could do something like that? If I put 300 million of rows (requests per month) into the Postgres/MySQL table, selects are even slower than parsing plain text log files using Perl...

I am aware that they have a huge amount of servers but I am also aware that they have even bigger amount of hits per day. I have a possibility to store and process this kind of analytics on multiple servers at the same time but I don't have enough knowledge how to construct a software and database that would be able to do a job like this.

Does somebody have any suggestion? A simple example would be great! We already managed to make some sort of a database for site_id+action_id drilldown but the problem is with "unique users" which is THE information that we need all the time. To calculate unique users during certain period you have to count all the distinct user_ids during that time period. E.g.: select count(distinct user_id) from ... where date>='2008-04-10' and date <='2008-04-18' - with a 9million rows per day this statement would take about two minutes to complete and we are not satisfied with it.

Thank you for any hint!

How to build a real-time analytics system?

Read more

Kafka 101

Capturing A Billion Emo(j)i-ons

Brief History of Scaling Uber

Behind AWS S3’s Massive Scale