Poppen.de Architecture

This is a guest post by Alvaro Videla describing the architecture of Poppen.de, a popular German dating site. This site is very much NSFW, so be careful before clicking on the link. What I found most interesting is how they manage to successfully blend a little of the old with a little of the new, using technologies like Nginx, MySQL, CouchDB, Erlang, Memcached, RabbitMQ, PHP, Graphite, Red5, and Tsung.

What is Poppen.de?

Poppen.de (NSFW) is the top dating website in Germany, and while it may be a small site compared to giants like Flickr or Facebook, we believe it's a nice architecture to learn from if you are starting to get some scaling problems.

The Stats

  • 2.000.000 users
  • 20.000 concurrent users
  • 300.000 private messages per day
  • 250.000 logins per day
  • We have a team of eleven developers, two designers and two sysadmins for this project.

Business Model

The site works with a freemium model, where users can do the following for free:

  • Search for other users.
  • Write private messages to each other.
  • Upload pictures and videos.
  • Have friends.
  • Video Chat.
  • Much more…

If they want to send unlimited messages or have unlimited picture uploads, they can pay for different kinds of membership according to their needs. The same applies to the video chat and other parts of the web site.

Toolbox

Nginx

Our whole site is served via Nginx. We have 2 frontend Nginx servers delivering 150.000 requests per minute to www.poppen.de during peak time. They are four-year-old machines with one CPU and 3GB of RAM each. Then we have separate machines to serve the site images. There are 80.000 requests per minute to *.bilder.poppen.de (image servers), served by 3 Nginx servers.

One of the cool things that Nginx lets us do is deliver many requests out of Memcached, without hitting the PHP machines to get content that is already cached. For example, user profiles are among the most CPU-intensive pages on the site. Once a profile has been requested, we cache the whole content in Memcached. The Nginx server will then hit Memcached and deliver the content from there. There are 8000 requests per minute delivered out of Memcached.
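
As a rough sketch of the PHP side of this setup (the key convention, TTL, and helper name below are illustrative assumptions, not the actual Poppen.de code), the rendered page just has to be stored under a key that Nginx's memcached module will later look up:

<?php
// Sketch: store the fully rendered profile HTML in Memcached under the
// request URI, so an Nginx "memcached_pass" location using the same key can
// answer repeat requests without ever hitting PHP.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function cacheRenderedProfile(Memcached $memcached, $uri, $html)
{
    // Key convention and TTL are assumptions made for this example.
    $memcached->set($uri, $html, 300); // 5-minute TTL (illustrative)
}

// After the profile template has been rendered:
// cacheRenderedProfile($memcached, '/profile/12345', $renderedHtml);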

We have 3 Nginx servers delivering the images from a local cache. The users upload their pictures to a central file server. A picture request will then hit one of the 3 Nginx servers. If the picture is not in the local cache filesystem, Nginx will download the picture from the central server, store it in its local cache and serve it. This lets us load balance the image distribution and alleviate the load on the main storage machine.

PHP-FPM

The site is running on PHP-FPM. We use 28 PHP machines with two CPUs and 6GB of memory each. They run 100 PHP-FPM worker processes each. We use PHP 5.3.x with APC enabled. The 5.3 version of PHP allowed us to reduce both CPU and memory usage by more than 30%.

The code is written using the symfony 1.2 PHP framework. On one hand this means an extra resource footprint; on the other hand it gives us speed of development and a well-known framework that lets us integrate new developers into the team with ease. Not everything is "flowers and roses" here, though. While we get a lot of advantages from the framework, we had to tweak it a lot to get it up to the task of serving www.poppen.de.

What we did was profile the site using XHProf, Facebook's profiling library for PHP, and then optimize the bottlenecks. Thanks to the fact that the framework is easy to customize and configure, we were able to cache in APC most of the expensive calculations that were adding extra load to the servers.
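
A minimal sketch of what caching an expensive calculation in APC can look like (the function name, cache key, and callback are illustrative assumptions):

<?php
// Sketch of caching an expensive calculation in APC shared memory.
function getCachedValue($cacheKey, $ttl, $callback)
{
    $value = apc_fetch($cacheKey, $success);
    if ($success) {
        return $value;                  // served from APC, no recomputation
    }
    $value = $callback();               // run the expensive calculation once
    apc_store($cacheKey, $value, $ttl); // keep it for $ttl seconds
    return $value;
}

// Example: cache a heavy, rarely changing computation for 10 minutes.
$routes = getCachedValue('app.compiled_routes', 600, function () {
    return compileRoutingTable();       // hypothetical expensive call
});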

MySQL

MySQL is our main RDBMS. We have several MySQL servers. A 32GB machine with 4 CPUs stores all the user-related information, like profiles, picture metadata, etc. This machine is 4 years old, and we are planning to replace it with a sharded cluster. We are still working on the design of this system, trying to have a low impact on our data access code. We want to partition the data by user id, since most of the information on the site is centered on the user itself: images, videos, messages, etc.
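
To illustrate the idea of partitioning by user id, here is one possible shape of such a lookup; the host names and shard count are purely hypothetical, since the actual sharding design was still in progress:

<?php
// Hypothetical sketch: pick a shard with a simple modulo over a fixed list.
$shards = array(
    'mysql://shard0.db.internal/users',
    'mysql://shard1.db.internal/users',
    'mysql://shard2.db.internal/users',
    'mysql://shard3.db.internal/users',
);

function shardForUser($userId, array $shards)
{
    // All of a user's rows (profile, picture metadata, messages, ...) live on
    // the same shard, so single-user pages never need cross-shard queries.
    return $shards[$userId % count($shards)];
}

// echo shardForUser(123456, $shards); // -> mysql://shard0.db.internal/users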

We have 3 machines working in a master-slave-slave configuration for the users' forum. Then there's a cluster of servers that runs as storage for the web site's custom message system. Currently it holds more than 250 million messages. They are 4 machines configured in two master/slave pairs.

We also have an NDB cluster composed of 4 machines for write-intensive data, like the statistics of which user visited which other user's profile.

We try to avoid joins like the plague and cache as much as possible. The data structure is heavily denormalized, and we have created summary tables to ease searching.

Most of the tables are MyISAM, which provides fast lookups. The problem we are seeing more and more often is full-table locks, so we are moving to the XtraDB storage engine.

Memcached

We use Memcached heavily. We have 45GB of cache over 51 nodes. We use it for session storage, view caches (like user profiles), function execution caches (like queries), etc. Most of the primary-key queries we run against the users table are cached in Memcached and then delivered from there. We have a system that automatically invalidates the cache every time a record of that table is modified. One possible way to improve cache invalidation in the future is to use the new Redis Hash API, or MongoDB. With those databases we could update the cache with enough granularity to not need to invalidate it.
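
A minimal sketch of the cache-plus-invalidation idea; the key scheme and the place where the invalidation hook is called are assumptions, not the actual model code:

<?php
// Sketch: cache users by primary key and delete the entry whenever the row changes.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function cacheUser(Memcached $memcached, $userId, array $row)
{
    $memcached->set('user:' . $userId, $row, 3600);
}

function invalidateUser(Memcached $memcached, $userId)
{
    // Called every time a record of the users table is modified, so
    // primary-key lookups never serve stale data.
    $memcached->delete('user:' . $userId);
}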

RabbitMQ

In mid-2009 we introduced RabbitMQ into our stack. It has been easy to deploy and integrate with our system. We run two RabbitMQ servers behind LVS. During the last month we have been moving more and more stuff to the queue, meaning that at the moment the 28 PHP frontend machines are publishing around 500.000 jobs per day. We send logs, email notifications, system messages, image uploads, and much more to the queue.

To enqueue messages we use one of the coolest features provided by PHP-FPM: the fastcgi_finish_request() function. This allows us to send messages to the queue asynchronously. After we generate the HTML or JSON response that must be sent to the user, we call that function, which means the user doesn't have to wait for our PHP script to clean up (closing Memcached connections, DB connections, etc.). At the same time, all the messages that were held in an array in memory are then sent to RabbitMQ, so the user doesn't have to wait for this either.
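
A condensed sketch of that flow; the buffering in an array and the call to fastcgi_finish_request() are as described above, while the publishing helper is hypothetical:

<?php
// Sketch: defer queue publishing until after the response has been sent.
$pendingMessages = array();

// During the request, handlers only append to the in-memory buffer.
$pendingMessages[] = array('type' => 'email.notification', 'user_id' => 42);

// Send the response to the user first...
echo json_encode(array('status' => 'ok'));

// ...then close the connection to the client. Everything below runs while the
// user already has their response.
fastcgi_finish_request();

foreach ($pendingMessages as $message) {
    publishToRabbitMQ('frontend_jobs', json_encode($message)); // hypothetical AMQP helper
}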

We have two machines dedicated to consuming those messages, running at the moment 40 PHP processes in total to consume the jobs. Each PHP process consumes 250 jobs, then dies and is respawned. We do that to avoid any kind of garbage collection problems with PHP. In the future we may increase the number of jobs consumed per session in order to improve performance, since respawning a PHP process proved to be quite CPU-intensive.
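
A rough sketch of such a consumer loop; fetchJob() and handleJob() are hypothetical stand-ins for the RabbitMQ consumption and the actual job handling:

<?php
// Sketch: process a fixed number of jobs, then exit and let a supervisor
// respawn the process with a clean heap, avoiding long-lived PHP memory issues.
$maxJobs = 250;

for ($processed = 0; $processed < $maxJobs; $processed++) {
    $job = fetchJob();   // blocking read from the queue (hypothetical)
    if ($job === null) {
        continue;
    }
    handleJob($job);     // send the email, resize the image, write the log, ...
}

exit(0);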

This system lets us improve resource management. For example, during peak time we can have 1000 logins per minute. This would mean 1000 concurrent updates to the users table to store each user's last login time. Because we now enqueue those queries, we can run them sequentially instead. If we need more processing speed we can add more consumers to the queue, or even join more machines to the cluster, without modifying any configuration or deploying any new code.

CouchDB

To store the logs we run CouchDB on one machine. From there we can query/group the logs by module/action, by kind of error, etc. This has proven useful for finding where problems are. Before having CouchDB as a log aggregator, we had to log in and tail -f on each of the PHP machines and from there try to find where the problem was. Now we relay all the logs to the queue, and a consumer inserts them into CouchDB. In this way we can check for problems in one centralized place.
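
A small sketch of what such a consumer insert can look like against CouchDB's HTTP/JSON API; the database name and document fields are assumptions:

<?php
// Sketch: POSTing a JSON document to the CouchDB database creates one
// document per log line.
function storeLogInCouchDB(array $logEntry)
{
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/json\r\n",
            'content' => json_encode($logEntry),
        ),
    ));
    return file_get_contents('http://127.0.0.1:5984/logs', false, $context);
}

// storeLogInCouchDB(array('module' => 'profile', 'action' => 'show',
//                         'level' => 'error', 'message' => 'template missing'));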

Graphite

We use Graphite to collect real-time information and statistics from the website: requests per module/action, Memcached hits/misses, RabbitMQ status, Unix load of the servers and much more. The Graphite server is getting around 4800 update operations per minute. This tool has proven really useful for seeing what's going on in the site. Its simple text protocol and graphing capabilities make it easy to use and nearly plug-and-play for any system that we want to monitor.
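
For illustration, sending a metric over Graphite's plaintext protocol is as simple as the following sketch (the metric names are made up):

<?php
// Sketch: Graphite's plaintext protocol is one "metric value timestamp" line
// per update, sent to the carbon listener (port 2003 by default).
function sendToGraphite($metric, $value, $host = '127.0.0.1', $port = 2003)
{
    $socket = @fsockopen($host, $port, $errno, $errstr, 1.0);
    if ($socket === false) {
        return false; // monitoring must never break the request path
    }
    fwrite($socket, sprintf("%s %s %d\n", $metric, $value, time()));
    fclose($socket);
    return true;
}

// sendToGraphite('www.requests.profile.show', 1);
// sendToGraphite('memcached.hits', 5123);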

One cool thing that we did with Graphite was monitor two versions of the site running at the same time. Last January we deployed our code backed by a new version of the symfony framework. This meant that we would probably encounter performance regressions. We were able to run one version of the site on half of the servers while the new version was running on the others. Then in Graphite we created Unix load graphs for each half and compared them live.

Since we found that the Unix load of the new version was higher, we launched the XHProf profiler and compared both versions. We use APC to store "flags" that let us enable/disable XHProf without redeploying our code. We have a separate server where we send the XHProf profiles; there we aggregate and analyze them to find where the problems are.
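
A minimal sketch of the APC flag idea; the flag name and the profile-shipping helper are assumptions:

<?php
// Sketch: an APC "flag" turns XHProf on without a deploy. Flipping the flag
// from a small admin script is enough, since APC is shared by all the workers
// on a machine.
if (apc_fetch('flag.xhprof_enabled')) {
    xhprof_enable(XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY);

    register_shutdown_function(function () {
        $profile = xhprof_disable();
        sendProfileToAggregator($profile); // hypothetical: ship to the central profile server
    });
}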

Red5

Our site also serves video to the users. We have two kinds: profile videos, which are movies produced and uploaded by the users, and a video chat that lets our users interact and share their videos. As of mid-2009 we were streaming 17TB of video per month to our users.

Tsung

Tsung is a distributed benchmarking tool written in Erlang. We use it to do HTTP benchmarks and also to compare different storage engines for MySQL that we plan to use, for example the new XtraDB. We have a tool to record traffic to the main MySQL server and convert that traffic to Tsung benchmarking sessions. We then replay that traffic and hit the machines in our lab with thousands of concurrent users generated by Tsung. The cool thing is that we can produce test scenarios that look close to what's happening in the real production environment.

Lessons Learned

  • While Buzz Oriented Development is cool, look for tools with a strong community behind them. Documentation and a good community are invaluable when there are problems to solve, or when you need to bring new people into your team. symfony provides that, with more than 5 official books published that can be obtained for free. CouchDB and RabbitMQ also have good support from their developers, with active mailing lists where questions are answered in time.
  • Get to know what you are using and what the limitations of those systems/libraries are. We learned a lot from symfony: where it could be tweaked and what could be improved. The same can be said about PHP-FPM; just by reading the docs we found the mighty fastcgi_finish_request() function, which proved immensely useful. Another example is Nginx: several problems that we had were already solved by the Nginx community, like the image storage cache we explained above.
  • Extend the tools. If they are working well, there's no need to introduce new software into the current stack. We have written several Nginx modules that have even been tested by the Nginx community. In this way you contribute back.
  • Don't be conservative with what doesn't matter. Graphite seemed to be a cool tool to run in production, but there weren't many reports about it, so we just had to give it a try. If it hadn't worked, we could have just disabled it. Nobody would cry if we couldn't get a nice graph of Unix load in our systems. Today even our Product Managers love it.
  • Measure everything: Unix load, site usage, Memcached hit/miss ratio, requests per module/action, etc. Learn to interpret those metrics.
  • Create tools that let you react to problems as fast as possible. If you have to roll back a deployment, you don't want to spend more than a second doing that.
  • Create tools that let you profile the site live. In the lab most tests give optimistic results, but then fail to cope with production load.

The Future

  • Build a new, more scalable message system, since the current version is quite old and wasn't designed for such a volume of messages.
  • Move more and more processing tasks to the queue system.
  • Add more Erlang applications to our system. RabbitMQ has proven to work well for us, and the same can be said of CouchDB. Both were easy to install and deploy, increasing our trust in Erlang as a language/platform.
  • We are looking for a replacement for Red5, probably the Oneteam Media Server, which is written in Erlang. While at the moment we are using open-source Erlang products, we may start writing our own applications in the language, since we now have experience with it.
  • Improve our log visualization tools.

I'd like to thank Alvaro Videla for this excellent write-up. If you would like to share the architecture of your fabulous system, please contact me and we'll get started.