HackerEarth is a coding skill practice and testing service that, in a series of well-written articles, describes the trials and tribulations of building their site and how they overcame them: Scaling Python/Django application with Apache and mod_wsgi, Programming challenges, uptime, and mistakes in 2013, Post-mortem: The big outage on January 25, 2014, The Robust Realtime Server, 100,000 strong - CodeFactory server, Scaling database with Django and HAProxy, Continuous Deployment System, HackerEarth Technology Stack.
What characterizes these articles and makes them especially helpful is a drive for improvement and an openness towards reporting what didn't work and how they figured out what would work.
As they say, mistakes happen when you are building a complex product with a team of just 3-4 engineers, but investing in infrastructure allowed them to take more breaks and roam the streets of Bangalore while their servers happily served thousands of requests every minute, reaching a 50,000 user base with ease.
Here's a gloss on how they did it:
Current Architecture at HackerEarth:

- Frontend server(s)
- API server(s)
- Code-checker server(s)
- Search server(s) - Apache Solr & Elasticsearch
- Realtime server - written using Tornado
- Status server
- Toolchain server (mainly used for continuous deployment)
- Integration Test server
- Log server
- Memcached server
- A few more servers for data crunching, analytics processing, databases, and background jobs
- RabbitMQ, Celery, etc., which glue many of the servers together
- Monitoring servers
- Databases, which are sharded and load balanced behind HAProxy
- Remove unnecessary Apache modules. Saves memory and improves performance. By including only what you need, you can cut the number of loaded modules in half.
- Use Apache MPM (Multi-Processing Module) worker. Generally a better choice for high-traffic servers because it has a smaller memory footprint than the prefork MPM.
- KeepAlive Off. Static files are served from CloudFront, and experimentation showed this was more efficient: processes/threads are free to handle new requests immediately rather than waiting for a request to arrive on the old connection.
- Daemon Mode of mod_wsgi. The number of threads and processes is constant, which makes resource consumption predictable and protects against traffic spikes.
- Tweaking mpm-worker configuration. They show the configuration they arrived at after much experimentation; it suits their workload, which is more CPU-intensive than memory-intensive.
- Check configuration. Enable modules mod_status.so and mod_info.so to see how Apache is being run. This information helped them significantly reduce the number of servers they had to run and made the application more stable and resilient to traffic bursts.
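Taken together, the Apache tweaks above might look something like the following sketch. The module list, worker counts, process names, and paths here are illustrative assumptions for a CPU-bound Django app, not HackerEarth's actual values:

```apache
# Load only the modules the application actually needs
# (illustrative subset -- everything else stays commented out).
LoadModule mpm_worker_module modules/mod_mpm_worker.so
LoadModule wsgi_module       modules/mod_wsgi.so
LoadModule status_module     modules/mod_status.so
LoadModule info_module       modules/mod_info.so

# Static assets come from CloudFront, so idle keep-alive
# connections would just tie up threads.
KeepAlive Off

# worker MPM tuned for a CPU-bound application
# (numbers are placeholders -- benchmark your own workload).
<IfModule mpm_worker_module>
    StartServers            4
    ServerLimit            16
    ThreadsPerChild        25
    MaxRequestWorkers     400
    MaxConnectionsPerChild 10000
</IfModule>

# mod_wsgi daemon mode: a fixed pool of processes/threads keeps
# resource consumption predictable under traffic spikes.
WSGIDaemonProcess myapp processes=4 threads=16 display-name=%{GROUP}
WSGIProcessGroup  myapp
WSGIScriptAlias   / /srv/myapp/wsgi.py

# Expose runtime status so you can see how Apache is actually behaving.
ExtendedStatus On
<Location "/server-status">
    SetHandler server-status
    Require ip 10.0.0.0/8
</Location>
```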
- Nothing scales automatically. 100% uptime is a constant struggle. Roll up your sleeves and work towards that goal.
- Don’t take pride in running 100 servers. Write better code and tune your system. There's no pride in throwing servers at a large number of requests. This means making sure, for example, that a request doesn't query the database 20 times.
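The "20 queries per request" problem is the classic N+1 query anti-pattern, which is easy to demonstrate. Here is a minimal standalone sketch using the stdlib sqlite3 module; the schema and data are invented for illustration:

```python
import sqlite3

# Invented toy schema: users and their submissions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE submissions (id INTEGER PRIMARY KEY,
                              user_id INTEGER REFERENCES users(id),
                              score INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO submissions VALUES (1, 1, 90), (2, 1, 75), (3, 2, 60);
""")

def scores_n_plus_one():
    """One query for the users, then one more per user: N+1 round trips."""
    users = conn.execute("SELECT id, name FROM users").fetchall()
    result = {}
    for uid, name in users:
        rows = conn.execute(
            "SELECT score FROM submissions WHERE user_id = ?", (uid,)
        ).fetchall()
        result[name] = [score for (score,) in rows]
    return result

def scores_single_query():
    """The same answer from a single JOIN: one round trip total."""
    rows = conn.execute("""
        SELECT u.name, s.score FROM users u
        JOIN submissions s ON s.user_id = u.id
        ORDER BY s.id
    """).fetchall()
    result = {}
    for name, score in rows:
        result.setdefault(name, []).append(score)
    return result

assert scores_n_plus_one() == scores_single_query()
```

In Django terms, this is what select_related() and prefetch_related() do for you: they collapse the per-row queries into a handful of batched ones.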
- Asynchronous code-checker server queueing system. Rewriting the code-checker server queueing system to make it asynchronous significantly reduced the process overhead on their frontend servers.
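The effect of an asynchronous queueing system can be sketched with stdlib primitives: the frontend merely enqueues a job and returns immediately, while a pool of workers does the slow checking. This is a toy model under assumed names, not HackerEarth's actual Tornado/RabbitMQ implementation:

```python
import queue
import threading

jobs = queue.Queue()
results = {}
results_lock = threading.Lock()

def check_code(job_id, source):
    """Stand-in for the slow code-checker run (invented logic)."""
    return {"job_id": job_id, "passed": "return" in source}

def worker():
    while True:
        item = jobs.get()
        if item is None:          # sentinel: shut down this worker
            jobs.task_done()
            break
        job_id, source = item
        outcome = check_code(job_id, source)
        with results_lock:
            results[job_id] = outcome
        jobs.task_done()

def submit(job_id, source):
    """What the frontend does: enqueue and return immediately."""
    jobs.put((job_id, source))
    return {"status": "queued", "job_id": job_id}

# Start a small worker pool, submit jobs, then drain the queue.
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
submit(1, "def f(): return 42")
submit(2, "while True: pass")
jobs.join()                       # wait until all submitted jobs finish
for _ in threads:
    jobs.put(None)                # one sentinel per worker
for t in threads:
    t.join()
```

The frontend's cost per submission drops to a single enqueue; everything slow happens out of the request path.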
- Use Tornado for serious parallel work. In their tests, the socket.io module could not scale past 150 simultaneous connections, and Nowjs leaked file descriptors.
- Shard database and database routers. Sharding the database reduced the overhead on any single database and further reduced query latencies.
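At its core, a sharding router just maps a sharding key to a shard deterministically. A minimal standalone sketch, where the shard names and modulo scheme are illustrative assumptions, not HackerEarth's actual routing:

```python
SHARDS = ["db_shard_0", "db_shard_1", "db_shard_2", "db_shard_3"]

def shard_for_user(user_id: int) -> str:
    """Map a user id to a shard deterministically.

    All rows for one user land on the same shard, so per-user
    queries hit exactly one database.
    """
    return SHARDS[user_id % len(SHARDS)]

class ShardRouter:
    """Sketch in the shape of a Django DATABASE_ROUTERS class:
    Django calls db_for_read/db_for_write to pick a connection alias."""

    def db_for_read(self, model, **hints):
        user_id = hints.get("user_id")
        return shard_for_user(user_id) if user_id is not None else None

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)
```

Modulo hashing is the simplest scheme; it spreads load evenly but makes adding shards painful, which is why consistent hashing or lookup tables are common once the shard count changes.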
- Cache it. Over a million key-value pairs live in memcached, sessions are maintained in Redis, and any other persistent data goes into MySQL or S3, but most data is cached for some suitable lifetime.
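The pattern described is the classic cache-aside read: check the cache, fall back to the slow persistent store on a miss, and store the result with a TTL. A stdlib sketch, where the in-memory dict stands in for memcached and the names are invented:

```python
import time

cache = {}  # key -> (value, expires_at); stands in for memcached

def backing_store(key):
    """Stand-in for the slow persistent store (MySQL/S3)."""
    return f"value-for-{key}"

def cache_get(key, ttl=300, now=time.monotonic):
    """Cache-aside read: serve from cache when fresh, else fetch and cache."""
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if now() < expires_at:
            return value          # cache hit
        del cache[key]            # expired: evict and refetch
    value = backing_store(key)    # cache miss: go to the slow store
    cache[key] = (value, now() + ttl)
    return value
```

The TTL is the "suitable lifetime" knob: long enough to absorb repeated reads, short enough that stale data is tolerable.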
- Deploy continuously. Updating code changes in production manually would have driven them crazy and would have been a total waste of time.