This is a guest post by Jamie Hall, Co-founder & CTO of MocoSpace, describing the architecture for their mobile social network. This is a timely architecture to learn from as it combines several hot trends: it is very large, mobile, and social. What they think is especially cool about their system is: how it optimizes for device/browser fragmentation on the mobile Web; their multi-tiered, read/write, local/distributed caching system; selecting PostgreSQL over MySQL as a relational DB that can scale.
MocoSpace is a mobile social network, with 12 million members and 3 billion page views a month, which makes it one of the most highly trafficked mobile Websites in the US. Members access the site mainly from their mobile phone Web browser, ranging from high end smartphones to lower end devices, as well as the Web. Activities on the site include customizing profiles, chat, instant messaging, music, sharing photos & videos, games, eCards and blogs. The monetization strategy is focused on advertising, on both the mobile and Websites, as well as a virtual currency system and a handful of premium feature upgrades.
- 3 billion page views a month
- Top 4 most trafficked mobile website after MySpace, Facebook and Google (http://www.groundtruth.com/mobile-is-mobile)
- 75% mobile Web, 25% Web
- 12 million members
- 6 million unique visitors a month
- 100k concurrent users
- 12 million photo uploads a month
- 2 million emails received a day, 90% spam, 2.5 million sent a day
- Team of 8 developers, 2 QA, 1 sysadmin
- CentOS + Red Hat
- Resin application server — Java Servlets, JavaServer Pages, Comet
- ActiveMQ's job + message queue, in Red Hat cluster for high availability
- Squid - static content caching, tried Varnish but had stability issues
- JQuery + Ajax
- S3 for user photo & video storage (8 TB) and EC2 for photo processing
- F5 BigIP load balancers - sticky sessions, gzip compression on all pages
- Akamai CDN - 2 TB a day, 250 million requests a day
- Monitoring - Nagios for alerts, Zabbix for trending
- EMC SAN - high IO performance for databases by RAIDing (RAID 10) lots of disks, replacing with high performance Fusion-io solid-state flash ioDrives, much more cost effective
- PowerMTA for mail delivery, Barracuda spam firewalls
- Subversion source control, Hudson for continuous integration
- FFMPEG for Mobile to Web and Web to mobile video transcoding
- Selenium for browser test case automation
- Web tier
- 5x Dell 1950, 2x dual core, 16G RAM
- 5x Dell 6950/R905, 4x dual core, 32G RAM
- Database tier
- 2x Sun Fire X4600 M2 Server, 8x quad core, 256G RAM
- 2x Dell 6950, 4x dual core, 64G RAM
- A big challenge is handling the device/browser fragmentation on the mobile Web - optimizing for a huge range of device capabilities (everything from iPhones with touchscreens to 5 year old Motorola Razrs), screen sizes, lack of / inconsistent Web standards compliance, etc. We abstract out our presentation layer so we can serve pages to all mobile devices from the same code base, and maintain a large device capabilities database (containing things like screen size, supported file types, maximum allowed page sizes, etc) which is used to drive generation of our pages. The database contains capability details for hundreds of devices and mobile browser types.
- Database is sharded based on a user key, with a master lookup table mapping users to shards. We rolled our own query and aggregation layer, allowing us to query and join data across shards, though this is not used frequently. With sharding we sacrifice some consistency, but that's Ok as long as you're not running a bank. We perform data consistency checks offline, in batches, with the goal being eventual consistency. Large tables are partitioned into smaller sub tables for more efficient access, reducing time tables are locked for updates as well as operational maintenance activities. Log shipping used for warm standbys.
- A multi-tiered caching system is used, with data cached locally within the application servers as well as distributed via Memcached. When making an update we don't just invalidate the cache and then re-populate after reading again from the database, rather we update Memcached with the new data and save another trip to the database. When updating the cache an invalidation directive is sent via the messaging queue to the local caches on each of the application servers.
- A distributed message queue is used for distributed server communication, everything from sending messages in realtime between users to system messages such as local cache invalidation directives.
- Dedicated server for building and traversing social graph entirely in memory, used to generate friend recommendations, etc.
- Load balancer used for rolling deploys of new versions of the site without affecting performance/traffic.
- Release every 2 weeks. Longer release cycles = exponential complexity, more difficult to troubleshoot and rollback. Development team responsible for deploying to and managing production systems ¿ ¿you built it, you manage it¿.
- Kickstart used to automate server builds from bare metal. Puppet is used to configure a server to a specific personality i.e. Webserver, database server, cache server etc, as well as to push updated policies to running nodes.
- Make your boxes sweat. Don't be afraid of high system load as long as response times are acceptable. We pack as many as five application server instances on a single box, each running on a different port. Scale up to the high end of commodity hardware before scaling out. Can pick up used or refurbished powerful 4U boxes for cheap.
- Understand where your bottlenecks are in each tier. Are you CPU or IO (disk, network) bound? Database is almost always IO (disk) bound. Once the database doesn't fit in RAM you hit a wall.
- Profile the database religiously. Obsess when it comes to optimizing the database. Scaling Web tier is cheap and easy, database tier is much harder and expensive. Know the top queries on your system, both by execution time and frequency. Track and benchmark top queries with each release, need to catch and address performance issues with the database early. We use the pgFouine log analyzer and new PostgreSQL pg_stat_statements utility for generating profiling snapshots in real-time.
- Design to disable. Be able to configure and turn off anything you release instantly, without requiring a code change or deployment. Load and stress testing are important but nothing like testing with live, production traffic via incremental, phased rollouts.
- Communicate synchronously only when absolutely necessary. When one component or service fails it shouldn't bring down other parts of the system. Do everything you can in the background or as a separate thread or process, don't make the user wait. Update the database in batches wherever possible. Any system making requests outside the network need aggressive monitoring, timeout settings, and failure handling / retries. For example, we found S3 latency and failure rates can be significant, so we queue failed calls and retry later.
- Think about monitoring during design, not after. Every component should produce performance, health, throughput, etc data. Set up alerts when these exceed thresholds. Consolidated graphs showing metrics across all instances, rather than just per instance, are particularly helpful for identifying issues and trends at a glance and should be reviewed daily — if normal operating behavior isn't well understood it's impossible to identify and respond to what isn't. We tried many monitoring systems - Cacti, Ganglia, Hyperic, Zabbix, Nagios, as well as some custom built solutions. Whichever you use the most important thing is to be comfortable with it, otherwise you won't use it enough. It should be easy, using templates, etc to quickly monitor new boxes and services as you throw them up.
- Distributed sessions can be a lot of overhead. Go stateless when you can, but when you can't consider sticky sessions. If the server fails the user loses their state and may need to re-login, but that's rare and acceptable depending on what you need to do.
- Monitor and beware of full/major garbage collection events in Java, which can lock up the whole JVM for up to 30 seconds. We use Concurrent Mark Sweep (CMS) garbage collector, which introduces some additional system overhead, but have been able to eliminate full garbage collections.
- When a site gets large enough it becomes a magnet for spammers and hackers, both on site and from outside via email, etc. Captcha and IP monitoring are not enough. Must invest aggressively in detection and containment systems, internal tools to detect suspicious user behavior and alert and/or attempt to automatically contain.
- Soft delete whenever possible. Mark data for later deletion, rather than deleting immediately. Deletion can be costly, so queue up for after hours, plus if someone makes a mistake and deletes something they shouldn¿t have it¿s easy to rollback.
- N+1 design. Never less than two of anything.
I'd really like to thank Jamie for taking the time write this experience report. It's a really useful contribution for others to learn from. If you would like to share the architecture for your fablous system, please contact me and we'll get started.