AppLovin: Marketing to Mobile Consumers Worldwide by Processing 30 Billion Requests a Day

This is a guest post from AppLovin's VP of engineering, Basil Shikin, on the infrastructure of its mobile marketing platform. Major brands like Uber, Disney, Yelp and Hotels.com use AppLovin's mobile marketing platform. It processes 30 billion requests a day and 60 terabytes of data a day.

AppLovin's marketing platform provides marketing automation and analytics for brands who want to reach their consumers on mobile. The platform enables brands to use real-time data signals to make effective marketing decisions across one billion mobile consumers worldwide.

Core Stats

30 Billion ad requests per day

300,000 ad requests per second, peaking at 500,000 ad requests per second

5ms average response latency

3 Million events per second

60TB of data processed daily

~1000 servers

9 data centers

~40 reporting dimensions

500,000 metrics data points per minute

1 Pb Spark cluster

15GB/s peak disk writes across all servers

9GB/s peak disk reads across all servers

Founded in 2012, AppLovin is headquartered in Palo Alto, with offices in San Francisco, New York, London and Berlin.

Technology Stack

Third Party Services

Github for code

Asana for project management

HipChat for communication

Data Storage

Aerospike for user profile storage

Vertica for aggregated statistics and real-time reporting

Aggregating 350,000 rows per second and writing to Vertica at 34,000 rows per second

Peak 12,000 user profiles per second written to Aerospike

MySQL for ad data

Spark for offline processing and deep data analysis

Redis for basic caching

Thrift for all data storage and transfers

Each data point replicated in 4 data centers

Each service is replicated at least in 2 data centers (at most in 8)

Amazon Web Services used for long term data storage and backups

Core App And Services

Custom C/C++ Nginx module for high performance ad serving

Java for data processing and auxiliary services

PHP / Javascript for UI

Jenkins for continuous integration and deployment

Zookeeper for distributed locking

HAProxy and IPVS for high availability

Coverity for Java/C++ static code analysis

Checkstyle and PMD for PHP static code analysis

Syslog for DC-centralized log server

Hibernate for transaction-based services

Servers and Provisioning

Ubuntu

Cobbler for bare metal provisioning

Chef for configuring servers

Berkshelf for Chef dependencies

Docker with Test Kitchen for running infrastructure tests

A combination of software (ipvs, haproxy) and hardware load balancers. Plan to gradually move away from traditional hardware load balancers.

Monitoring Stack

Server Monitoring

Icinga for all servers

~100 custom Nagios plugins for deep server monitoring

550 various probes per server

Graphite as data format

Grafana for displaying all graphs

PagerDuty for issue escalation

Smokeping for network mesh monitoring

Application Monitoring

VividCortex for MySQL monitoring

JSON /health endpoint on each service

Cross-DC database consistency monitoring

9 4K 65” TVs for showing all graphs across the office

Statistical deviation monitoring

Fraudulent users monitoring

Third-party systems monitoring

Deployments are recorded in all graphs

Intelligent Monitoring

Intelligent alerting system with a feedback loop: a system that can introspect anything can learn anything

Third-party stats about AppLovin are also monitored

Alerting is a cross-team exercise: developers, ops, business, data scientists are involved

Architecture Overview

General Considerations

Store everything in RAM

If it does not fit, save it to SSD

L2 cache level optimizations matter

Use right tool for the right job

Architecture allows swapping any component

Upgrade only if an alternative is 10x better

Write your own components if there is nothing suitable out there

Replicate important data at least 3x

Make sure every message can be re-played without data corruption

Automate everything

Zero-copy message passing

Message Processing

Custom message processing system that guarantees message delivery

3x replication for each message

Sending a message = writing to disk

Any service may fail, but no data are lost

Message dispatching system connects all components together, provides isolation and extensibility of the system

Cross-DC failure tolerance

Ad Serving

Nginx is really fast: can serve an ad in less than a millisecond

Keep all ad serving data in memory: read only

jemalloc gave a 30% speed improvement

Use Aerospike for user profiles: less than 1ms to fetch a profile

Pre-compute all ad serving data on one box and dispatch across all servers

Torrents are used to propagate serving data across all servers. Using Torrents resulted in 83% network load drop on the originating server compared to HTTP-based distribution.

mmap is used to share ad serving data across nginx processes

XXHash is the fastest hash function with a low collision rate. 75x faster than SHA-1 for computing checksums

5% of real traffic goes to staging environment

Ability to run 3 A/B tests at once (20%/20%/10% of traffic for three separate tests, 50% for control)

A/B test results are available in regular reporting

Data Warehouse

All data are replicated

Running most reports takes under 2 seconds

Aggregation is key to allow fast reports on large amounts of data

Non-aggregated data for the last 48 hours is usually to resolve most issues

7 days of raw logs is usually enough for debug

Some reports must be pre-computed

Always think multiple data centers: every data point goes to a multiple locations

Backup in S3 for catastrophic failures

All raw data are stored in Spark cluster

Team

Structure

70 full-time employees

15 developers (platform, ad serving, frontend, mobile)

4 data scientists

5 dev. ops.

Engineers in Palo Alto, CA

Business in San Francisco, CA

Offices in New York, London and Berlin

Interaction

HipChat to discuss most issues

Asana for project-based communication

All code is reviewed

Frequent group code reviews

Quarterly company outings

Regular town hall meetings with CEO

All engineers (junior to CTO) write code

Interviews are tough: offers are really rare

Development Cycle

Developers, business side or data science team comes up with an idea

Idea is reviewed and scheduled to be executed on a Monday meeting

Feature is implemented in a branch; development environment is used for basic testing

A pull request is created

Code is reviewed and iterated upon

For big features group code reviews are common

The feature gets merged to master

The feature gets deployed to staging with the next build

The feature gets tested on 5% real traffic

Statistics are examined

If the feature is successful it graduates to production

Feature is closely monitored for couple days

Avoiding Issues

  • The system is designed to handle failure of any component

  • No failure of a single component can harm ad serving or data consistency

  • Omniscient monitoring

  • Engineers watch and analyze key business reports

  • High quality of code is essential

  • Some features take multiple code reviews and iterations before graduating

  • Alarms are triggered when:

    • Stats for staging are different from production

    • FATAL errors on critical services

    • Error rate exceeds threshold

    • Any irregular activity is detected

  • data are never dropped

  • Most log lines can be easily parsed

  • Rolling back of any change is easy by design

  • After every failure: fix, make sure same thing does not happen in the future, and add monitoring

Lessons Learned

Product Development

Being able to swap any component easily is key to growth

Failures drive innovative solutions

Staging environment is essential: always be ready to loose 5%

A/B testing is essential

Monitor everything

Build intelligent alerting system

Engineers should be aware of business goals

Business people should be aware of limitations of engineering

Make builds and continuous integration fast. Jenkins run on a 2 bare metal servers with 32 CPU, 128G RAM and SSD drives

Infrastructure

Monitoring all data points is critical

Automation is important

Every component should support HA by design

Kernel optimizations can have up to 25% performance improvement

Process and IRQ balancing lead to another 20% performance improvement

Power saving features impact performance

Use SSDs as much as possible

When optimizing, profile everything. Flame graphs are great!

On Hacker News

Read more