
New Relic Architecture - Collecting 20+ Billion Metrics a Day

This is a guest post by Brian Doll, Application Performance Engineer at New Relic.

New Relic’s multitenant, SaaS web application monitoring service collects and persists over 100,000 metrics every second on a sustained basis, while still delivering an average page load time of 1.5 seconds. We believe that good architecture and good tools can help you handle an extremely large amount of data while still providing extremely fast service. Here we'll show you how we do it.

New Relic is Application Performance Management (APM) as a Service
In-app agent instrumentation (bytecode instrumentation, etc.)
Support for 5 programming languages (Ruby, Java, PHP, .NET, Python)
175,000+ app processes monitored globally
10,000+ customers

The Stats

20+ Billion application metrics collected every day
1.7+ Billion web page metrics collected every week
Each "timeslice" metric is about 250 bytes
100k timeslice records inserted every second
7 Billion new rows of data every day
Data collection handled by 9 sharded MySQL servers
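As a quick sanity check, the figures above can be combined into a back-of-envelope estimate. This is only arithmetic on the quoted numbers; the even split across shards is an assumption made for illustration, since real accounts vary a lot in size.

```python
# Figures quoted in the stats above.
ROWS_PER_DAY = 7_000_000_000   # "7 Billion new rows of data every day"
BYTES_PER_TIMESLICE = 250      # "about 250 bytes" per timeslice metric
SHARDS = 9                     # "9 sharded MySQL servers"
SECONDS_PER_DAY = 24 * 60 * 60


def average_rows_per_second(rows_per_day: int) -> float:
    """Average insert rate implied by a daily row count."""
    return rows_per_day / SECONDS_PER_DAY


def raw_bytes_per_day(rows_per_day: int, bytes_per_row: int) -> int:
    """Raw payload volume per day, ignoring indexes and storage overhead."""
    return rows_per_day * bytes_per_row


if __name__ == "__main__":
    rate = average_rows_per_second(ROWS_PER_DAY)
    volume = raw_bytes_per_day(ROWS_PER_DAY, BYTES_PER_TIMESLICE)
    print(f"average insert rate: {rate:,.0f} rows/s")
    print(f"raw volume: {volume / 1e12:.2f} TB/day")
    # Assumption: load is spread evenly across shards.
    print(f"per shard: {rate / SHARDS:,.0f} rows/s")
```

The daily row count works out to an average of roughly 81,000 rows per second, which is in line with the quoted rate of 100k inserts per second.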

Architecture Overview

Language-specific agents (Ruby, Java, PHP, .NET, Python) send application metrics back to New Relic once every minute
The "collector" service digests app metrics and persists them in the right MySQL shard
Real User Monitoring JavaScript snippet sends front-end performance data to the "beacon" service for every single page view
Customers log into http://rpm.newrelic.com/ to view their performance dashboard
The amount of data we collect every day is staggering. Initially all data is captured at full resolution for each metric. Over time we reperiodize the data, going from minute-by-minute to hourly and then finally to daily averages. For our professional accounts, we store daily metric data indefinitely, so customers can see how they've improved over the long haul.
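The reperiodizing step can be sketched as a simple bucketed average. This is a minimal illustration, not the production job: the function name `reperiodize` and the `(epoch_seconds, value)` sample shape are assumptions made for the example.

```python
from collections import defaultdict
from statistics import fmean


def reperiodize(samples, period_seconds):
    """Average (epoch_seconds, value) samples into buckets of period_seconds.

    Returns a list of (bucket_start, average) pairs sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % period_seconds)  # floor to the period boundary
        buckets[bucket_start].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    minute_data = [(0, 1.0), (60, 3.0), (3600, 10.0)]
    hourly = reperiodize(minute_data, 3600)
    daily = reperiodize(hourly, 86400)
    print(hourly)
    print(daily)
```

Note that averaging hourly averages into a daily value is only exact when every hour holds the same number of samples. A real rollup would more likely carry call counts and totals forward so that the daily figure stays correctly weighted.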
Our data strategy is optimized for reading, since our core application constantly needs to access metric data by time series. It's easier to pay a penalty on write to keep the data optimized for faster reads, to ensure our customers can quickly access their performance data any time of the day. Sharding our database helps by distributing customers across multiple servers. Within each server we have individual tables per customer to keep the customer data close together on disk and to keep the total number of rows per table down.
New Relic manages several types of alerts for monitoring systems. Customers can set thresholds on their Apdex score and error rate. New Relic also has an availability monitoring feature, so customers can get alerted on downtime events as short as 30 seconds. We primarily send email alerts, and several customers use our PagerDuty.com integration for more complex on-call rotations with SMS integration.
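For readers unfamiliar with the score, here is a small sketch of a threshold check. The scoring follows the standard Apdex definition: responses at or under the target time T count as satisfied, those up to 4T count as tolerating at half weight, and the rest count as frustrated. The `should_alert` helper and its parameter names are hypothetical and only illustrate a threshold comparison, not how New Relic evaluates alerts.

```python
def apdex(response_times, t):
    """Apdex score for a list of response times (seconds) and target time t."""
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for rt in response_times if rt <= t)
    tolerating = sum(1 for rt in response_times if t < rt <= 4 * t)
    # Frustrated samples (rt > 4t) contribute nothing to the numerator.
    return (satisfied + tolerating / 2) / len(response_times)


def should_alert(score, error_rate, min_apdex, max_error_rate):
    """Hypothetical check: alert when either customer-set threshold is crossed."""
    return score < min_apdex or error_rate > max_error_rate


if __name__ == "__main__":
    score = apdex([0.1, 0.4, 0.6, 1.0, 3.0], t=0.5)
    print(score, should_alert(score, error_rate=0.01, min_apdex=0.7, max_error_rate=0.05))
```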
Let's take a single web transaction from a customer request all the way through the New Relic stack.
1. An end user views a page on Example.com, which uses New Relic to monitor its app performance.
2. The application running Example.com is running with New Relic agents installed (for Ruby, Java, PHP, .NET or Python).
3. Detailed performance metrics are captured for each transaction, including time spent in each component, database queries, external API calls, etc.
4. These back-end metrics are held in the customer's New Relic agent for up to one minute, after which they are sent back to the New Relic data collection service.
5. Meanwhile, embedded in the web page is the New Relic Real-User Monitoring JavaScript code, which tracks the performance of this single customer's experience.
6. When the page is fully rendered within the customer's browser, the New Relic beacon gets a request providing performance metrics on the back-end, network, DOM processing and page rendering times.
7. An engineer working on Example.com logs into New Relic and sees up-to-the-minute application performance metrics as well as the end-user experience for every single customer, including browser and geographic information.
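The buffering in step 4 can be sketched as an in-process aggregator that folds each measurement into a per-metric timeslice and hands the batch off once a minute has passed. Everything named here (`Timeslice`, `MetricBuffer`, the `send` callback, the metric names) is hypothetical; the real agents are language-specific and certainly more involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Timeslice:
    """Aggregated stats for one metric over one harvest window."""
    call_count: int = 0
    total_seconds: float = 0.0
    max_seconds: float = 0.0

    def record(self, duration: float) -> None:
        self.call_count += 1
        self.total_seconds += duration
        self.max_seconds = max(self.max_seconds, duration)


class MetricBuffer:
    """Holds metrics in memory and flushes them once per harvest interval."""

    def __init__(self, send: Callable[[float, Dict[str, Timeslice]], None],
                 harvest_interval: float = 60.0) -> None:
        self._send = send
        self._interval = harvest_interval
        self._slices: Dict[str, Timeslice] = {}
        self._window_start: Optional[float] = None

    def record(self, metric_name: str, duration: float, now: float) -> None:
        if self._window_start is None:
            self._window_start = now
        elif now - self._window_start >= self._interval:
            self.harvest(now)  # window is over: ship it before recording
        self._slices.setdefault(metric_name, Timeslice()).record(duration)

    def harvest(self, now: float) -> None:
        """Send the current window (if non-empty) and start a new one."""
        if self._slices:
            self._send(self._window_start, self._slices)
        self._slices = {}
        self._window_start = now
```

Aggregating in the agent means the collector receives one record per metric per minute instead of one per transaction, which is what keeps the request rate on the collection tier manageable.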

Platform

Web UI

Ruby on Rails
nginx
Linux
2 @ 12 core Intel Nehalem CPUs w/ 48GB RAM

Data Collector and Web Beacon Services

Java
Servlets on Jetty
App metrics collector: 180k+ requests per minute, responding in 3ms
Web metrics beacon service: 200k+ requests per minute, responding in 0.15ms
Sharded MySQL using the Percona build
Linux
9 @ 24 core Intel Nehalem w/ 48GB RAM, SAS attached RAID 5
Bare metal (no virtualization)

Interesting MySQL stats:

New Relic creates a database table per account per hour to hold metric data.
This table strategy is optimized for reads vs. writes
Constantly need to render charts based on one or more metrics for a specific account in a specific time window
The primary key for metrics (metric, agent, timestamp) allows data for a particular metric from a particular agent to be located together on disk
Over time this creates more page splits in InnoDB, and I/O operations increase throughout the hour until a new table is created
New accounts are assigned a specific shard in a round-robin fashion. Since some accounts are larger than others, shards are occasionally pulled out of the assignment queue to more evenly distribute load.
Having so many tables with this amount of data in them makes schema migrations impossible. Instead, "template" tables are used from which new timeslice tables are created. New tables use the new definition while old tables are eventually purged from the system. The application code needs to be aware that multiple table definitions may be active at one time.
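To make the table-per-account-per-hour and template-table ideas concrete, here is a small sketch that builds table names and the statement to create a new hourly table. The naming scheme (`ts_<account>_<YYYYMMDDHH>`) and the template table name are assumptions made for illustration. `CREATE TABLE ... LIKE` is standard MySQL syntax for copying a table definition.

```python
from datetime import datetime, timedelta, timezone

TEMPLATE_TABLE = "timeslice_template"  # hypothetical template table name


def timeslice_table_name(account_id: int, ts: datetime) -> str:
    """Name of the hourly table holding an account's metrics at time ts."""
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    hour = ts.astimezone(timezone.utc).strftime("%Y%m%d%H")
    return f"ts_{int(account_id)}_{hour}"  # int() keeps the name safe to embed


def create_table_sql(account_id: int, ts: datetime) -> str:
    """SQL that creates the hourly table from the current template definition."""
    name = timeslice_table_name(account_id, ts)
    return f"CREATE TABLE IF NOT EXISTS `{name}` LIKE `{TEMPLATE_TABLE}`"


def tables_for_window(account_id: int, start: datetime, end: datetime) -> list:
    """Hourly tables a chart covering [start, end] would need to read."""
    cursor = start.replace(minute=0, second=0, microsecond=0)
    names = []
    while cursor <= end:
        names.append(timeslice_table_name(account_id, cursor))
        cursor += timedelta(hours=1)
    return names
```

Because each new table copies whatever the template looks like at creation time, changing the template changes the schema going forward without touching existing tables. Readers then have to cope with older tables that still use the previous definition.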

Challenges

Data purging: Summarization of metrics and purging granular metric data is an expensive and nearly continuous process
Determining what metrics can be pre-aggregated
Large accounts: Some customers have many applications, while others have a staggering number of servers
MySQL optimization and tuning, including the OS and filesystem
I/O performance: bare metal db servers, table-per-account vs. large tables for read performance
Load balancing shards: Big accounts, small accounts, high-utilization accounts
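The shard-balancing approach described earlier (round-robin assignment, with busy shards temporarily pulled out of the rotation) can be sketched in a few lines. The class and method names are hypothetical, and how a shard is judged "too busy" is left out because the article does not say.

```python
class ShardAssigner:
    """Round-robin shard assignment with the ability to pause shards."""

    def __init__(self, shards):
        self._shards = list(shards)
        self._paused = set()
        self._next = 0

    def pause(self, shard) -> None:
        """Stop assigning new accounts to this shard."""
        self._paused.add(shard)

    def resume(self, shard) -> None:
        """Put the shard back into the rotation."""
        self._paused.discard(shard)

    def assign(self):
        """Return the shard for the next new account, skipping paused ones."""
        for _ in range(len(self._shards)):
            shard = self._shards[self._next]
            self._next = (self._next + 1) % len(self._shards)
            if shard not in self._paused:
                return shard
        raise RuntimeError("no shard is accepting new accounts")


if __name__ == "__main__":
    assigner = ShardAssigner(["db1", "db2", "db3"])
    print([assigner.assign() for _ in range(4)])
    assigner.pause("db2")  # db2 is carrying a very large account
    print([assigner.assign() for _ in range(3)])
```

Pausing only affects where new accounts land; existing accounts stay on their shard, so this evens out load gradually rather than rebalancing data.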

Lessons Learned

New Relic monitors its own services with New Relic (staging monitors production)
Aim for operational efficiency and simplicity at every turn
Stay lean. New Relic has ~30 engineers supporting 10k customers
Trendy != Reliable: There are lots of essential yet boring aspects to high-performing systems that not all trendy solutions have solved for yet.
Use the right tech for the job. The main New Relic web application has always been a Rails app. The data collection tier was originally written in Ruby, but was eventually ported over to Java. The primary driver for this change was performance. This tier currently supports over 180k requests per minute and responds in around 2.5 milliseconds with plenty of headroom to go.

How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers by Lew Cirne
