New Relic Architecture - Collecting 20+ Billion Metrics a Day

This is a guest post by Brian Doll, Application Performance Engineer at New Relic.

New Relic’s multitenant, SaaS web application monitoring service collects and persists over 100,000 metrics every second on a sustained basis, while still delivering an average page load time of 1.5 seconds.  We believe that good architecture and good tools can help you handle an extremely large amount of data while still providing extremely fast service.  Here we'll show you how we do it.

  • New Relic is Application Performance Management (APM) as a Service
  • In-app agent instrumentation (bytecode instrumentation, etc.)
  • Support for 5 programming languages (Ruby, Java, PHP, .NET, Python)
  • 175,000+ app processes monitored globally
  • 10,000+ customers

The Stats

  • 20+ Billion application metrics collected every day
  • 1.7+ Billion web page metrics collected every week
  • Each "timeslice" metric is about 250 bytes
  • 100k timeslice records inserted every second
  • 7 Billion new rows of data every day
  • Data collection handled by 9 sharded MySQL servers
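
To put these figures in perspective, here is a rough back-of-envelope calculation using only the numbers listed above. It assumes rows are spread evenly across the shards and counts raw timeslice payload only (no index or storage-engine overhead), so treat the results as orders of magnitude rather than measurements.

```python
# Back-of-envelope figures derived from the stats listed above.
ROWS_PER_DAY = 7_000_000_000      # "7 Billion new rows of data every day"
BYTES_PER_TIMESLICE = 250         # "about 250 bytes" per timeslice metric
SHARDS = 9                        # "9 sharded MySQL servers"
SECONDS_PER_DAY = 24 * 60 * 60

avg_rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY
raw_bytes_per_day = ROWS_PER_DAY * BYTES_PER_TIMESLICE
# Assumption: an even spread across shards, which real accounts will not give you.
rows_per_shard_per_day = ROWS_PER_DAY / SHARDS

print(f"average insert rate: {avg_rows_per_second:,.0f} rows/s")
print(f"raw payload:         {raw_bytes_per_day / 1e12:.2f} TB/day")
print(f"per shard:           {rows_per_shard_per_day / 1e6:,.0f} M rows/day")
```

The daily row count averages out to roughly 81,000 rows per second, which sits a little below the 100k-per-second insert rate quoted above.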

Architecture Overview

  • Language-specific agents (Ruby, Java, PHP, .NET, Python) send application metrics back to New Relic once every minute
  • The "collector" service digests app metrics and persists them in the right MySQL shard
  • The Real User Monitoring JavaScript snippet sends front-end performance data to the "beacon" service for every single page view
  • Customers log into http://rpm.newrelic.com/ to view their performance dashboard
  • The amount of data we collect every day is staggering.  Initially all data is captured at full resolution for each metric.  Over time we reperiodize the data, going from minute-by-minute to hourly and then finally to daily averages.  For our professional accounts, we store daily metric data indefinitely, so customers can see how they've improved over the long haul.
  • Our data strategy is optimized for reading, since our core application constantly needs to access metric data by time series.  We would rather pay a penalty on write to keep the data optimized for faster reads, so that our customers can quickly access their performance data any time of the day.  Sharding our database helps by distributing customers across multiple servers.  Within each server we have individual tables per customer to keep the customer data close together on disk and to keep the total number of rows per table down.
  • New Relic manages several types of alerts for monitoring systems.  Customers can set thresholds on their Apdex score and error rate.  New Relic also has an availability monitoring feature, so customers can get alerted on downtime events as short as 30 seconds.  We primarily send email alerts, and several customers use our PagerDuty.com integration for more complex on-call rotations with SMS integration.
  • Let's take a single web transaction from a customer request all the way through the New Relic stack.
    1. An end user views a page on Example.com, which uses New Relic to monitor its app performance
    2. The application running Example.com is running with New Relic agents installed (for Ruby, Java, PHP, .NET or Python)
    3. Detailed performance metrics are captured for each transaction, including time spent in each component, database queries, external API calls, etc.
    4. These back-end metrics are held in the customer's New Relic agent for up to one minute, after which they are sent back to the New Relic data collection service
    5. Meanwhile, embedded in the web page is the New Relic Real-User Monitoring JavaScript code, which tracks the performance of this single customer's experience
    6. When the page is fully rendered within the customer's browser, the New Relic beacon gets a request providing performance metrics on the back-end, network, DOM processing and page rendering times.
    7. An engineer working on Example.com logs into New Relic and sees up-to-the-minute application performance metrics as well as the end-user experience for every single customer, including browser and geographic information.
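
Steps 3 and 4 above (capture timings in-process, then ship them about once a minute) can be sketched roughly as follows. This is a minimal illustration, not New Relic's agent code: the `Timeslice` fields, the `HarvestBuffer` class, the `send` callback standing in for the call to the collector, and the metric names in the usage note are all hypothetical.

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Timeslice:
    """Aggregate for one metric over one harvest interval (hypothetical shape)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = 0.0

    def record(self, duration: float) -> None:
        self.count += 1
        self.total += duration
        self.minimum = min(self.minimum, duration)
        self.maximum = max(self.maximum, duration)


class HarvestBuffer:
    """Buffers timings in memory and hands them off once per interval."""

    def __init__(self, send, interval_seconds=60.0, clock=time.monotonic):
        self._send = send                  # stand-in for the request to the collector
        self._interval = interval_seconds  # the article's "once every minute"
        self._clock = clock                # injectable so the sketch is testable
        self._started = clock()
        self._slices = defaultdict(Timeslice)

    def record(self, metric_name: str, duration: float) -> None:
        self._slices[metric_name].record(duration)
        if self._clock() - self._started >= self._interval:
            self.harvest()

    def harvest(self) -> None:
        # Swap in a fresh buffer first so new timings never mix with the payload.
        payload, self._slices = dict(self._slices), defaultdict(Timeslice)
        self._started = self._clock()
        if payload:
            self._send(payload)
```

For example, `HarvestBuffer(send=print)` followed by calls such as `record("Controller/home", 0.120)` accumulates one `Timeslice` per metric name and passes the whole batch to `send` once the interval has elapsed. Aggregating before sending is what keeps the per-minute payload small regardless of how many transactions the app served.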

Platform

Web UI

  • Ruby on Rails
  • nginx
  • Linux
  • 2 @ 12 core Intel Nehalem CPUs w/ 48GB RAM

Data Collector and Web Beacon Services

  • Java
  • Servlets on Jetty
  • App metrics collector: 180k+ requests per minute, responding in 3ms
  • Web metrics beacon service: 200k+ requests per minute, responding in 0.15ms
  • Sharded MySQL using the Percona build
  • Linux
  • 9 @ 24 core Intel Nehalem w/ 48GB RAM, SAS-attached RAID 5
  • Bare metal (no virtualization)

Interesting MySQL stats:

  • New Relic creates a database table per account per hour to hold metric data.
  • This table strategy is optimized for reads rather than writes
  • The application constantly needs to render charts based on one or more metrics for a specific account in a specific time window
  • The primary key for metrics (metric, agent, timestamp) allows data for a particular metric from a particular agent to be located together on disk
  • Over the course of each hour this causes more page splits in InnoDB, so I/O operations increase throughout the hour until a new table is created
  • New accounts are assigned a specific shard in a round-robin fashion. Since some accounts are larger than others, shards are occasionally pulled out of the assignment queue to more evenly distribute load.
  • Having so many tables with this amount of data in them makes schema migrations impossible. Instead, "template" tables are used from which new timeslice tables are created.  New tables use the new definition while old tables are eventually purged from the system.  The application code needs to be aware that multiple table definitions may be active at one time.
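
The shard assignment and table-per-account-per-hour ideas above can be sketched as follows. This is a hedged illustration only: the `ShardAssigner` class, the shard names, the `timeslices_<account>_<hour>` naming scheme, and the `timeslice_template` table name are assumptions made for the example, not New Relic's actual schema. The one piece taken from MySQL itself is the `CREATE TABLE ... LIKE ...` statement, which clones an existing table's definition.

```python
from datetime import datetime, timezone
from itertools import cycle


class ShardAssigner:
    """Round-robin assignment of new accounts, with shards that can be paused."""

    def __init__(self, shard_names):
        self._shards = list(shard_names)
        self._paused = set()
        self._ring = cycle(self._shards)

    def pause(self, shard):
        """Pull a heavily loaded shard out of the assignment rotation."""
        self._paused.add(shard)

    def resume(self, shard):
        self._paused.discard(shard)

    def assign(self):
        # Try each shard at most once per call, skipping paused ones.
        for _ in range(len(self._shards)):
            shard = next(self._ring)
            if shard not in self._paused:
                return shard
        raise RuntimeError("no shard is accepting new accounts")


def timeslice_table(account_id: int, when: datetime) -> str:
    """Hypothetical naming scheme: one table per account per UTC hour.

    `when` should be a timezone-aware datetime.
    """
    hour = when.astimezone(timezone.utc).strftime("%Y%m%d%H")
    return f"timeslices_{account_id}_{hour}"


def create_table_sql(account_id: int, when: datetime,
                     template: str = "timeslice_template") -> str:
    """MySQL statement that clones whatever the template looks like right now."""
    table = timeslice_table(account_id, when)
    return f"CREATE TABLE IF NOT EXISTS `{table}` LIKE `{template}`"
```

Because each new hourly table is cloned from the template at creation time, changing the template changes only future tables. That is why readers of the data have to tolerate several table definitions being live at once.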

Challenges

  • Data purging: Summarizing metrics and purging granular metric data is an expensive and nearly continuous process
  • Determining what metrics can be pre-aggregated
  • Large accounts: Some customers have many applications, while others have a staggering number of servers
  • MySQL optimization and tuning, including the OS and filesystem
  • I/O performance: bare metal db servers, table-per-account vs. large tables for read performance
  • Load balancing shards: Big accounts, small accounts, high-utilization accounts
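
The summarization step behind the first challenge (rolling minute-level timeslices up to hourly and then daily data) can be sketched like this. The row shape `(metric, agent, epoch_seconds, call_count, total_time)` is an assumption for illustration. The sketch keeps counts and totals rather than averages, so that rolling hourly rows up to daily rows still gives a correctly weighted average.

```python
from collections import defaultdict


def reperiodize(rows, period_seconds):
    """Roll timeslice rows up into coarser time buckets.

    rows: iterable of (metric, agent, epoch_seconds, call_count, total_time).
    Returns sorted rows of the same shape, one per (metric, agent, bucket start).
    """
    buckets = defaultdict(lambda: [0, 0.0])
    for metric, agent, ts, count, total in rows:
        bucket_start = ts - ts % period_seconds  # floor to the period boundary
        acc = buckets[(metric, agent, bucket_start)]
        acc[0] += count
        acc[1] += total
    return sorted(
        (metric, agent, start, count, total)
        for (metric, agent, start), (count, total) in buckets.items()
    )


def average(row):
    """Mean time per call for a row; 0.0 if nothing was recorded."""
    _, _, _, count, total = row
    return total / count if count else 0.0


# Usage: minute rows -> hourly rows -> daily rows.
minute_rows = [
    ("WebTransaction", 1, 3600, 10, 2.0),
    ("WebTransaction", 1, 3660, 30, 3.0),
    ("WebTransaction", 1, 7200, 5, 1.0),
]
hourly = reperiodize(minute_rows, 3600)
daily = reperiodize(hourly, 86400)
```

Once the coarser rows are written, the finer-grained source rows can be purged. With a table per account per hour, that purge can in principle be a table drop rather than a row-by-row delete, though the article does not say how New Relic implements it.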

Lessons Learned

  • New Relic monitors its own services with New Relic (staging monitors production)
  • Aim for operational efficiency and simplicity at every turn
  • Stay lean. New Relic has ~30 engineers supporting 10k customers
  • Trendy != Reliable: There are lots of essential yet boring aspects to high-performing systems that not all trendy solutions have solved for yet.
  • Use the right tech for the job. The main New Relic web application has always been a Rails app.  The data collection tier was originally written in Ruby, but was eventually ported over to Java.  The primary driver for this change was performance.  This tier currently supports over 180k requests per minute and responds in around 2.5 milliseconds, with plenty of headroom to spare.