advertise
« Building your own Facebook Realtime Analytics System | Main | Stuff The Internet Says On Scalability For July 15, 2011 »
Monday
Jul182011

New Relic Architecture - Collecting 20+ Billion Metrics a Day

This is a guest post by Brian Doll, Application Performance Engineer at New Relic.

New Relic’s multitenant, SaaS web application monitoring service collects and persists over 100,000 metrics every second on a sustained basis, while still delivering an average page load time of 1.5 seconds.  We believe that good architecture and good tools can help you handle an extremely large amount of data while still providing extremely fast service.  Here we'll show you how we do it.

  •  New Relic is Application Performance Management (APM) as a Service
  •  In-app agent instrumentation (bytecode instrumentation, etc.)
  •  Support for 5 programming languages (Ruby, Java, PHP, .NET, Python)
  •  175,000+ app processes monitored globally
  •  10,000+ customers

The Stats

  •  20+ Billion application metrics collected every day
  •  1.7+ Billion web page metrics collected every week
  •  Each "timeslice" metric is about 250 bytes
  •  100k timeslice records inserted every second
  •  7 Billion new rows of data every day
  •  Data collection handled by 9 sharded MySQL servers

Architecture Overview

  • Language-specific agents (Ruby, Java, PHP. .NET, Python) send application metrics back to New Relic once every minute
  • The "collector" service digests app metrics and persists them in the right MySQL shard
  • Real User Monitoring javascript snippet sends front-end performance data to the "beacon" service for every single page view
  • Customers log into http://rpm.newrelic.com/ to view their performance dashboard
  • The amount of data we collect every day is staggering.  Initially all data is captured at full resolution for each metric.  Over time we reperiodize the data, going from minute-by-minute to hourly and then finally to daily averages.  For our professional accounts, we store daily metric data indefinitely, so customers can see how they've improved over the long haul.
  • Our data strategy is optimized for reading, since our core application is constantly needing to access metric data by time series.  It's easier to pay a penalty on write to keep the data optimized for faster reads, to ensure our customers can quickly access their performance data any time of the day. Sharding our database helps by distributing customers across multiple servers.  Within each server we have individual tables per customer to keep the customer data close together on disk and to keep the total number of rows per table down.
  • New Relic manages several types of alerts for monitoring systems.  Customers can set thresholds on their APDEX score and error rate.  New Relic also has an availability monitoring feature, so customers can get alerted on downtime events as short as 30 seconds.  We send email alerts primarily, with several customers using our PagerDuty.com integration for more complex on-call rotations with SMS integration.
  • Let's take a single web transaction from a customer request all the way through the New Relic stack.
    1. An end user views a page on Example.com, who uses New Relic to monitor their app performanc
    2. The application running Example.com is running with New Relic agents installed (for Ruby, Java, PHP, .NET or Python
    3. Detailed performance metrics are captured for each transaction, including time spent in each component, database queries, external API calls, etc
    4. These back-end metrics are persisted in the customer's New Relic agent for up to one minute, where they are then sent back to the New Relic data collection service
    5. Meanwhile, embedded in the web page is the New Relic Real-User Monitoring JavaScript code, which tracks the performance of this single customers experience
    6. When the page is fully rendered within the customer's browser, the New Relic beacon gets a request providing performance metrics on the back-end, network, DOM processing and page rendering times.
    7. An engineer working on Example.com logs into New Relic and sees up-to-the-minute application performance metrics as well as the end-user experience for every single customer, including browser and geographic information.

Platform

Web UI

  • Ruby on Rails
  • nginx
  • Linux
  • 2 @ 12 core Intel Nehalem CPUs w/ 48Gb RAM

Data Collector and Web Beacon Services

  •  Java
  •  Servlets on Jetty
  •  App metrics collector: 180k+ requests per minute, responding in 3ms
  •  Web metrics beacon service: 200k+ requests per minute, responding in 0.15ms
  •  Sharded MySQL using the Percona build
  •  Linux
  •  9 @ 24 core Intel Nehalem w/ 48GB RAM, SAS attached RAID 5
  •  Bare metal (no virtualization)

Interesting MySQL stats:

  • New Relic creates a database table per account per hour to hold metric data.
  • This table strategy is optimized for reads vs. writes
  • Constantly need to render charts based on one or more metrics for a specific account in a specific time window
  • The primary key for metrics (metric, agent, timestamp) allows data for a particular metric from a particular agent to be located together on disk
  • Over time this creates more page splits in innodb and I/O ops increase throughout the hour, when a new table is created
  • New accounts are assigned a specific shard in a round-robin fashion. Since some accounts are larger than others, shards are occasionally pulled out of the assignment queue to more evenly distribute load.
  • Having so many tables with this amount of data in them makes schema migrations impossible. Instead, "template" tables are used from which new timeslice tables are created.  New tables use the new definition while old tables are eventually purged from the system.  The application code needs to be aware that multiple table definitions may be active at one time.

Challenges

  • Data purging: Summarization of metrics and purging granular metric data is an expensive and nearly continuous process
  • Determining what metrics can be pre-aggregated
  • Large accounts: Some customers have many applications, while others have a staggering number of servers
  • MySQL optimization and tuning, including the OS and filesystem
  • I/O performance: bare metal db servers, table-per-account vs. large tables for read performance
  • Load balancing shards: Big accounts, small accounts, high-utilization accounts

Lessons Learned

  • New Relic monitors its own services with New Relic (staging monitors production)
  • Aim for operational efficiency and simplicity at every turn
  • Stay lean. New Relic has ~30 engineers supporting 10k customers
  • Trendy != Reliable: There are lots of essential yet boring aspects to high-performing systems that not all trendy solutions have solved for yet.
  • Use the right tech for the job. The main New Relic web application has always been a Rails app.  The data collection tier was originally written in Ruby, but was eventually ported over to Java.  The primary driver for this change was performance.  This tier currently supports over 180k requests per minute and responds in around 2.5 milliseconds with plenty of headroom to go.  

Related Articles

Reader Comments (6)

>>Detailed performance metrics are captured for each transaction, including time spent in each component, database queries, external API calls, etc

What is the overhead of such instrumentation? In terms of latency?

July 19, 2011 | Unregistered CommenterHrish

Thanks for detailed post

Couple of questions

1. Do you do write-behind? Any details on that. which tool/technology
2. Any more details on "Load balancing shards " would be helpful
3. Why chose Java over Rails...specific performance reasons.

Thanks.

July 19, 2011 | Unregistered CommenterSubhash

Within each server we have individual tables per customer to keep the customer data close together on disk and to keep the total number of rows per table down.

Perhaps they don't know how file systems work? Unless you pre-allocate the block it will be written randomly on the disk as the file grows, just like everything else. They seem to think that by putting things into their own tables with the "innodb_file_per_table" option that doesn't guarantee that the file will be clumped on disk.

July 19, 2011 | Unregistered Commenterjstephens

@Hrish: "What is the overhead of such instrumentation? In terms of latency?"

The overhead of our instrumentation is something we continually monitor and ensure that it stays as low as possible. The overhead will vary application to application, but is often in the 5% range. This overhead pales in comparison to the opportunity cost of not having visibility into production systems. I've seen many companies improve their response times dramatically within just a few days of using New Relic, so that overhead more than pays for itself.


@Subhash:
1. Do you do write-behind? Any details on that. which tool/technology

No, we don't do write-behind.


2. Any more details on "Load balancing shards " would be helpful

We fit into a classic sharding pattern, since customer metrics are unique to them and we never need to refer to metrics across accounts. Ideally, we'd like to distribute the load across each of our shards. Load in this case refers both to data volume (how much data is each shard receiving in a given period) and query volume (how actively are those customers accessing their metrics). We allow an unlimited number of users per account, so some accounts have many users accessing their performance data at once. Together, we look at these metrics along with system metrics like CPU and memory, to determine how even the distribution of load is across our shards.

We assign new accounts to a specific shard in a round-robin fashion. This sounds incredible simple, and it is. Premature optimization here would likely require a lot of effort to manage with very little benefit. You just can't tell how "busy" a new account is going to be when they sign up. From time to time, we'll notice that a particular shard is busier than the rest. This may be due to a very large new account, or increased utilization on existing accounts. To help balance the shards out, we just remove that shard from the assignment queue, so it doesn't get any new accounts for a while. Since we have new accounts signing up all the time, the rest of the shards will catch up and we'll add it back in.

I've talked to several people who run shards in production for companies with similar data requirements, and it seems that this approach is fairly common. As they say, do the simplest thing that could work :)


3. Why chose Java over Rails...specific performance reasons.

The service endpoints for data collection were ported from Ruby to Java. These services are very robust and don't change that often. They need to take metric data in from hundreds of thousands of application instances and put them in the right database tables on the right shards. Besides durability, the most important characteristic of these servies is raw speed. With current ruby implementations, it is just not possible to compete with Java servlets for speed and efficiency. Our Real User Monitoring beacon service, which is deployed as a Java servlet, has a response time of 0.15ms. Fifteen one-hundredths of a millisecond! And this is at 280k+ requests per minute.

July 19, 2011 | Unregistered CommenterBrian Doll

You put special accent that db servers are bare metal. Does that mean other stuff is virtualized?
I'm surprised that you do run your own servers. I always thought you were one of those early cloud adopters. :)

Are you running in a single datacenter? How do you handle high availability?

July 19, 2011 | Registered Commentermxx

"[overhead] is often in the 5% range"

this has no meaning whatsoever unless one states the degree of instrumentation performed and the latency of the major components instrumented i.e. a incredibly slow rails app w/ even slower db access.

In comparison with unit costs of other products NewRelic is close to 10,000-100,000 times slower.
http://williamlouth.wordpress.com/2011/03/17/which-ruby-vm-consider-monitoring/

By the way I would curious to know how multiple SQL statements like in this screenshot below of NewRelic can be packaged up in 250 bytes.
http://williamlouth.wordpress.com/2010/03/02/does-transaction-tracing-scale-analysis/

I think you should also point out that the number of pages measured is not the same as the number of inbound calls to your service considering you batch up high level metric aggregates in 1 minute sample windows before sending.

If you are going to be in the APM business can you at least try to be accurate and more transparent on what a number actually means and relates too.

September 7, 2011 | Unregistered CommenterWilliam Louth

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>