
Google Finds NUMA Up to 20% Slower for Gmail and Websearch

When you have a large population of servers you have both the opportunity and the incentive to perform interesting studies. In Optimizing Google’s Warehouse Scale Computers: The NUMA Experience, authors from Google and the University of California conducted just such a study, looking at how jobs run on clusters of machines using a NUMA architecture. Since NUMA is common on server-class machines, it's a topic of general interest for anyone looking to maximize machine utilization across clusters.

Some of the results are surprising:

  • The methodology of how to attribute such fine performance variations to NUMA effects within such a complex system is perhaps more interesting than the results themselves. Well worth reading just for that story.
  • The performance swing due to NUMA is up to 15% on AMD Barcelona for Gmail backend and 20% on Intel Westmere for Web-search frontend. 
  • Memory locality is not always king. Because of the interaction between NUMA and cache sharing/contention, different applications prefer different profiles of cache sharing, remote access, and cache contention. More remote memory accesses can significantly outperform more local accesses (by up to 12%). Bigtable, for example, performs better with 100% remote accesses than with 50% remote accesses. This means simple NUMA-aware scheduling can yield sizable benefits in production, though the optimal mapping is highly dependent on the applications and their co-runners.
  • Gmail’s backend server experienced around a 4x range in average request latency during a week’s time. 
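To make the NUMA-aware scheduling idea above concrete, here is a minimal Linux sketch of pinning a process to one node's CPUs so its threads (and, via first-touch allocation, its memory) stay on one socket. The node-to-CPU mapping is a hypothetical stand-in; real code would read it from /sys/devices/system/node or libnuma rather than hard-coding it.

```python
import os

# Hypothetical node-to-CPU mapping for illustration only; on a real machine
# read it from /sys/devices/system/node/node*/cpulist or via libnuma.
NODE_CPUS = {0: {0}, 1: {1}}

def pin_to_node(node):
    """Restrict the current process (and all its threads) to one node's CPUs,
    so memory allocated by first-touch tends to stay local to that socket."""
    os.sched_setaffinity(0, NODE_CPUS[node])  # 0 = the calling process

pin_to_node(0)
print(sorted(os.sched_getaffinity(0)))  # affinity is now node 0's CPUs only
```

This is the "cluster for locality" half of the tradeoff; the paper's point is that for some applications you would deliberately do the opposite and spread threads across sockets to relieve cache contention.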

What did Google find?

Both our production WSC analysis and load-test experiments show that the performance impact of NUMA is significant for large scale web-service applications on modern multicore servers. In our study, the performance swing due to NUMA is up to 15% on AMD Barcelona for Gmail backend and 20% on Intel Westmere for Web-search frontend. Using the load-test, we also observed that on multicore multisocket machines, there is often a tradeoff between optimizing NUMA performance by clustering threads close to the memory nodes to increase the amount of local accesses and optimizing for cache performance by spreading threads to reduce the cache contention. For example, bigtable benefits from cache sharing and would prefer 100 % remote accesses to 50% remote. Search-frontend prefers spreading the threads to multiple caches to reduce cache contention and thus also prefers 100 % remote accesses to 50% remote.

In conclusion, surprisingly, some running scenarios with more remote memory accesses may outperform scenarios with more local accesses due to an increased amount of cache contention for the latter, especially when 100% local accesses cannot be guaranteed. This tradeoff between NUMA and cache sharing/contention varies for different applications and when the application’s corunner changes. The tradeoff also depends on the remote access penalty and the impact of cache contention on a given machine platform. On our Intel Westmere, more often, NUMA has a more significant impact than cache contention. This may be due to the fact that this platform has a fairly large shared cache while the remote access latency is as large as 1.73x of local latency. 

In this work, we show that remote memory accesses have a significant performance impact on NUMA machines for these applications. And different from UMA machines, remote access latency is often a more dominating impact than cache contention on NUMA machines. This indicates that a simple NUMA-aware scheduling can already yield sizable benefits in production for those platforms.
Based on our findings, NUMA-aware thread mapping is implemented and in the deployment process in our production WSCs. Considering both contention and NUMA may provide further performance benefit. However, the optimal mapping is highly dependent on the applications and their co-runners. This indicates additional benefit for adaptive thread mapping at the cost of added implementation complexity.
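The tradeoff the authors describe — cluster threads on one socket when an application benefits from cache sharing, spread them across sockets when it suffers from cache contention — can be sketched as a toy placement policy. The application profiles below are invented for illustration, loosely following the Bigtable and search-frontend behavior described in the quote; they are not values from the paper.

```python
# Toy NUMA-aware placement policy: decide whether an application's threads
# should be clustered on one socket (maximizing cache sharing and local
# memory access) or spread across sockets (minimizing cache contention).
# Profile values are illustrative assumptions, not measurements.

PROFILES = {
    "bigtable":        {"benefits_from_sharing": True,  "contention_sensitive": False},
    "search-frontend": {"benefits_from_sharing": False, "contention_sensitive": True},
    "gmail-backend":   {"benefits_from_sharing": False, "contention_sensitive": False},
}

def placement(app):
    p = PROFILES[app]
    if p["benefits_from_sharing"]:
        return "cluster"        # share a cache, even at the cost of remote accesses
    if p["contention_sensitive"]:
        return "spread"         # trade memory locality for less cache contention
    return "cluster-local"      # default: keep threads near their memory node

for app in PROFILES:
    print(app, "->", placement(app))
```

A real implementation would derive these profiles from measurement (e.g., CPI and cache-miss counters) and would also account for co-runners, which is the adaptive mapping the authors say adds complexity.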



Reader Comments (4)

I think you should read the article more closely.

The title is wrong -- Gmail and Web search frontends actually saw 15% and 20% improvements, respectively (lower CPI is better).

The subtlety is that there were some workloads that saw performance regression from NUMA (such as BigTable accesses) -- but it was not the workloads you mention in the title.

May 30, 2013 | Unregistered CommenterChad Walters

It looks to me like they are saying NUMA is 15-20% faster, not slower.

May 31, 2013 | Unregistered CommenterAlexey Romanov

The paper states Gmail is 15-20% slower with remote accesses in NUMA. Remote accesses being to memory situated local to another CPU socket.

The summary of the paper indicates that the impact of NUMA is application sensitive as BigTable actually improves with more remote accesses.

"Performance swing" means variation like a swinging pendulum it does not imply an improvement or regression in peak performance. A higher variation however means resources are more difficult to predict.

June 1, 2013 | Unregistered CommenterSteve-o

That was, for me, the interesting part, Steve: that you can take a look at your application portfolio and optimize returns via "simple NUMA-aware scheduling." An interesting finding when the usual thinking is locality, locality, locality.

June 2, 2013 | Registered CommenterTodd Hoff
