Performance

Serving 250M quotes/day at CNBC.com with aiCache

Todd Hoff

16 Apr 2009 — 4 min read

As traffic to cnbc.com continued to grow, we found ourselves in an all-too-familiar situation where one feels that a BIG change in how things are done was in order, the status-quo was a road to nowhere. The spending on HW, amount of space and power required to host additional servers, less-than-stellar response times, having to resort to frequent "micro"-caching and similar tricks to try to improve code performance - all of these were surfacing in plain sight, hard to ignore.

While code base could clearly be improved, the limited Dev resources and having to innovate to stay competitive always limits ability to go about refactoring. So how can one go about addressing performance and other needs without a full blown effort across the entire team ? For us, the answer was aiCache - a Web caching and application acceleration product (aicache.com).

The idea behind caching is simple - handle the requests before they ever hit your regular Apache<->JK<->Java<->Database response generation train (we're mostly a Java shop). Of course, it could be Apache-PHP-Database or some other backend system, with byte-code and/or DB-result-set caching. In our case we have many more caching sub-systems, aimed at speeding up access to stock and company-related information. Developing for such micro-caching and having to maintain systems with such micro-caching sprinkled throughout is not an easy task. Nor is troubleshooting. But we digress...

aiCache takes this basic idea of caching and front-ending the user traffic to your Web environment to a whole new level. I don't believe any of aiCache's features are revolutionary in nature, rather it is the sheer number of features it offers that seems to address our every imaginable need.

We've also discovered that aiCache provides virtually unlimited performance, combined with incredible configuration flexibility and support for real-time reporting and alerting.

In interest of space, here're some quick facts about our experience with the product, in no particular order:

· Runs on any Linux distro, our standard happens to be RedHat 5, 64bit on HP DL360G5

· The responses are cached in the RAM, not on disk. No disk IO, ever (well, outside of access and error logging, but even that is configurable). No latency for cached responses - stress tests show TTFB at 0 ms. Extremely low resource utilization - aiCache servers serving in excess of 2000 req/sec are reported to be 99% idle ! Being not a trusting type, I verified the vendor's claim and stress tested these to about 25,000 req/sec per server - with load averages of about 2 (!).

· We cache both GET and POST results, with query and parameter busting (selectively removing those semi-random parameters that complicate caching)

· For user comments, we use response-driven expiration to refresh comment threads when a new comment is posted.

· Had a chance to use site-fallback feature (where aiCache serves cached responses and shields origin servers from any traffic) to expedite service recovery

· Used origin-server tagging a few times to get us out of code-deployment-gone-bad situations.

· We average about 80% caching ratios across about 10 different sub-domains, with some as high as 97% cache-hit-ratio. Have already downsized a number of production Web farms, having offloaded so much traffic from origin server infrastructure, we see much lower resource utilization across Web, DB and other backend systems

· Keynote reports significant improvement in response times - about 30%.

· Everyone just loves real-time traffic reporting, this is a standard window on many a desktop now. You get to see req/sec, response time, number of good/bad origin servers, client and origin server connections, input and output BW and so on - all reported per cached sub-domain. Any of these can be alerted on.

· We have wired up Nagios to read/chart some of aiCache extensive statistics via SNMP, pretty much everything imaginable is available as an OID.

· Their CLI interface is something I like a lot too: you see the inventory of responses, can write out any response, expire responses, report responses sorted by request, size, fill time, refreshes and so on, in real time, no log crunching is required. Some commands are cluster-aware, so you only execute them on one node and they are applied across.

Again, the list above is a small sample of product features that we use, there're many more that we use or explore using. Their admin guide weighs in at 140 pages (!) - and it is all hard-core technical stuff that I happen to enjoy.

Some details about our network setup . We use F5 load balancers and have configured the virtual IPs to have both aiCache servers _and origin server enabled at the same time. Using F5's VIP priority feature, we direct all of the traffic to aiCache servers, as long as at least one is available, but have ability to automatically, or on demand, failover all of the traffic to origin servers.

We also use a well known CDN to serve auxiliary content - Javascript, CSS and imagery.

I stumbled upon the product following a Wikipedia link, requested a trial download and was up and running in no time. It probably helped that I have experience with other caching products - going back to circa 2000, using Novell ICS. But it all mostly boils down to knowing what URLs can be cached and for how long.

And lastly - when you want stress test aiCache, make sure to hit it directly, right by server's IP - otherwise you will most likely melt down one or more of other network infrastructure components !

A bit about myself: an EE major, have been working with Internet infrastructures since 1992 - from an ISP in Russia (uucp over MNP-5 2400b modem seemed blazing fast back then!) to designing and running infrastructures of some of the busier sites for CNBC and NBC - cnbc.com, NBC's Olympics website and others.

Rashid Karimov, Platform, CNBC.com

Serving 250M quotes/day at CNBC.com with aiCache

Todd Hoff

Read more

Kafka 101

Capturing A Billion Emo(j)i-ons

Brief History of Scaling Uber

Behind AWS S3’s Massive Scale