In the article The Onion Uses Django, And Why It Matters To Us, a lot of interesting points are made about their ambitious infrastructure move from Drupal/PHP to Django/Python: the move wasn't that hard, it just took time and work because of their previous experience moving the A.V. Club website; churn in core framework APIs make it more attractive to move than stay; supporting the structure of older versions of the site is an unsolved problem; the built-in Django admin saved a lot of work; group development is easier with "fewer specialized or hacked together pieces"; they use IRC for distributed development; sphinx for full-text search; nginx is the media server and reverse proxy; haproxy made the launch process a 5 second procedure; capistrano for deployment; clean component separation makes moving easier; Git for version control; ORM with complicated querysets is a performance problem; memcached for caching rendered pages; the CDN checks for updates every 10 minutes; videos, articles, images, 404 pages are all served by a CDN.
But the most surprising point had to be:
And the biggest performance boost of all: caching 404s and sending Cache-Control headers to the CDN on 404. Upwards of 66% of our server time is spent on serving 404s from spiders crawling invalid urls and from urls that exist out in the wild from 6-10 years ago. [Edit: We dropped our outgoing bandwidth by about 66% and our load average on our web server cluster by about 50% after implementing that change].
A minority of our links are from old content that I no longer know the urls for. We redirect everything from 5 years ago and up. Stuff originally published 6-10 years ago could potentially be redirected, but none of it came from a database and was all static HTML in its initial incarnation and redirects weren't maintained before I started working for The Onion.
Spiders make up the vast majority of my 404s. They request URIs that simply are not present in our markup. I can't fix a broken spider and tell it not to request these links that do not even exist, but I still have to serve their 404s.
Our 404 pages were not cached by the CDN. Allowing them to be cached reduced the origin penetration rate substantially enough to amount to a 66% reduction in outgoing bandwidth over uncached 404s.
Edit: This is not to say that our 404s were not cached at our origin. Our precomputed 404 was cached and served out without a database hit on every connection, however this still invokes the regular expression engine for url pattern matching and taxes the machine's network IO resources.
No joke, irony, or sarcasm intended. Most of this traffic is from spiders looking for made up pages so preserving URLs isn't the issue. The issue was reducing the impact of these poisonous spiders and caching the 404 page was the antidote. Even if you haven't been on the web for over decade like The Onion, there may be a big within is easy reach.