A Short Note on How the Wayback Machine Stores More Pages than Stars in the Milky Way

How does the Wayback Machine work? With over 400 billion webpages now indexed, allowing the Internet to be browsed all the way back to 1996, it's an even more compelling question. I've looked several times but have never found a really good answer.

Here's some information from a thread on Hacker News. It starts with mmagin, a former Archive employee:

I can't speak to their current infrastructure (though more of it is open source now: http://archive-access.sourceforge.net/projects/wayback/), but as far as the wayback machine goes, there was no SQL database anywhere in it. For the purposes of making the wayback machine go:
  • Archived data was in ARC file format (predecessor to http://en.wikipedia.org/wiki/Web_ARChive), which is essentially a concatenation of separately gzipped records. That is, you can seek to a particular offset and start decompressing a record, so you could get at any archived web page with a triple (server, filename, file-offset). This let the data be spread across a lot of commodity-grade machines. (A sketch of reading a record this way follows the list.)
  • A sorted index of all the content was built that would map (url) to a list of capture times, or (url, time) to (filename, file-offset). It was implemented as a sorted text file (sorted first on url, then on time) and sharded across many machines by simply splitting it into N roughly equal pieces. Binary search across a sorted text file is surprisingly fast, in part because the first few points you probe in the file stay cached in RAM, since you hit them frequently. (See the binary-search sketch after this list.)
  • (Here's where I'm a little rusty.) The web frontend would get a request and query the appropriate index machine. Then it would use a little mechanism (network broadcast, maybe?) to find out which server that (unique) filename was on, and it would request the particular record from that server.
  • (Edit: FYI, my knowledge is 5 years old now. I know they've done some things to keep the index more current than they did back then.)

At the very least, I'd think about getting your blobs out of MySQL and putting them in the filesystem. Filesystems are good at this stuff. You can do something as simple as using a SHA-1 hash of the content as the filename and then, depending on your filesystem's performance characteristics, keeping a couple of levels in the tree you store them in. E.g. da39a3ee5e6b4b0d3255bfef95601890afd80709 goes into the directory da/39/. Then you stick da39a3ee5e6b4b0d3255bfef95601890afd80709 into the 'pointer' field in your table in place of the actual data. Obviously this design assumes the content of _that_ file doesn't change; if you want to change the data for that row in the table, you have to write a new file in the filesystem and update the 'pointer'. (This layout is sketched below too.)
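
To make the first point concrete, here's a minimal Python sketch of pulling one record out of a file of concatenated gzip members, given a byte offset from the index. The layout is as described above; the function name and buffer size are my own, not Archive code.

```python
import zlib

def read_arc_record(path: str, offset: int) -> bytes:
    """Decompress the single gzip member that starts at `offset`.

    Because each ARC record is gzipped separately, you can seek
    straight to a record and inflate it without touching the rest
    of the (possibly huge) file.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip framing
        out = bytearray()
        while not d.eof:                # stop at the end of this member
            chunk = f.read(64 * 1024)
            if not chunk:
                break                   # truncated record
            out.extend(d.decompress(chunk))
        return bytes(out)
```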
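
The second point, binary search across a sorted text file, works by seeking to the middle of the file, discarding the partial line you land in, and comparing the next full line. A generic sketch, assuming bytewise-sorted lines:

```python
import os

def first_line_ge(path: str, key: bytes) -> bytes:
    """Return the first line >= key in a lexically sorted text file.

    O(log filesize) seeks and no auxiliary index; the early probes
    hit the same few pages over and over, which is why the OS cache
    makes this so fast in practice.
    """
    def line_after(f, pos: int) -> bytes:
        f.seek(pos)
        if pos:
            f.readline()              # discard the partial line at pos
        return f.readline()

    with open(path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_after(f, mid)
            if line and line < key:
                lo = mid + 1          # answer lies strictly after mid
            else:
                hi = mid
        return line_after(f, lo)      # b"" if every line < key
```

A (url, time) lookup is then just a search for the line starting with that key, with the shard picked first by comparing the key against each shard's range.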
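
And the content-addressed layout from the closing advice, as a sketch (assuming SHA-1 and the two-level da/39/ fan-out from the example):

```python
import hashlib
import os

def store_blob(root: str, data: bytes) -> str:
    """Write `data` under its SHA-1 name and return the hex digest.

    The digest is what goes into the table's 'pointer' field in place
    of the blob. Since the name is derived from the content, the file
    is immutable by construction; changed data means a new file.
    """
    digest = hashlib.sha1(data).hexdigest()
    subdir = os.path.join(root, digest[:2], digest[2:4])   # e.g. da/39/
    os.makedirs(subdir, exist_ok=True)
    path = os.path.join(subdir, digest)
    if not os.path.exists(path):    # identical content is already stored
        with open(path, "wb") as f:
            f.write(data)
    return digest
```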

To which sam and raj of the Internet Archive replied:

Thanks! We were writing up a response at the same time: The Wayback Machine data is stored in WARC or ARC files[0], which are written at web crawl time by the Heritrix crawler[1] (or other crawlers) and stored as regular files in the archive.org storage cluster.

Playback is accomplished by binary searching a 2-level index of pointers into the WARC data. The second level of this index is a 20TB compressed sorted list of (url, date, pointer) tuples called CDX records[2]. The first level fits in core: a 13GB sorted list of every 3000th entry in the CDX index, each with a pointer to the larger CDX block it came from.

Index lookup works by binary searching the first-level list in core, then loading the appropriate second-level blocks from the CDX index via HTTP range requests. Finally, web page data is loaded by range-requesting the WARC data pointed to by the CDX records. Before final output, link re-writing and other transforms are applied to make playback work correctly in the browser. (A sketch of the lookup follows.)
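
A minimal sketch of that two-level lookup, with hypothetical names: `first_level` stands in for the in-core sample of every 3000th CDX key, and `CDX_URL` and `BLOCK` describe where and how the second level is served. None of these names come from the Archive's code.

```python
import bisect
import urllib.request

first_level = []   # sorted list of (cdx_key, byte_offset) samples, in core
CDX_URL = "https://cdx.example.org/index.cdx"   # hypothetical endpoint
BLOCK = 3000 * 80  # assumed rough size in bytes of one second-level block

def lookup(cdx_key: str) -> list[str]:
    """Binary search in core, then one HTTP range request to the big index."""
    i = bisect.bisect_right(first_level, (cdx_key, 2**63)) - 1
    start = first_level[max(i, 0)][1]         # offset of the covering block
    req = urllib.request.Request(
        CDX_URL, headers={"Range": f"bytes={start}-{start + BLOCK - 1}"})
    with urllib.request.urlopen(req) as resp:
        block = resp.read().decode("utf-8", "replace")
    # Each matching CDX line carries the (filename, offset) pointer into
    # a WARC file; fetching the capture itself is a second range request.
    return [ln for ln in block.splitlines() if ln.startswith(cdx_key)]
```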

The server stack:
  • frontend: Tengine + HAProxy to a pool of Wayback tomcat app servers[3]
  • backend: The redis-backed archive.org metadata API[4] for object location and nginx on linux (via ext4) for data service
  • [0] http://en.wikipedia.org/wiki/Web_ARChive
  • [1] https://github.com/internetarchive/heritrix3
  • [2] https://github.com/internetarchive/CDX-Writer
  • [3] https://github.com/internetarchive/wayback
  • [4] http://blog.archive.org/2013/07/04/metadata-api/

sytelus asked: Why not use a hashtable instead of binary search?

gojomo replied:

Former Archive employee (and still occasional contract contributor) here. This was one of my first questions when I joined in 2003!
Some Wayback Machine queries require sorted key traversal: listing all dates for which captures of a URL are available, discovering the capture nearest to a requested date, and listing all available URLs that begin with a certain URL prefix.
Maintaining the canonically-ordered master index of (URL, date, pointer) tuples – the 20TB second-level index rajbot mentions – allows all of these queries to be satisfied. And once you've got that artifact, individual capture lookups can be satisfied fairly efficiently too, so a distributed hashtable would just be something extra to maintain. (The sketch below shows why sorted order covers all three query types.)
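
To see why sorted order earns its keep, here's a small sketch (mine, not Archive code) of the three query types over an in-memory sorted list of (url, "YYYYMMDDHHMMSS") tuples; the real index is the on-disk CDX file, but the logic is the same:

```python
import bisect
from datetime import datetime

def captures_for(index, url):
    """All capture times for a url: one contiguous slice of the index."""
    lo = bisect.bisect_left(index, (url,))
    hi = bisect.bisect_left(index, (url + "\x00",))   # first key past url
    return index[lo:hi]

def nearest_capture(index, url, ts):
    """The capture of `url` closest in time to the 14-digit timestamp `ts`."""
    caps = captures_for(index, url)
    if not caps:
        return None
    t = lambda s: datetime.strptime(s, "%Y%m%d%H%M%S")
    i = bisect.bisect_left(caps, (url, ts))
    return min(caps[max(i - 1, 0):i + 1], key=lambda c: abs(t(c[1]) - t(ts)))

def urls_with_prefix(index, prefix):
    """All captures whose url starts with prefix: seek once, then scan."""
    lo = bisect.bisect_left(index, (prefix,))
    out = []
    for url, ts in index[lo:]:
        if not url.startswith(prefix):
            break
        out.append((url, ts))
    return out
```

All three begin with a single binary search and then read one contiguous range, which is exactly the access pattern the next paragraph describes the caches rewarding.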
Also, the queries aren't random: there are hot ranges, and even a single user's session begins with a range query (all dates for a URL), then visits one URL from that same range. Loading the nearest-date captures for the page's inline resources then hits similar ranges, as do follow-up clicks on outlinks or nearby dates. So even though the master index is still on spinning disk (unless there was a recent big SSD upgrade that escaped my notice), the ranges being browsed wind up in main-memory caches quite often.
There are no doubt many places that could be improved, but this basic sorted-index model has fit the application well for a long while, avoided too much domain-specific complexity, and been amenable to many generations of index/sharding/replication/internal-API tweaks.
BTW, the Archive is hiring for multiple technical roles, including a senior role developing a next-generation of the Wayback Machine: https://archive.org/about/jobs.php

An interesting question from Vecrios: how do they store such huge amounts of data without running out of space?

dwhly answered:

From a conversation with Brewster a few years ago: the doubling of disk drive density has allowed them to stay relatively neutral with respect to space for the Wayback Machine. It has occupied approximately the same physical footprint for the last 10 years: a set of racks about 15-20 feet long altogether, I think.
However, the new TV news and search capability requires substantially more space than even the web archive, IIRC, or is certainly heading that way.

And thanks to rietta for the observation that the Wayback Machine stores more pages than there are stars in the Milky Way Galaxy. A fabulous image.