Parallel Information Retrieval and Other Search Engine Goodness

Parallel Information Retrieval is a sample chapter from a book-in-progress titled Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher of Google Inc. and Charles L. A. Clarke and Gordon V. Cormack, both of the University of Waterloo. The full table of contents is on-line and looks really interesting:

Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects.

Currently available are the full texts of five chapters: Introduction, Basic Techniques, Static Inverted Indices, Index Compression, and Parallel Information Retrieval. Parallel Information Retrieval is really meaty:

Information retrieval systems often have to deal with very large amounts of data. They must be able to process many gigabytes or even terabytes of text, and to build and maintain an index for millions of documents. To some extent the techniques discussed in Chapters 5-8 can help us satisfy these requirements, but it is clear that, at some point, sophisticated data structures and clever optimizations alone are not sufficient anymore. A single computer simply does not have the computational power or the storage capabilities required for indexing even a small fraction of the World Wide Web.
In this chapter we examine various ways of making information retrieval systems scale to very large text collections such as the Web. The first part (Section 14.1) is concerned with parallel query processing, where the search engine's service rate is increased by having multiple index servers process incoming queries in parallel. It also discusses redundancy and fault tolerance issues in distributed search engines. In the second part (Section 14.2), we shift our attention to the parallel execution of off-line tasks, such as index construction and statistical analysis of a corpus of text. We explain the basics of MapReduce, a framework designed for massively parallel computations carried out on large amounts of data.
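The MapReduce idea the excerpt mentions can be illustrated with the canonical word-count example: a map function emits (key, value) pairs, a shuffle step groups values by key, and a reduce function combines each group. This toy sketch runs everything in one process; the function names are illustrative, not taken from the book, and a real MapReduce system would distribute the map and reduce calls across many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit (term, 1) for each token in one document.
    # In a real system, this runs in parallel on many machines.
    for term in text.lower().split():
        yield term, 1

def reduce_phase(term, counts):
    # Reduce: combine all intermediate values for a single key.
    return term, sum(counts)

def mapreduce(documents):
    # Shuffle: group intermediate (term, count) pairs by term,
    # so each distinct term gets exactly one reduce call.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for term, count in map_phase(doc_id, text):
            groups[term].append(count)
    return dict(reduce_phase(t, c) for t, c in groups.items())
```

For example, `mapreduce({"d1": "to be or not to be"})` yields term frequencies across the collection, the same kind of corpus statistic the chapter discusses computing at Web scale.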

Definitely worth a read right now, and I look forward to the rest of the book.