
How do we make a large real-time search engine?

We're building a content-oriented website that will get heavy public traffic, and we need a search engine to index the content (stored in a database, most likely MySQL InnoDB or Oracle) and run queries against those indexes.

The solution we've come up with is a separate service that re-indexes the contents of the database at regular intervals. This is complex and not optimal, though, since we would like new content to be indexed and searchable in real time.

Could you point me to some examples or articles I could review to design a solution for this context?

Reader Comments (1)

It seems that way anyway. You work so hard on getting your site up and working and then there's this giant search problem to solve that seems as big as everything you've already done. Unfortunately, I don't think there's a way out of that pain for large dynamic sites. :-(

Keeping search away from your main database is the way to go IMHO. You want your database doing the work only it can do: transactions. Loading it with search queries and blowing its caches might waste your precious database resources.

You could have the indexer work off read-only slaves. That would isolate the load, but it wouldn't necessarily be real-time.
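Roughly, an indexer working off a replica could poll for rows changed since the last pass. This is only a sketch under assumptions: the `contents(id, body, updated_at)` table and every function name here are hypothetical, `sqlite3` stands in for a connection to a read-only MySQL slave, and the index is a toy in-memory word map rather than a real search engine.

```python
import sqlite3
from collections import defaultdict


def fetch_changes(replica, watermark, limit=100):
    """Return rows changed after `watermark`, oldest first.

    `watermark` is an (updated_at, id) pair, so rows that share a
    timestamp are not skipped when LIMIT cuts a batch in the middle.
    """
    last_ts, last_id = watermark
    cur = replica.execute(
        "SELECT id, body, updated_at FROM contents "
        "WHERE updated_at > ? OR (updated_at = ? AND id > ?) "
        "ORDER BY updated_at, id LIMIT ?",
        (last_ts, last_ts, last_id, limit),
    )
    return cur.fetchall()


def index_rows(index, rows):
    """Add each row's words to a toy inverted index (word -> set of ids).

    Stale words from an edited document are not removed here; a real
    indexer would delete the old version before re-adding the document.
    """
    for doc_id, body, _updated_at in rows:
        for word in body.lower().split():
            index[word].add(doc_id)


def poll_once(replica, index, watermark):
    """Run one polling pass and return the new watermark."""
    rows = fetch_changes(replica, watermark)
    index_rows(index, rows)
    if not rows:
        return watermark
    last_id, _body, last_ts = rows[-1]
    return (last_ts, last_id)
```

Calling `poll_once` in a loop with a short sleep gives near-real-time indexing; the lag is the replication delay plus the polling interval, which is why this approach isn't strictly real-time.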

Here's a good discussion of using Lucene (http://lucene.apache.org) for real-time updates at http://www.gossamer-threads.com/lists/lucene/java-user/51517. Seems to be a lot of interesting issues (batching, garbage collection) around making Lucene update indexes quickly, but it seems possible.
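To give a feel for the batching issue, here's a minimal sketch of buffering updates and committing them in groups, so the expensive commit step runs once per batch instead of once per document. This is not Lucene's API: `BatchingIndexer`, the `commit` callback, and the size and age thresholds are all made up for illustration.

```python
import time


class BatchingIndexer:
    """Buffer document updates and commit them in batches."""

    def __init__(self, commit, max_docs=100, max_age_s=1.0, clock=time.monotonic):
        self._commit = commit        # hypothetical callback: receives {doc_id: body}
        self._max_docs = max_docs    # flush once this many docs are pending
        self._max_age_s = max_age_s  # or once the oldest pending doc is this old
        self._clock = clock
        self._pending = {}
        self._oldest = None

    def add(self, doc_id, body):
        """Queue an update; a later update to the same doc replaces the earlier one."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending[doc_id] = body
        self.flush_if_due()

    def flush_if_due(self):
        """Flush if the batch is full or too old; return True if a flush ran."""
        if not self._pending:
            return False
        too_many = len(self._pending) >= self._max_docs
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()
            return True
        return False

    def flush(self):
        """Hand all pending updates to the commit callback in one call."""
        if self._pending:
            batch, self._pending = self._pending, {}
            self._commit(batch)
```

The age limit is what bounds how stale search results can get: a smaller `max_age_s` means fresher results but more commits. Something also has to call `flush_if_due` periodically (a timer, say), otherwise a partly filled batch just sits there when no new documents arrive.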

I also wonder if the Google Custom Search engine at http://google.com/coop/cse/ might be an option? If you kept a parallel tree of documents for Google to search, then Google would probably search faster than your traditional options. There's also an API available with the paid versions. What I don't know is how fast they would respond to changes. They have this linked CSE product now and that might do the trick. It's worth a look anyway.

November 29, 1990 | Unregistered Commenter Todd Hoff
