How do we make a large real-time search engine?

We're building a content-oriented website that will get heavy public traffic, and we need a search engine to index the content (stored in a database, most likely MySQL InnoDB or Oracle) and run queries against those indexes.

The solution we came up with is a separate service that re-indexes the contents of the database at regular intervals. However, this is complex and not optimal, since we would like new content to be indexed and searchable in real time.

Could you point me to some examples or articles I could review to design a solution for this kind of situation?

Todd Hoff's picture

Aren't all search options painful?

It seems that way anyway. You work so hard on getting your site up and working and then there's this giant search problem to solve that seems as big as everything you've already done. Unfortunately, I don't think there's a way out of that pain for large dynamic sites. :-(

Keeping search away from your main database is the way to go IMHO. You want your database doing the work only it can do: transactions. Loading it down and blowing its caches for searches can waste your precious database resources.

You could have the indexer work off read-only slaves. That would isolate the load, but it wouldn't necessarily be real-time.
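To make the idea concrete, here is a rough sketch of an indexer that polls a read-only copy of the data and only picks up rows that changed since its last pass. Everything in it is an assumption for illustration: the `content` table, its monotonically increasing `version` column, the `IncrementalIndexer` class, and the use of an in-memory SQLite database as a stand-in for a real MySQL or Oracle replica. The inverted index is a toy, not a replacement for a real search engine.

```python
import sqlite3
from collections import defaultdict


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class IncrementalIndexer:
    """Polls a read-only connection and maintains a tiny in-memory inverted index.

    Hypothetical schema: content(id, body, version), where version grows
    with every insert or update.
    """

    def __init__(self, replica):
        self.replica = replica              # connection to the read-only copy
        self.watermark = 0                  # highest version already indexed
        self.postings = defaultdict(set)    # term -> set of document ids
        self.doc_terms = {}                 # document id -> terms currently indexed

    def poll_once(self):
        """Index every row changed since the last poll; return how many were seen."""
        rows = self.replica.execute(
            "SELECT id, body, version FROM content "
            "WHERE version > ? ORDER BY version",
            (self.watermark,),
        ).fetchall()
        for doc_id, body, version in rows:
            # Drop the stale postings for this document before re-adding it.
            for term in self.doc_terms.get(doc_id, ()):
                self.postings[term].discard(doc_id)
            terms = set(tokenize(body))
            for term in terms:
                self.postings[term].add(doc_id)
            self.doc_terms[doc_id] = terms
            self.watermark = version
        return len(rows)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

How fresh the results are depends on replication lag plus how often `poll_once` runs, which is why this approach isolates load well but is only near real-time.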

Here's a good discussion of using Lucene for real-time updates at http://www.gossamer-threads.com/lists/lucene/java-user/51517. There seem to be a lot of interesting issues (batching, garbage collection) around making Lucene update its indexes quickly, but it seems possible.
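The batching issue is easy to picture with a small sketch. This is not Lucene's API, just a generic illustration in Python: updates are buffered and handed to an index in one go once enough have piled up or the oldest one has waited too long. The `BatchingIndexWriter` class, its `commit` callback, and the `max_docs` / `max_delay` knobs are all made-up names for the example.

```python
import time


class BatchingIndexWriter:
    """Buffers document updates and commits them to an index in batches.

    `commit` is any callable that takes a list of (doc_id, text) pairs,
    for example a function that writes them to a real search index.
    """

    def __init__(self, commit, max_docs=100, max_delay=1.0, clock=time.monotonic):
        self.commit = commit
        self.max_docs = max_docs      # flush when this many documents are pending
        self.max_delay = max_delay    # ...or when the oldest has waited this long (seconds)
        self.clock = clock
        self.pending = {}             # doc_id -> latest text
        self.oldest = None            # time the current batch was started

    def add(self, doc_id, text):
        if not self.pending:
            self.oldest = self.clock()
        # A later update to the same document replaces the earlier one,
        # so the index does the work only once per batch.
        self.pending[doc_id] = text
        self.flush_if_due()

    def flush_if_due(self):
        """Flush if the batch is full or too old; return True if it flushed."""
        if not self.pending:
            return False
        too_many = len(self.pending) >= self.max_docs
        too_old = self.clock() - self.oldest >= self.max_delay
        if too_many or too_old:
            self.flush()
            return True
        return False

    def flush(self):
        if self.pending:
            batch = list(self.pending.items())
            self.pending = {}
            self.oldest = None
            self.commit(batch)
```

The trade-off is the usual one: a bigger `max_docs` or longer `max_delay` means fewer, cheaper commits but staler search results. Since `add` is the only thing that checks the age here, a real service would also call `flush_if_due` from a timer so a quiet period doesn't leave updates stuck in the buffer.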

I also wonder if the Google Custom Search engine at http://google.com/coop/cse/ might be an option? If you kept a parallel tree of documents for Google to crawl, Google would probably search it faster than your traditional options. There's also an API available with the for-pay versions. What I don't know is how fast they would respond to changes. They have this linked CSE product now and that might do the trick. It's worth a look anyway.
