We're building a content-oriented website that will get heavy public traffic, and we need a search engine to index the content (stored in a database, most likely MySQL InnoDB or Oracle) and run queries against those indexes.
The solution we found is a separate service that re-indexes the contents of the database at regular intervals. However, this is complex and not optimal, since we would like new content to be indexed and searchable in real time.
Could you point me to some examples or articles I could review to design a solution for this kind of context?
Aren't all search options painful?
It seems that way anyway. You work so hard on getting your site up and working and then there's this giant search problem to solve that seems as big as everything you've already done. Unfortunately, I don't think there's a way out of that pain for large dynamic sites. :-(
Keeping search away from your main database is the way to go IMHO. You want your database doing the work only it can do: transactions. Loading it down and blowing its caches for searches might waste your precious database resources.
You could have the indexer work off read-only slaves. That would isolate the load, but it wouldn't necessarily be real-time.
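To make the read-only slave idea a bit more concrete, here's a rough sketch of an indexer that polls for rows changed since its last pass. Everything in it is an assumption for illustration: the `articles` table, its `updated_at` column, the `index_changes` function, and the toy in-memory inverted index are all made up, and an in-memory SQLite database stands in for the replica connection.

```python
import sqlite3
from collections import defaultdict


def index_changes(conn, index, last_seen):
    """Index rows modified after last_seen and return the new watermark."""
    rows = conn.execute(
        "SELECT id, body, updated_at FROM articles "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for doc_id, body, updated_at in rows:
        # Drop stale postings for this document before re-adding it.
        for postings in index.values():
            postings.discard(doc_id)
        for word in body.lower().split():
            index[word].add(doc_id)
        last_seen = max(last_seen, updated_at)
    return last_seen


# Stand-in for a connection to a read-only replica (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, "real time search", 100), (2, "database replication lag", 200)],
)

index = defaultdict(set)          # word -> set of document ids
watermark = index_changes(conn, index, 0)

# A later edit is picked up on the next poll, using the saved watermark.
conn.execute("UPDATE articles SET body = 'search engine', updated_at = 300 WHERE id = 1")
watermark = index_changes(conn, index, watermark)
```

How fresh the search results are depends on the polling interval plus whatever replication lag the slave has, so this is near-real-time at best. A strict `updated_at > ?` comparison can also miss rows that share a timestamp with the watermark, so a real indexer would probably need a safer change marker.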
Here's a good discussion of using Lucene for real-time updates at http://www.gossamer-threads.com/lists/lucene/java-user/51517. There seem to be a lot of interesting issues (batching, garbage collection) around making Lucene update indexes quickly, but it seems possible.
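The batching issue can be sketched without any Lucene at all. This is not Lucene's API, just a hypothetical `BatchingIndexer` that buffers incoming documents and hands them to a flush callback once the batch is big enough or old enough. The class name, its parameters, and the callback are all assumptions for illustration.

```python
import time


class BatchingIndexer:
    """Buffer documents and flush them in batches by size or age."""

    def __init__(self, flush_fn, max_docs=100, max_age_seconds=1.0,
                 clock=time.monotonic):
        self.flush_fn = flush_fn            # called with a list of documents
        self.max_docs = max_docs
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.pending = []
        self.oldest = None                  # arrival time of the oldest pending doc

    def add(self, doc):
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append(doc)
        self.maybe_flush()

    def maybe_flush(self):
        # Also meant to be called from a timer so a small batch
        # does not sit in the buffer forever.
        if not self.pending:
            return
        too_many = len(self.pending) >= self.max_docs
        too_old = self.clock() - self.oldest >= self.max_age_seconds
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self.pending:
            batch, self.pending = self.pending, []
            self.flush_fn(batch)


# Usage: collect batches in a list instead of writing to a real index.
batches = []
indexer = BatchingIndexer(batches.append, max_docs=3, max_age_seconds=60.0)
for n in range(7):
    indexer.add({"id": n})
indexer.flush()                             # push out the leftover document
```

The trade-off is the one the thread talks about: bigger batches mean fewer expensive index commits, but documents wait longer before they show up in search.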
I also wonder if the Google Custom Search engine at http://google.com/coop/cse/ might be an option? If you kept a parallel tree of documents for Google to search, then Google would probably search faster than your traditional options. There's also an API available when using the for-pay versions. What I don't know is how fast they would respond to changes. They have this linked CSE product now and that might do the trick. It's worth a look anyway.