Job queue and search engine

Hi,

I want to implement a search engine with Lucene.
To be scalable, I would like to execute search jobs asynchronously (with a job queuing system).

But I don't know if it is a good design. Why?

Search results can be large! (e.g. 100+ pages with 25 documents per page)
With an asynchronous system, I need to store the results of each search job.
I can set a short expiration time (~5 min) on each search result, but that is still a lot of data to keep around.

What do you think about it?
Which design would you use for that?

Thanks
Mat

Re: Job queue and search engine

If you have 25 docs per page, each search only needs to return 25 results at most. Simply add paging (well, not really paging, just a next/prev link). Also, shouldn't the results simply be IDs with a very short human-readable summary rather than complete documents?

Re: Job queue and search engine

Thanks for your suggestion!
Yes, each document is a brief description plus one ID, so a single document is very light.

With my initial solution: 1 search job == all results.
The search job is executed once, and the paging system shows only part of the results.
The problem is: 25 documents * X pages * Y users may be huge.
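
To put rough numbers on that concern, here is a back-of-the-envelope sketch in Python. The 200 bytes per hit and the 1000 searches still alive inside the expiry window are assumptions for illustration, not figures from this thread.

```python
def cached_result_bytes(docs_per_page, pages, live_searches, bytes_per_doc):
    """Rough upper bound on the memory needed to hold stored search results."""
    return docs_per_page * pages * live_searches * bytes_per_doc

# Assumed: 200 bytes per hit (ID + brief description) and
# 1000 searches whose results have not expired yet.
full_results = cached_result_bytes(25, 100, 1000, 200)  # all 100 pages stored
single_page = cached_result_bytes(25, 1, 1000, 200)     # one page per search

print(full_results / 1e6, "MB when every search stores all its pages")
print(single_page / 1e6, "MB when every search stores one page")
```

Under these assumptions the gap is two orders of magnitude, so the answer depends mostly on how many searches are alive at once and how many pages each one stores.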

With your solution: 1 search job == results for 1 page.
One search job is executed for each page view.
In terms of storage, your solution is good and lightweight.
But the risk is that navigating between pages may be slow, because it generates more jobs.

In terms of CPU, according to the Lucene FAQ:
* How do I implement paging, i.e. showing result from 1-10, 11-20 etc?
-> Just re-execute the search and ignore the hits you don't want to show. As people usually look only at the first results this approach is usually fast enough.

Each page view will send the same query to the Lucene engine again.
It may be slower for the user and more CPU-intensive for the search job workers.
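
The "re-execute and ignore the hits you don't want" approach from the FAQ can be sketched as follows. This is Python with a stand-in for the engine: `search_top_n` and the toy `index` dictionary are hypothetical and only mimic a top-N search, they are not the Lucene API.

```python
def search_top_n(index, query, n):
    """Hypothetical stand-in for a top-N search: returns the n best (score, doc_id) hits."""
    needle = query.lower()
    hits = [(text.lower().count(needle), doc_id) for doc_id, text in index.items()]
    hits = [hit for hit in hits if hit[0] > 0]
    # Best score first, ties broken by doc_id so the order is stable between runs.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits[:n]

def fetch_page(index, query, page, page_size=25):
    """Re-run the search for the first page * page_size hits and keep only the last page."""
    top = search_top_n(index, query, page * page_size)
    start = (page - 1) * page_size
    return [doc_id for _, doc_id in top[start:start + page_size]]

# Toy index: 60 documents that all match the query equally well.
index = {doc_id: "lucene paging" for doc_id in range(1, 61)}
first_page = fetch_page(index, "lucene", page=1)
last_page = fetch_page(index, "lucene", page=3)
```

Nothing is stored between page views, but the work per request grows with the page number, since page N has to collect N * 25 hits before discarding most of them. The stable tie-break matters: if the order changed between runs, hits could repeat or vanish across pages.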

I will try an intermediate approach:

One search job will generate results for 5 pages.
Each search result will be cached (memcache).
When the user asks for page 4, it will generate a new search job in the background for pages 6-10 (if not already in the cache, of course).
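
A minimal sketch of this block-plus-prefetch idea, in Python. Everything here is an assumption for illustration: `BlockCache` is an in-memory stand-in for memcache, `run_search(query, offset, limit)` stands for the search worker, and `enqueue` stands for the job queue's submit call.

```python
import time

PAGES_PER_BLOCK = 5
PAGE_SIZE = 25
TTL_SECONDS = 300  # the ~5 min expiry mentioned earlier in the thread

class BlockCache:
    """In-memory stand-in for memcache: key -> (expiry time, value)."""
    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def set(self, key, value, ttl=TTL_SECONDS):
        self._data[key] = (self._clock() + ttl, value)

def load_block(cache, run_search, query, block):
    """Return the hits for one block of 5 pages, running the search on a cache miss."""
    key = (query, block)
    hits = cache.get(key)
    if hits is None:
        limit = PAGES_PER_BLOCK * PAGE_SIZE
        hits = run_search(query, block * limit, limit)
        cache.set(key, hits)
    return hits

def get_page(cache, run_search, enqueue, query, page):
    """Serve one page (1-based) and queue a prefetch of the next block near the block's end."""
    block = (page - 1) // PAGES_PER_BLOCK
    position = (page - 1) % PAGES_PER_BLOCK  # 0..4 inside the block
    hits = load_block(cache, run_search, query, block)
    # From the 4th page of a block onward, prepare the next block in the background.
    # A real system would also track pending jobs to avoid queuing the same block twice.
    if position >= PAGES_PER_BLOCK - 2 and cache.get((query, block + 1)) is None:
        enqueue(lambda: load_block(cache, run_search, query, block + 1))
    start = position * PAGE_SIZE
    return hits[start:start + PAGE_SIZE]
```

With this layout a user who only looks at the first few pages costs one search and one cached block of 125 hits, and the second block is only computed once someone reaches page 4.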

In terms of CPU and memory usage, it should be a lighter compromise than either of the two previous solutions.
What do you think about it? Do you see any other possible improvements?
Do you think I'm paranoid in terms of storage? ;)

Re: Job queue and search engine

SERPs have to be fast in order to deliver effective results; otherwise they lose their priority.
