thr3ads.net - Xapian discuss - [Xapian-discuss] Design-question/problem [Aug 2009]

Carsten Reimer

2009-Aug-28 11:59 UTC

[Xapian-discuss] Design-question/problem

Dear list,

we are using the Xapian Python bindings to build a full-text search 
engine for 400+ books of 300+ pages each.

We have the need to be able to limit the search on one book or on 
several selected books as well as to be able to search all of them.

To be able to do so we decided to create one Xapian-database for each 
book and build the databases we need to search for the different use 
cases described above dynamically.

The flow is as follows.

We provide the paths to the per-book Xapian databases to a function 
that builds our search database. From this we build the 
QueryParser objects, which in turn we use to create the Query objects 
from which the Enquire objects are finally built.
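
For readers following the thread, the flow described above might look roughly like the sketch below. It assumes the bindings are importable as `xapian`; the function names `build_search_database` and `search` and the `paths` argument are hypothetical and not taken from the application.

```python
def build_search_database(paths):
    """Combine several per-book databases into one searchable Database."""
    import xapian  # imported lazily; requires the Xapian Python bindings

    combined = xapian.Database()  # empty database that others are added to
    for path in paths:
        # Opening each per-book database is the step that costs time and
        # file descriptors when there are 400+ of them.
        combined.add_database(xapian.Database(path))
    return combined


def search(paths, query_string, limit=10):
    """Parse query_string and return up to `limit` matches from the given books."""
    import xapian

    database = build_search_database(paths)
    parser = xapian.QueryParser()
    parser.set_database(database)
    query = parser.parse_query(query_string)
    enquire = xapian.Enquire(database)
    enquire.set_query(query)
    return enquire.get_mset(0, limit)
```

Every call to `search` reopens all the listed databases, which is the per-request cost described in the next paragraph.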

Having 400+ single Xapian databases, we found that when searching all of 
the books it took a lot of time (and file descriptors) to open all the 
database files for each book while building the search database (and 
apparently, at least with the Python bindings, the file descriptors are 
kept open during the lifetime of the database object).

This behaviour increases the response times of our application (a 
django-based web app) dramatically. So we decided to keep the large 
search database for all books in memory by creating it when django is 
started. This works well but unfortunately hurts scalability badly. 
Stress testing the app using Apache bench with really low numbers of 
requests and concurrency (100/4) leads to erroneous responses, because 
not enough backend processes/threads (django behind lighttpd) can be 
provided in time with the in-memory search database in place. With the 
in-memory search database disabled, 5000/200 was no problem.

So we are a little unsure if we used Xapian the wrong way or Xapian may 
not be suitable for our needs.

Any ideas, hints, whatever are warmly welcome.

Thanks in advance

with best regards

Carsten Reimer


-- 
Carsten Reimer
Web Developer
carsten.reimer at galileo-press.de
Phone +49.228.42150.73

Galileo Press GmbH
Rheinwerkallee 4 - 53227 Bonn - Germany
Phone +49.228.42150.0 (Zentrale) .77 (Fax)
http://www.galileo-press.de/

Managing Directors: Tomas Wehren, Ralf Kaulisch, Rainer Kaltenecker
HRB 8363 Amtsgericht Bonn

John Leach

2009-Aug-28 12:23 UTC

[Xapian-discuss] Design-question/problem

On Fri, 2009-08-28 at 13:59 +0200, Carsten Reimer wrote:
> we are using the Xapian-python bindings to build some fulltext search 
> engine for some 400+ books each about 300+ pages.
> 
> We have the need to be able to limit the search on one book or on 
> several selected books as well as to be able to search all of them.
> Any ideas, hints, whatever are warmly welcome.
Hi Carsten,

one database for all books is likely the way to go.  A quick solution
would be to ensure a unique id is added as a token for each book.  When
you want to confine a search to one book, ensure that token is included
in the search.

Quick example, if each book has a unique id of the form "book88", then
the search could look like:

"book88 AND (quick brown fox)"

or perhaps: "+book88 quick brown fox"

I'd probably personally do this with term prefixes, but that takes a bit
more work to set up - it does make sure that books that happen to have
the term book88 in them somewhere don't turn up in the search :)  With
term prefixes the search might look something like this:

"+book_id:88 quick brown fox"

I believe that the most efficient way to do the query would actually be
to use the OP_FILTER query on the book id term - not sure if you can do
this via the query parser, so you'd have to build that yourself.
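
A minimal sketch of the term-prefix idea, assuming one Xapian document per page and an "XB" prefix for the book id. The prefix, the `book_id` field name and all function names here are illustrative choices, not something the thread prescribes. As far as I understand the QueryParser, a prefix registered with `add_boolean_prefix` is applied as a filter rather than as a weighted term, so a query such as "book_id:88 quick brown fox" should not need the filter to be built by hand.

```python
BOOK_PREFIX = "XB"  # assumed prefix for book ids; pick any unused one


def book_term(book_id):
    """Return the term that marks a document as belonging to a book."""
    return "%s%s" % (BOOK_PREFIX, book_id)


def index_page(database, book_id, page_text):
    """Add one page to a writable database, tagged with its book's term."""
    import xapian  # imported lazily; requires the Xapian Python bindings

    document = xapian.Document()
    generator = xapian.TermGenerator()
    generator.set_document(document)
    generator.index_text(page_text)        # terms from the page text
    document.add_term(book_term(book_id))  # e.g. "XB88" for book 88
    database.add_document(document)


def make_parser(database):
    """Return a QueryParser that maps book_id:N to the XB-prefixed term."""
    import xapian

    parser = xapian.QueryParser()
    parser.set_database(database)
    parser.add_boolean_prefix("book_id", BOOK_PREFIX)
    return parser
```

Because the prefixed term "XB88" cannot be produced from ordinary page text, a page that merely mentions "book88" is not mistaken for a page of book 88.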

Hope this is useful,

John.
-- 
http://johnleach.co.uk
http://www.brightbox.co.uk

James Aylett

2009-Aug-28 12:43 UTC

[Xapian-discuss] Design-question/problem

On Fri, Aug 28, 2009 at 01:59:21PM +0200, Carsten Reimer wrote:
> we are using the Xapian-python bindings to build some fulltext search 
> engine for some 400+ books each about 300+ pages.
> We have the need to be able to limit the search on one book or on 
> several selected books as well as to be able to search all of them.
> 
> To be able to do so we decided to create one Xapian-database for each 
> book and build the databases we need to search for the different use 
> cases described above dynamically.
Hi, Carsten. As John points out, another way to approach this is to
use a single database, and to add a single term to each document,
identifying the book it came from. John mentions prefixes, but I
thought I'd provide some sample (if untested) code to try to explain
them a little.

----------------------------------------------------------------------
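import xapian

# One boolean term per selected book: the 'XB' prefix plus the book number.
boolean_terms = [ 'XB1', 'XB2' ]
qp = xapian.QueryParser()
# ... configuration of qp (eg: stemming, prefixes)
p_query = qp.parse_query(request.GET.get('q', ''))
# Each document carries only one book term, so match any of them (OP_OR).
b_query = xapian.Query(xapian.Query.OP_OR, boolean_terms)
# Rank by p_query, but keep only documents that also match b_query.
query = xapian.Query(xapian.Query.OP_FILTER, p_query, b_query)
----------------------------------------------------------------------

Where XB1, XB2 are identifying books 1 and 2 respectively. (You can
choose your own prefix; see
<http://xapian.org/docs/omega/termprefixes.html>.)

OP_FILTER only uses the left query (the p_query, which your users
typed in) for weights, but otherwise behaves like OP_AND, so a
document must also match b_query to be returned. The book terms
themselves are combined with OP_OR, because each document carries
only one of them; combining them with OP_AND would match nothing
when more than one book is selected.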
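
A sketch of how this could be wrapped to cover the three cases from the original question (one book, several selected books, all books). The helper names and the default "XB" prefix are assumptions for illustration. The book terms are combined with OP_OR on the assumption that each document carries exactly one book term.

```python
def book_terms(book_ids, prefix="XB"):
    """Return the prefixed terms for the given book ids."""
    return [prefix + str(book_id) for book_id in book_ids]


def restrict_to_books(parsed_query, book_ids):
    """Limit parsed_query to the given books; an empty selection means all books."""
    import xapian  # imported lazily; requires the Xapian Python bindings

    terms = book_terms(book_ids)
    if not terms:
        return parsed_query  # no restriction: search every book
    # A document belongs to one book, so any one of the terms may match.
    book_filter = xapian.Query(xapian.Query.OP_OR, terms)
    # Weights come from parsed_query only; book_filter just narrows the set.
    return xapian.Query(xapian.Query.OP_FILTER, parsed_query, book_filter)
```

With a single database, `restrict_to_books(p_query, [88])` searches one book and `restrict_to_books(p_query, [])` searches all of them, without opening a database per book.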

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org
