Dear list,

we are using the Xapian Python bindings to build a full-text search engine for 400+ books of 300+ pages each.

We need to be able to limit a search to one book or to several selected books, as well as to search all of them.

To do so, we decided to create one Xapian database per book and to dynamically build the combined database needed for each of the use cases described above.

The flow is as follows. We pass the paths of the per-book Xapian databases to a function that builds our search database. From this we build the QueryParser objects, which we in turn use to create the Query objects, from which the Enquire objects are finally built.

With 400+ separate Xapian databases, we found that when searching all of the books it takes a lot of time (and file descriptors) to open the database files for every book while building the search database (and apparently, at least with the Python bindings, the file descriptors are kept open for the lifetime of the database object).

This behaviour increases the response times of our application (a Django-based web app) dramatically. So we decided to keep the large search database for all books in memory by creating it when Django starts.

This works well but unfortunately hurts scalability badly. Stress testing the app with Apache Bench at really low request and concurrency numbers (100/4) leads to erroneous responses, because not enough backend processes/threads (Django behind lighttpd) can be provided in time due to the in-memory search database (with the in-memory search database disabled, 5000/200 was no problem).

So we are a little unsure whether we are using Xapian the wrong way or whether Xapian may not be suitable for our needs.

Any ideas, hints, whatever are warmly welcome.
Thanks in advance
with best regards
Carsten Reimer

--
Carsten Reimer
Web Developer
carsten.reimer at galileo-press.de
Phone +49.228.42150.73

Galileo Press GmbH
Rheinwerkallee 4 - 53227 Bonn - Germany
Phone +49.228.42150.0 (main line) .77 (Fax)
http://www.galileo-press.de/

Managing Directors: Tomas Wehren, Ralf Kaulisch, Rainer Kaltenecker
HRB 8363 Amtsgericht Bonn

On Fri, 2009-08-28 at 13:59 +0200, Carsten Reimer wrote:
> we are using the Xapian-python bindings to build some fulltext search
> engine for some 400+ books each about 300+ pages.
>
> We have the need to be able to limit the search on one book or on
> several selected books as well as to be able to search all of them.

> Any ideas, hints, whatever are warmly welcome.

Hi Carsten,

one database for all books is likely the way to go.

A quick solution would be to add a unique id as a token to every document of each book. When you want to confine a search to one book, make sure that token is included in the search.

Quick example: if each book has a unique id of the form "book88", then the search could look like:

"book88 AND (quick brown fox)"

or perhaps:

"+book88 quick brown fox"

I'd personally do this with term prefixes. That takes a bit more work to set up, but it makes sure that books that happen to contain the term book88 somewhere in their text don't turn up in the search :)

With term prefixes the search might look something like this:

"+book_id:88 quick brown fox"

I believe that the most efficient way to do the query would actually be to use the OP_FILTER query on the book id term - I'm not sure if you can do this via the query parser, so you might have to build that query yourself.

Hope this is useful,

John.
--
http://johnleach.co.uk
http://www.brightbox.co.uk

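On the query parser question: the Xapian QueryParser class has an add_boolean_prefix() method that maps a field name typed by the user to a term prefix and treats such terms as filters. The following is a minimal, untested sketch of the prefix approach built on that method. The field name book_id, the prefix XB and both helper functions are illustrative assumptions, not part of the setup described in this thread.

```python
try:
    import xapian  # Xapian Python bindings
except ImportError:  # lets the string helper below work without the bindings
    xapian = None

BOOK_FIELD = "book_id"  # assumed field name that appears in query strings
BOOK_PREFIX = "XB"      # assumed term prefix for book ids


def restricted_query_string(book_id, user_query):
    """Prepend a book filter such as 'book_id:88' to the user's query."""
    return "%s:%s %s" % (BOOK_FIELD, book_id, user_query)


def make_query_parser(db):
    """Return a QueryParser that treats 'book_id:N' as a boolean filter.

    Requires the xapian bindings; db is an opened xapian.Database.
    """
    qp = xapian.QueryParser()
    qp.set_database(db)
    # 'book_id:88' becomes the term 'XB88'. Boolean-prefixed terms filter
    # the results without contributing to the relevance weights.
    qp.add_boolean_prefix(BOOK_FIELD, BOOK_PREFIX)
    return qp
```

With such a parser, qp.parse_query(restricted_query_string(88, "quick brown fox")) would be expected to confine the match to documents carrying the term XB88, provided that term was added to the documents at indexing time.
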
On Fri, Aug 28, 2009 at 01:59:21PM +0200, Carsten Reimer wrote:
> we are using the Xapian-python bindings to build some fulltext search
> engine for some 400+ books each about 300+ pages.
> We have the need to be able to limit the search on one book or on
> several selected books as well as to be able to search all of them.
>
> To be able to do so we decided to create one Xapian-database for each
> book and build the databases we need to search for the different use
> cases described above dynamically.

Hi, Carsten.

As John points out, another way to approach this is to use a single database, and to add a single term to each document, identifying the book it came from. John mentions prefixes, but I thought I'd provide some sample (if untested) code to try to explain them a little.

----------------------------------------------------------------------
import xapian

# Boolean terms identifying the selected books (prefix XB + book id)
boolean_terms = ['XB1', 'XB2']

qp = xapian.QueryParser()
# ... configuration of qp (eg: stemming, prefixes)

# Parse what the user typed (request is the Django request object)
p_query = qp.parse_query(request.GET.get('q', ''))
# Each document belongs to one book, so match any of the selected books
b_query = xapian.Query(xapian.Query.OP_OR, boolean_terms)
# Restrict p_query to documents matching b_query; b_query adds no weight
query = xapian.Query(xapian.Query.OP_FILTER, p_query, b_query)
----------------------------------------------------------------------

Here XB1 and XB2 identify books 1 and 2 respectively. (You can choose your own prefix; see <http://xapian.org/docs/omega/termprefixes.html>.)

OP_FILTER only uses the left query (the p_query, which your users typed in) for weights, but otherwise behaves like OP_AND, so a document must also match b_query (that is, carry one of the listed book terms) to be returned.

J

--
James Aylett
talktorex.co.uk - xapian.org - uncertaintydivision.org
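
Both replies assume that every document in the single database already carries a term naming its book. A minimal, untested sketch of that indexing side follows. The one-document-per-page layout, the function names and the XB prefix convention are assumptions made for illustration only.

```python
try:
    import xapian  # Xapian Python bindings
except ImportError:  # lets book_term() be used without the bindings
    xapian = None


def book_term(book_id, prefix="XB"):
    """Build the boolean term for a book, e.g. book 1 -> 'XB1'."""
    return "%s%s" % (prefix, book_id)


def index_page(db, book_id, page_text):
    """Add one page to a xapian.WritableDatabase, tagged with its book.

    Requires the xapian bindings. Returns the new document id.
    """
    doc = xapian.Document()
    doc.set_data(page_text)

    # Index the page text as ordinary, weighted terms.
    termgen = xapian.TermGenerator()
    termgen.set_document(doc)
    termgen.index_text(page_text)

    # Tag the page with its book; a boolean term is only used for
    # filtering and does not affect relevance weights.
    doc.add_boolean_term(book_term(book_id))
    return db.add_document(doc)
```

Pages indexed this way can then be restricted to selected books by filtering on terms such as XB1 and XB2.
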