Hi,

I've created a Xapian db of CVs submitted to the company I work for (~800,000), which gives a database directory of ~6GB. Unfortunately some of the searches can take several seconds to complete, and this gets dramatically worse for concurrent queries.

In terms of indexing, I've just indexed everything, as we want a free-text system. I am using the Search::Xapian Perl package.

Any ideas as to what I could do to speed things up? One potential problem is that the index is not really aged, as old CVs may be updated as well as new ones added.

The machines are reasonably specced (dual 2.4GHz Xeons with 5GB memory).

TIA
Jeremy

--------------------------------------------------------------------
mail2web - Check your email from the web at http://mail2web.com/ .
On Thu, Sep 29, 2005 at 10:39:38AM -0400, tech@dbx.co.uk wrote:
> I've created a Xapian db of CVs submitted to the company I work for
> (~800,000) to give a directory of ~6GB. Unfortunately some of the searches
> can take several seconds to complete, and this gets dramatically worse for
> concurrent queries.

Are there particular types of query that are taking longer? E.g. phrase searches, or more than a certain number of terms?

> Any ideas as to what I could do to speed things up? One potential problem
> is that the index is not really aged as old CVs may be updated as well as
> new ones.

You could get some improvements by searching against a compacted database, updating a non-compacted one and then regenerating the compacted one regularly.

> The machines are reasonably specced (dual 2.4Ghz Xeons with 5GB memory)

What OS?

J

--
James Aylett  xapian.org
james@tartarus.org  uncertaintydivision.org
Hi James,

Annoyingly, it doesn't seem to be down to the complexity of the query, more to the frequency of the term. A few stats:

a query for the single term
  sales
takes 20-25s, whereas
  account manager,sales,telecoms
takes 1s.

It's on Debian Woody on one machine and Fedora Core 4 on the 64-bit box.

I'm trying the compaction today, using quartzcompact. Is it the case that I can update an index once compacted if I use the -n flag?

Thanks
Jeremy
On Thu, Sep 29, 2005 at 10:39:38AM -0400, tech@dbx.co.uk wrote:
> I've created a Xapian db of CVs submitted to the company I work for
> (~800,000) to give a directory of ~6GB. Unfortunately some of the searches
> can take several seconds to complete, and this gets dramatically worse for
> concurrent queries.

The database size sounds rather large for 800K documents. Can you post an ls -l of the database directory to give us an idea which tables are large?

Cheers,
Olly
James,

It could be the nub of the problem, as I'm not sure I understand how Xapian works, but all I've got in the document data is a number (the id of the CV in a MySQL database). The indexing process I go through is basically:

  get text from MySQL
  -> add each word as a term to a document
  -> add the id to the document as data
  -> add the modification time as a value
  -> bin the text (as the rest of the application (historically) uses the db).

This means that all Xapian gives me back is a number.

Just remembered: when I search, I order the results by the modification time that I store as a value. Maybe it's the sort?

Ralf, you did ask :)

    use Search::Xapian qw(:all);   # for OP_AND

    # Fragment from inside the search routine; $db and $qp are reused between calls.
    $db or $db = Search::Xapian::Database->new($database);
    $qp or $qp = new Search::Xapian::QueryParser($db);
    $qp->set_default_op(OP_AND);

    print "Query String $term\n";
    my $enq = $db->enquire($qp->parse_query($term));
    printf "Parsing query '%s'\n", $enq->get_query()->get_description();

    # Sort by mod time (stored as value number 2)
    $enq->set_sorting(2, 1);

    # This gets the data & the metadata
    my $mset = $enq->get_mset(0, $limit);

    # $ms_size and $beginIt were not defined in the snippet as posted;
    # presumably they were set along these lines.
    my $ms_size = $mset->size();
    my $beginIt = $mset->begin();

    if ($ms_size > 0) {
        my $i = 0;
        while ($beginIt) {
            my $doc = $beginIt->get_document();
            $results .= sprintf("%s\n", $doc->get_data());
            $i++;
            $i == $ms_size and last;   # stop after the last match
            ++$beginIt;
        }
    } else {
        $results = "Nothing found for " . $enq->get_query()->get_description() . "\n";
    }
    return $results;
Eureka! It's the sorting that does it. Shows how it helps to talk to people, even if it only gets me thinking. All I need to do now is work out how to get around the sorting; maybe I'll try doing it myself after I've got the results.

Here's the 'ls -l' anyway.

ls -l xapian_dirs
total 6363884
-rw-r--r-- 1 jeremy staff         10 Sep 28 10:45 meta
-rw-r--r-- 1 jeremy staff          0 Sep 28 10:45 position_DB
-rw-r--r-- 1 jeremy staff         14 Sep 28 22:09 position_baseA
-rw-r--r-- 1 jeremy staff         14 Sep 28 22:12 position_baseB
-rw-r--r-- 1 jeremy staff 2959417344 Sep 28 22:12 postlist_DB
-rw-r--r-- 1 jeremy staff      45178 Sep 28 22:09 postlist_baseA
-rw-r--r-- 1 jeremy staff      45178 Sep 28 22:12 postlist_baseB
-rw-r--r-- 1 jeremy staff   28377088 Sep 28 22:12 record_DB
-rw-r--r-- 1 jeremy staff        450 Sep 28 22:09 record_baseA
-rw-r--r-- 1 jeremy staff        428 Sep 28 22:12 record_baseB
-rw-r--r-- 1 jeremy staff 3469115392 Sep 28 22:12 termlist_DB
-rw-r--r-- 1 jeremy staff      52955 Sep 28 22:09 termlist_baseA
-rw-r--r-- 1 jeremy staff      52955 Sep 28 22:12 termlist_baseB
-rw-r--r-- 1 jeremy staff   53084160 Sep 28 22:12 value_DB
-rw-r--r-- 1 jeremy staff        827 Sep 28 22:09 value_baseA
-rw-r--r-- 1 jeremy staff        827 Sep 28 22:12 value_baseB

Original Message:
-----------------
From: Olly Betts olly@survex.com
Date: Thu, 29 Sep 2005 16:49:23 +0100
To: xapian-discuss@lists.xapian.org, tech@dbx.co.uk
Subject: Re: [Xapian-discuss] Long query times

On Thu, Sep 29, 2005 at 04:36:37PM +0100, James Aylett wrote:
> On Thu, Sep 29, 2005 at 04:32:51PM +0100, Olly Betts wrote:
> > The database size sounds rather large for 800K documents. Can you post
> > an ls -l of the database directory to give us an idea which tables are
> > large?
>
> I'm guessing record. That's very much a finger in the air approach
> though :-)

A large record table shouldn't matter though - we'll only be reading the record table entries needed to show the search results, which is usually 10 or so.

Unless Jeremy's code is reading the document data for all the matching documents...

Cheers,
Olly
I now remember why I didn't do this: what I want is all relevant CVs returned to me in date order, rather than by textual relevance, as the later ones are of more interest than the earlier ones, even if the earlier ones score more highly.

Is there a way to turn the relevance sorting off, so that it just did a simple match? If I did, would the results come back in date order?

Thanks again,
Jeremy

Original Message:
-----------------
From: Olly Betts olly@survex.com
Date: Thu, 29 Sep 2005 16:55:20 +0100
To: tech@dbx.co.uk, xapian-discuss@lists.xapian.org
Subject: Re: [Xapian-discuss] Long query times

> Just remembered, when I search I order the results by the modification time
> that I store as a value -maybe it's the sort?

Ah, that'll be the issue. Values aren't stored in a particularly efficient way considering how they actually get used nowadays (hindsight is 20-20). Flint will fix that...

If you arrange to add documents in modification time order (and when updating a document delete and add it rather than replacing it) then you can just search ordered by reverse document id to get "sort by modification time". This is how the gmane search does "Sort by Date".

It's not as fast as it could be (ideally we want to run the postlists backwards in this case) but it'll be faster than sorting on a value is ever likely to be.

Cheers,
Olly
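Olly's suggestion above (add documents in modification-time order, and delete then re-add on update rather than replacing) could be sketched in Perl roughly as follows. This is only an illustration under assumptions that are not from the thread: the helper name reindex_cv is hypothetical, the "Q" + id unique term is an assumed convention for finding the old copy of a CV (Jeremy's index as described stores the id only as document data), and it assumes a Search::Xapian release that provides delete_document_by_term. For the initial build, the CVs would need to be fed in ordered by modification time.

```perl
use strict;
use warnings;
use Search::Xapian;

# Hypothetical helper (not from the thread): index or re-index one CV so that
# the most recently modified CV always ends up with the highest docid.
sub reindex_cv {
    my ($wdb, $cv_id, $text) = @_;

    # Assumed convention: one unique term per CV, used to locate the old copy.
    my $id_term = "Q$cv_id";

    my $doc = Search::Xapian::Document->new();
    $doc->set_data($cv_id);                       # all we need back is the id
    $doc->add_term($id_term);
    $doc->add_term(lc $_) for $text =~ /\w+/g;    # every word as a term

    # Delete then add (rather than replace), so the document gets a NEW docid.
    $wdb->delete_document_by_term($id_term);
    return $wdb->add_document($doc);              # returns the new docid
}
```

Because an updated CV is removed and appended again, docid order keeps tracking modification order, which is what makes "reverse docid order" usable as "newest first".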
On Fri, Sep 30, 2005 at 05:33:58AM -0400, tech@dbx.co.uk wrote:
> Is there a way to turn the relevance sorting off, so that it just did a
> simple match? If I did would the results come back in date order?

That is what I was suggesting. If you set BoolWeight you get the documents back in docid order (or reverse docid order), which is the same as date order if you added them in date order.

See (in particular the "Note:" paragraph):

http://www.xapian.org/docs/apidoc/html/classXapian_1_1Enquire.html#a6

Cheers,
Olly
Unfortunately these methods don't appear to be implemented in the Perl library.

Jeremy
On Fri, Sep 30, 2005 at 12:13:41PM -0400, derbex@pop3.uklinux.net wrote:
> Riiiight -can't say anything obvious leaps out of the documentation. The
> only two methods labelled 'For compatibility with Xapian 0.8.5 and earlier'
> in Enquire are set_sorting() -which is what I'm trying to get away from,
> and set_sort_forward()?

It's the latter - set_docid_order replaces set_sort_forward. It was rather poorly named before, and people tended to think it controlled the order when sorting on a value (which wasn't possible before).

Old "true" is new ASCENDING, while "false" is DESCENDING. DONT_CARE is new in the new interface (conceptually it allows the matcher and backend to pick whichever order is most efficient to use, but currently it always results in ASCENDING).

Hmm, looking at the docs I realise that as well as saying they're for compatibility, we should point users at the replacement methods! I'll fix that...

Cheers,
Olly
Thanks for that Olly,

I must be being thick here -but how do I set the weighting scheme to Boolean only though? The Weight classes aren't implemented in Perl.

Cheers,
Jeremy
Thanks Olly & Marcus, that's done it. Performance is excellent now.

Jeremy

Original Message:
-----------------
From: Olly Betts olly@survex.com
Date: Mon, 3 Oct 2005 14:00:45 +0100
To: tech@dbx.co.uk, xapian-discuss@lists.xapian.org
Subject: Re: [Xapian-discuss] Long query times

On Mon, Oct 03, 2005 at 05:26:06AM -0400, tech@dbx.co.uk wrote:
> how do I set the weighting scheme to Boolean only though?
> The Weight classes aren't implemented in Perl.

They were added in Search::Xapian 0.9.2.2, thanks to a patch from Marcus Ramberg. You probably need to upgrade.

Cheers,
Olly
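The search-side change that resolved this thread (BoolWeight plus reverse docid order in place of set_sorting on a value) might look roughly like this in Perl. It is a sketch, not Jeremy's actual code: the sub name search_newest_first is hypothetical, it assumes Search::Xapian 0.9.2.2 or later (per Olly's note on when the Weight classes arrived), and it assumes the ENQ_DESCENDING constant is exported by the :all tag. It only returns newest-first results if the documents were added in modification-time order, as discussed earlier in the thread.

```perl
use strict;
use warnings;
use Search::Xapian qw(:all);   # assumed to export OP_AND and ENQ_DESCENDING

# Hypothetical helper (not from the thread): return the stored CV ids for a
# query, highest docid first, with no relevance ranking and no value sort.
sub search_newest_first {
    my ($db, $query_string, $limit) = @_;

    my $qp = Search::Xapian::QueryParser->new($db);
    $qp->set_default_op(OP_AND);

    my $enq = $db->enquire($qp->parse_query($query_string));

    # Pure boolean match: every match has the same weight, so results come
    # back in docid order instead of relevance order.
    $enq->set_weighting_scheme(Search::Xapian::BoolWeight->new());

    # Reverse docid order; replaces the old set_sort_forward(false).
    $enq->set_docid_order(ENQ_DESCENDING);

    # Only the requested page of document data is read.
    return map { $_->get_document()->get_data() } $enq->matches(0, $limit);
}
```

A call such as search_newest_first($db, 'sales', 10) would then stand in for the set_sorting(2, 1) query in the earlier snippet.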