William Crawford
2010-Jun-24 11:55 UTC
[Xapian-discuss] Quickest way to retrieve data for a large match set?
We're using the Perl binding to access Xapian in a simple search of image metadata (title and keywords). Due to the specification for the search engine, by default we have to sort the results using a function of the search rank, age (well, newness) and popularity (rated by sales of the image). As a result, we have to fetch the complete result set and then calculate a new ranking based on the original rank, perturbed using the ratios of each of the newness and popularity to the highest values in the result set (i.e. there is no way to precalculate these at indexing time, alas).

Currently fetching the document data for the results has become something of a bottleneck (typical searches may generate 50 - 500 matches, but some return more than 5000).

Code is something like:

    ...
    print STDERR "Query = ", $q->get_description, "\n" if $self->debug;
    my $e = $self->index->enquire ($q);
    #my $hits = $e->get_mset(0, $self->index->get_doccount, $self->index->get_doccount);
    my (@hits) = $e->matches (0, $self->index->get_doccount, $self->index->get_doccount);
    my (@results) = map +thaw($_->get_document->get_data), @hits;
    return \@results;
    }

I'd like to know if there's anything I can do to improve the speed of fetching the results (in other words, am I doing it wrong)?
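For reference, the re-ranking step described above might be sketched roughly as follows. The exact formula is not given in the post, so everything here is an assumption for illustration: the field names (`weight`, `newness`, `popularity`), the use of the relevance weight rather than the rank position, the multiplicative boost, and the two tuning factors are all hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Re-rank thawed results by boosting each relevance weight with the ratio of
# its newness and popularity to the largest values in this result set.
# Field names and the boost formula are hypothetical, not from the post.
sub rerank {
    my ($results, %opt) = @_;
    my $w_new = $opt{newness_factor}    // 0.5;   # assumed tuning factor
    my $w_pop = $opt{popularity_factor} // 0.5;   # assumed tuning factor
    return [] unless @$results;

    # The maxima depend on the result set, so they cannot be precomputed
    # at indexing time. Fall back to 1 to avoid dividing by zero.
    my $max_new = max(map { $_->{newness} }    @$results) || 1;
    my $max_pop = max(map { $_->{popularity} } @$results) || 1;

    my @scored = map {
        my $boost = 1 + $w_new * $_->{newness}    / $max_new
                      + $w_pop * $_->{popularity} / $max_pop;
        +{ %$_, score => $_->{weight} * $boost };
    } @$results;

    # Highest combined score first.
    return [ sort { $b->{score} <=> $a->{score} } @scored ];
}
```

Because both maxima are taken over the current result set, a sketch like this needs every match in hand before it can score any of them, which is why the whole set has to be fetched.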
Olly Betts
2010-Jun-25 12:51 UTC
[Xapian-discuss] Quickest way to retrieve data for a large match set?
On Thu, Jun 24, 2010 at 12:55:09PM +0100, William Crawford wrote:
> We're using the Perl binding to access Xapian in a simple search of image
> metadata (title and keywords). Due to the specification for the search engine,
> by default we have to sort the results using a function of the search rank,
> age (well, newness) and popularity (rated by sales of the image). As a result,
> we have to fetch the complete result set and then calculate a new ranking
> based on the original rank, perturbed using the ratios of each of the newness
> and popularity to the highest values in the result set (i.e. there is no way
> to precalculate these at indexing time, alas).

It would be more efficient to tell Xapian about the contributions from age and popularity so it can produce the ranking you actually want. You can do this in Xapian 1.2 by subclassing Xapian::PostingSource which allows extra weight contributions to be added in dynamically, but this isn't yet wrapped for use from Perl.

> Currently fetching the document data for the results has become something of a
> bottleneck (typical searches may generate 50 - 500 matches, but some return
> more than 5000).
>
> Code is something like:
>
> ...
> print STDERR "Query = ", $q->get_description, "\n" if $self->debug;
> my $e = $self->index->enquire ($q);
> #my $hits = $e->get_mset(0, $self->index->get_doccount, $self->index->get_doccount);
> my (@hits) = $e->matches (0, $self->index->get_doccount, $self->index->get_doccount);
> my (@results) = map +thaw($_->get_document->get_data), @hits;
> return \@results;
> }
>
> I'd like to know if there's anything I can do to improve the speed of fetching
> the results (in other words, am I doing it wrong)?

I wouldn't recommend this approach since by insisting on Xapian finding all the matches, you're hampering the optimisations it can use. That's why we added Xapian::PostingSource - it allows you to perform the equivalent of many post-processing tricks during the match.
But right now, that doesn't help if you want to use Perl.

If you're just putting an external numeric id in the document data, you could use that as the document id for Xapian instead, which would avoid the need to call get_document() and get_data() for every match. Or if you are looking up the popularity in an external database, then you could key that lookup on Xapian's docid.

Cheers,
    Olly
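The docid suggestion above might look something like the following in Perl. This is only a sketch under stated assumptions: `hits_to_records` and the `$metadata_for` lookup (a plain hash standing in for the external database, keyed on docid) are hypothetical names. The only Xapian calls it relies on are `get_docid` and `get_weight` on each match object, both of which are answered from the match set without loading the document.

```perl
use strict;
use warnings;

# Build result records from match objects using only the docid and weight,
# so get_document()/get_data() are never called.
# $metadata_for is a hypothetical docid => record lookup standing in for
# the external database.
sub hits_to_records {
    my ($hits, $metadata_for) = @_;
    my @records;
    for my $hit (@$hits) {
        my $docid = $hit->get_docid;
        # Skip any docid that has no record in the external store.
        my $meta = $metadata_for->{$docid} or next;
        push @records, { %$meta, docid => $docid, weight => $hit->get_weight };
    }
    return \@records;
}
```

With the code from the original post this would be called as `hits_to_records(\@hits, \%metadata_for)` in place of the `map +thaw(...)` line. For the docids to line up with the external ids, the indexer would need to store each document under its external id, for example with `replace_document($external_id, $doc)` on the writable database rather than `add_document($doc)`.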