thr3ads.net - Xapian discuss - [Xapian-discuss] Quickest way to retrieve data for a large match set? [Jun 2010]

William Crawford

2010-Jun-24 11:55 UTC

[Xapian-discuss] Quickest way to retrieve data for a large match set?

We're using the Perl binding to access Xapian in a simple search of image 
metadata (title and keywords). Due to the specification for the search engine, 
by default we have to sort the results using a function of the search rank, 
age (well, newness) and popularity (rated by sales of the image). As a result, 
we have to fetch the complete result set and then calculate a new ranking 
based on the original rank, perturbed using the ratios of each of the newness 
and popularity to the highest values in the result set (i.e. there is no way 
to precalculate these at indexing time, alas).

Currently fetching the document data for the results has become something of a 
bottleneck (typical searches may generate 50 - 500 matches, but some return 
more than 5000).

Code is something like:

...
    print STDERR "Query = ", $q->get_description, "\n" if $self->debug;
    my $e = $self->index->enquire ($q);
    #my $hits = $e->get_mset(0, $self->index->get_doccount, $self->index->get_doccount);
    my (@hits) = $e->matches (0, $self->index->get_doccount, $self->index->get_doccount);
    my (@results) = map +thaw($_->get_document->get_data), @hits;
    return \@results;
}

I'd like to know if there's anything I can do to improve the speed of fetching
the results (in other words, am I doing it wrong)?

Olly Betts

2010-Jun-25 12:51 UTC

On Thu, Jun 24, 2010 at 12:55:09PM +0100, William Crawford wrote:
> We're using the Perl binding to access Xapian in a simple search of image
> metadata (title and keywords). Due to the specification for the search engine,
> by default we have to sort the results using a function of the search rank,
> age (well, newness) and popularity (rated by sales of the image). As a result,
> we have to fetch the complete result set and then calculate a new ranking
> based on the original rank, perturbed using the ratios of each of the newness
> and popularity to the highest values in the result set (i.e. there is no way
> to precalculate these at indexing time, alas).

It would be more efficient to tell Xapian about the contributions from age and
popularity so it can produce the ranking you actually want.

You can do this in Xapian 1.2 by subclassing Xapian::PostingSource which allows
extra weight contributions to be added in dynamically, but this isn't yet
wrapped for use from Perl.

> Currently fetching the document data for the results has become something of a
> bottleneck (typical searches may generate 50 - 500 matches, but some return
> more than 5000).
>
> Code is something like:
>
> ...
>     print STDERR "Query = ", $q->get_description, "\n" if $self->debug;
>     my $e = $self->index->enquire ($q);
>     #my $hits = $e->get_mset(0, $self->index->get_doccount, $self->index->get_doccount);
>     my (@hits) = $e->matches (0, $self->index->get_doccount, $self->index->get_doccount);
>     my (@results) = map +thaw($_->get_document->get_data), @hits;
>     return \@results;
> }
>
> I'd like to know if there's anything I can do to improve the speed of fetching
> the results (in other words, am I doing it wrong)?

I wouldn't recommend this approach since by insisting on Xapian finding
all the matches, you're hampering the optimisations it can use.  That's
why we added Xapian::PostingSource - it allows you to perform the
equivalent of many post-processing tricks during the match.

But right now, that doesn't help if you want to use Perl.

If you're just putting an external numeric id in the document data, you
could use that as the document id for Xapian instead, which would avoid
the need to call get_document() and get_data() for every match.  Or
if you are looking up the popularity in an external database, then you
could key that lookup on Xapian's docid.

Cheers,
    Olly
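
As an illustration of the last suggestion above (keying an external lookup on Xapian's docid), here is a minimal sketch that re-ranks matches without fetching any document data. It is not code from the thread: the hit list is mocked as [docid, weight] pairs, standing in for what get_docid and get_weight would give for each match, and the %meta table, the rerank name, and the blending formula are all hypothetical, since the thread does not state the actual ranking function.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical external table keyed on Xapian docid (in practice this
# might be a database table or an in-memory cache).
my %meta = (
    1 => { sales => 0,   newness => 1  },
    2 => { sales => 50,  newness => 10 },
    3 => { sales => 100, newness => 5  },
);

# Re-rank matches using only docid and weight.
# $hits: array ref of [docid, weight] pairs; $meta: hash ref as above.
# Returns an array ref of [docid, score] pairs, best first.
sub rerank {
    my ($hits, $meta) = @_;
    return [] unless @$hits;

    # Highest values within this result set; "|| 1" avoids dividing by zero.
    my $max_sales   = max(map { $meta->{ $_->[0] }{sales} }   @$hits) || 1;
    my $max_newness = max(map { $meta->{ $_->[0] }{newness} } @$hits) || 1;

    my @scored = map {
        my ($docid, $weight) = @$_;
        my $m = $meta->{$docid};
        # Assumed blend: perturb the weight by each ratio to the set maximum.
        [ $docid,
          $weight * (1 + $m->{sales} / $max_sales
                       + $m->{newness} / $max_newness) ];
    } @$hits;

    return [ sort { $b->[1] <=> $a->[1] } @scored ];
}

# Usage with mocked matches (docid, weight).
my $ranked = rerank([ [1, 2.0], [2, 1.8], [3, 1.0] ], \%meta);
print join(", ", map { "$_->[0]:$_->[1]" } @$ranked), "\n";
```

Because the per-match work touches only the docid and the weight, the cost of thawing the stored data for thousands of matches is avoided; document data would then only need to be fetched for the page of results actually displayed.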
