thr3ads.net - Xapian discuss - [Xapian-discuss] what is the fastest way to fetch results which are sorted by timestamp ? [Aug 2011]

makao009

2011-Aug-09 16:48 UTC

[Xapian-discuss] what is the fastest way to fetch results which are sorted by timestamp ?

What is the fastest way to fetch results which are sorted by timestamp?

I want to use Xapian as my search engine. When indexing, I call
add_boolean_term(something) and
add_value(0, sortable_serialise(get_timestamp())) on each document.

When searching, I use enquire.set_weighting_scheme(xapian.BoolWeight()) and
enquire.set_sort_by_value(0, True) to ensure that the results are sorted by
timestamp.

This method works, but is there a faster way to do it? I have millions
of records.

Richard Boulton

2011-Aug-09 17:04 UTC

On 9 August 2011 17:48, makao009 <makao009 at 126.com> wrote:
> what is the fastest way to fetch results which are sorted by timestamp ?

The fastest possible way is to have your index sorted by timestamp
(ie, such that document IDs increase as the timestamp increases).
That way, the search can stop as soon as sufficient matches have been
found.  It can be very awkward to get an index in such order though,
particularly in the face of updates, assuming that you want the sort
order to show most recent first.

> i want to use xapian as my search engine , use add_boolean_term(something)
> and add_value(0,sortable_serialise(get_timestamp())) to a doc.
> search through enquire.set_weighting_scheme(xapian.BoolWeight()) and
> enquire.set_sort_by_value(0,True) to ensure that the results are sorted by the
> timestamp.

That's another approach, certainly.

> This method is ok , but is there a faster way to do that ? Since i have
> millions of records .

Sorting the database, or some variant of that, is the way to get
really fast sorted results.
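
To illustrate why a docid-ordered index allows an early stop, here is a toy model in plain Python. It is not Xapian's matcher: the posting list is just an ascending list of docids, and the function names and the `timestamps` dict are assumptions made up for this sketch. When docids were assigned in timestamp order, the newest matches are simply the last entries of the posting list, whereas sorting by a stored value has to look at the value of every match.

```python
from itertools import islice


def newest_by_docid(postlist, limit):
    """Newest `limit` matches when docid order equals timestamp order.

    `postlist` is an ascending list of matching docids. Only `limit`
    entries are touched, however long the posting list is.
    """
    return list(islice(reversed(postlist), limit))


def newest_by_value(postlist, timestamps, limit):
    """Newest `limit` matches when order is only known from a stored value.

    Every matching docid has its timestamp looked up before the cut.
    """
    ranked = sorted(postlist, key=lambda docid: timestamps[docid], reverse=True)
    return ranked[:limit]
```

Both functions return the same docids when the index really is sorted; the difference is only in how much work is needed to get there.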

There's a variation I experimented with using Xappy, involving sorting
as much of the database as possible, keeping track of the range of
document IDs for which the values were sorted, and using a custom
PostingSource to take advantage of that knowledge to skip past the
document IDs which were known to be at too low a value.  This worked
pretty well (not quite as fast as using a fully sorted database), but
is quite fiddly to maintain the ordering (and you need to use a custom
PostingSource, so if you're using one of the language bindings, you'd
need to compile your own custom Xapian).
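
The partially sorted idea described above can be modelled roughly as follows. This is a stdlib-only sketch, not the Xappy or PostingSource code: `values`, `sorted_upto` and the function name are hypothetical, and the "database" is a plain list. Docids `1..sorted_upto` are assumed to be in ascending value order, so once the result set is full, the low end of that range can be skipped as soon as one document fails to beat the current cut-off.

```python
import heapq


def top_n_partially_sorted(values, sorted_upto, n):
    """Return (docids of the n largest values, number of values examined).

    values[i] is the sort value of docid i + 1. Docids 1..sorted_upto are
    known to be in ascending value order; later docids are unordered.
    """
    if n <= 0:
        return [], 0
    best = []      # min-heap of (value, docid); best[0] is the cut-off
    examined = 0

    def offer(docid):
        # Try to place docid in the result; report whether it got in.
        item = (values[docid - 1], docid)
        if len(best) < n:
            heapq.heappush(best, item)
            return True
        if item[0] > best[0][0]:
            heapq.heapreplace(best, item)
            return True
        return False

    # Unsorted tail: nothing is known, so every value is examined.
    for docid in range(sorted_upto + 1, len(values) + 1):
        examined += 1
        offer(docid)

    # Sorted range: walk down from the highest docid and stop at the first
    # document that cannot enter the result; everything below it is lower.
    for docid in range(sorted_upto, 0, -1):
        examined += 1
        if not offer(docid):
            break

    ranked = sorted(best, reverse=True)
    return [docid for _, docid in ranked], examined
```

The larger the sorted range is relative to the unsorted tail, the fewer values need to be examined, which matches the observation that this is not quite as fast as a fully sorted database.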

-- 
Richard

Tim Brody

2011-Aug-10 10:39 UTC

Hi,

In terms of the enquiry, do you mean this?

    enquire.set_weighting_scheme(Xapian::BoolWeight());
    enquire.set_docid_order(Xapian::Enquire::DESCENDING);

What's the most efficient process to build multiple Xapian indexes? Can
the "relevance" index provide any hints to building the sorted
indexes?

Cheers,
Tim.


Tim Brody

2011-Aug-11 15:29 UTC

On Thu, 2011-08-11 at 12:17 +0100, Richard Boulton wrote:
> On 11 August 2011 11:18, Henry C. <henka at cityweb.co.za> wrote:
> > It's a real pity xapian-compact doesn't have a --sort-by-value argument to
> > perform post-indexing basic sorting of some kind.
> >
> > Is something like this even possible (by that I mean a change to
> > xapian-compact code)?
>
> It's not really possible to do sorting (or other reordering of docids)
> during the process that xapian-compact performs; it's working at a
> lower level than that, stitching chunks of postlists together without
> actually interpreting their contents.
> Other than that being implemented, to sort the database you really
> need to work at pretty much the level of the xapian database API; ie,
> implement something more like the copydatabase tool, which copies
> documents in the new order.  I've written code (in python) to sort
> databases using this method in the past - which worked ok for a few
> million documents, but isn't particularly efficient.  I don't have the
> rights to distribute that code, but it was pretty simple.  If I
> remember correctly, it pulled the values to sort by into a numpy
> array, and used one of numpy's functions to produce a mapping from the
> old docid to the new docid, and then just ran through the old database
> reading documents and writing them to the new database in the correct
> position.

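
The mapping step described in the quoted message can be sketched like this. It is not the original code (which was not distributed): the standard library's `sorted` stands in for the numpy function, plain dicts stand in for the old and new databases, and both function names are hypothetical.

```python
def build_docid_mapping(sort_values):
    """Map old docids to new docids that ascend with the sort value.

    `sort_values` maps old docid -> sortable value (e.g. a timestamp).
    New docids run from 1 to N; ties keep their old relative order.
    """
    ordered = sorted(sort_values, key=lambda docid: (sort_values[docid], docid))
    return {old: new for new, old in enumerate(ordered, start=1)}


def copy_sorted(old_db, mapping):
    """Walk the old 'database' and write each document at its new docid."""
    new_db = {}
    for old_docid, document in old_db.items():
        new_db[mapping[old_docid]] = document
    return new_db
```

With the real API the copy step would presumably read each document from the old database and write it with `WritableDatabase.replace_document(new_docid, doc)`; that part is not exercised here.
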
Out of curiosity, if you left a gap between every docid will Xapian
maintain an efficient index if you re-insert documents?
e.g.

a. 10 - 2010-05-01
b. 20 - 2010-06-01
c. 30 - 2010-07-01

Then at a later date you re-index c. as 2010-05-15 by giving it an
intermediate docid:
a. 10 - 2010-05-01
c. 15 - 2010-05-15
b. 20 - 2010-06-01

So maintaining a sorted index becomes an exercise in defragmenting
rather than building an entire new DB whenever a document's ranking
increases?
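
The bookkeeping side of the gapped-docid idea can be sketched in plain Python. This says nothing about how efficiently Xapian stores such an index, which is the open question; it only shows how docids could be chosen and when a renumbering ("defragmenting") pass would become necessary. The class name, the `GAP` constant and the use of ISO date strings as sort keys are assumptions for illustration.

```python
import bisect

GAP = 10  # hypothetical spacing between freshly appended docids


class GappedDocids:
    """Keep docids in the same order as timestamps, leaving gaps for inserts."""

    def __init__(self):
        self._stamps = []   # ascending timestamps
        self._docids = []   # docid for the timestamp at the same position

    def insert(self, stamp):
        """Pick a docid that sorts where `stamp` sorts, and record it."""
        pos = bisect.bisect_right(self._stamps, stamp)
        lo = self._docids[pos - 1] if pos > 0 else 0
        if pos == len(self._docids):
            docid = lo + GAP            # newest so far: append with a gap
        else:
            hi = self._docids[pos]
            if hi - lo < 2:
                raise RuntimeError(
                    "no free docid between %d and %d; renumbering needed" % (lo, hi))
            docid = (lo + hi) // 2      # take the middle of the gap
        self._stamps.insert(pos, stamp)
        self._docids.insert(pos, docid)
        return docid

    def remove(self, docid):
        """Forget a docid, e.g. before re-indexing it with a new timestamp."""
        pos = self._docids.index(docid)
        del self._stamps[pos]
        del self._docids[pos]
```

Re-indexing document c from the example above means calling `remove(30)` and then `insert("2010-05-15")`, which yields docid 15. Each insert into the same gap halves it, so a gap of 10 is exhausted after a few inserts and the `RuntimeError` signals that a renumbering pass is due.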

An annoying problem only solvable by massive duplication (although my
inexpert view would be ASC/DESC should be doable on a single index???).

Cheers,
Tim.
