makao009
2011-Aug-09 16:48 UTC
[Xapian-discuss] what is the fastest way to fetch results which are sorted by timestamp ?
What is the fastest way to fetch results sorted by timestamp? I want to use Xapian as my search engine. For each document I call add_boolean_term(something) and add_value(0, sortable_serialise(get_timestamp())). I then search with enquire.set_weighting_scheme(xapian.BoolWeight()) and enquire.set_sort_by_value(0, True) to ensure that the results are sorted by timestamp. This method works, but is there a faster way to do it? I have millions of records.
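A note on why the value is passed through sortable_serialise(): Xapian compares value slots as byte strings, so the stored form has to sort bytewise in the same order as the numbers. The sketch below illustrates that idea only. The fixed-width struct.pack encoding is a stand-in chosen for illustration and is not Xapian's actual serialisation format; the function names are hypothetical.

```python
import struct


def naive_encode(ts):
    """Store the timestamp as decimal text (bytewise order can disagree with numeric order)."""
    return str(ts).encode("ascii")


def order_preserving_encode(ts):
    """Stand-in for an order-preserving encoding (NOT Xapian's real format).

    A big-endian unsigned 64-bit integer compares bytewise in the same
    order as the non-negative integer it encodes.
    """
    return struct.pack(">Q", ts)


# Two timestamps with a different number of decimal digits.
timestamps = [999_999_999, 1_312_908_480]

# Sorting by the decimal text puts the larger number first, which is wrong.
sorted_naive = sorted(timestamps, key=naive_encode)

# Sorting by the fixed-width encoding matches numeric order.
sorted_fixed = sorted(timestamps, key=order_preserving_encode)

print(sorted_naive)
print(sorted_fixed)
```

In real code the Xapian helper should be used as in the post above; the point is only that the encoded bytes, not the original numbers, are what get compared.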
Richard Boulton
2011-Aug-09 17:04 UTC
[Xapian-discuss] what is the fastest way to fetch results which are sorted by timestamp ?
On 9 August 2011 17:48, makao009 <makao009 at 126.com> wrote:
> what is the fastest way to fetch results which are sorted by timestamp ?

The fastest possible way is to have your index sorted by timestamp (ie, such that document IDs increase as the timestamp increases). That way, the search can stop as soon as sufficient matches have been found. It can be very awkward to get an index in such order though, particularly in the face of updates, assuming that you want the sort order to show most recent first.

> i want to use xapian as my search engine , use add_boolean_term(something) and add_value(0,sortable_serialise(get_timestamp())) to a doc.
> search through enquire.set_weighting_scheme(xapian.BoolWeight()) and enquire.set_sort_by_value(0,True) to ensure that the results are sorted by the timestamp.

That's another approach, certainly.

> This method is ok , but is there a faster way to do that ? Since i have millions of records .

Sorting the database, or some variant of that, is the way to get really fast sorted results.

There's a variation I experimented with using Xappy, involving sorting as much of the database as possible, keeping track of the range of document IDs for which the values were sorted, and using a custom PostingSource to take advantage of that knowledge to skip past the document IDs which were known to be at too low a value. This worked pretty well (not quite as fast as using a fully sorted database), but is quite fiddly to maintain the ordering (and you need to use a custom PostingSource, so if you're using one of the language bindings, you'd need to compile your own custom Xapian).

--
Richard
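To make the "stop as soon as sufficient matches have been found" point concrete, here is a minimal pure-Python sketch. It does not use Xapian; a sorted list of docids stands in for a term's posting list, and the function names and toy data are assumptions made for illustration. It compares how many postings each approach has to look at to return the five newest matches.

```python
from itertools import islice


def top_k_by_value(postings, values, k):
    """Sort-by-value approach: every match is examined before the top k are known."""
    examined = len(postings)
    newest = sorted(postings, key=lambda docid: values[docid], reverse=True)[:k]
    return newest, examined


def top_k_by_docid(postings, k):
    """Docid-ordered approach: walk from the highest docid and stop after k matches.

    Only valid when docids increase with the timestamp.
    """
    newest = list(islice(reversed(postings), k))
    return newest, len(newest)


# Toy index: docid n carries timestamp 1000 + n, so docid order equals time order.
values = {docid: 1000 + docid for docid in range(1, 1001)}

# Stand-in posting list: ascending docids of the documents matching some term.
postings = [docid for docid in range(1, 1001) if docid % 3 == 0]

by_value, examined_value = top_k_by_value(postings, values, 5)
by_docid, examined_docid = top_k_by_docid(postings, 5)

print(by_value, examined_value)
print(by_docid, examined_docid)
```

Both functions return the same documents, but the docid-ordered walk touches only as many postings as there are results requested, which is where the speed-up on millions of records would come from.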
Tim Brody
2011-Aug-10 10:39 UTC
[Xapian-discuss] what is the fastest way to fetch results which are sorted by timestamp ?
Hi,

In terms of the enquiry, do you mean this?:

    set_weighting_scheme(Xapian::BoolWeight());
    set_docid_order(Xapian::Enquire::DESCENDING);

What's the most efficient process to build multiple Xapian indexes? Can the "relevance" index provide any hints to building the sorted indexes?

Cheers,
Tim.

On Tue, 2011-08-09 at 18:04 +0100, Richard Boulton wrote:
> The fastest possible way is to have your index sorted by timestamp
> (ie, such that document IDs increase as the timestamp increases).
> That way, the search can stop as soon as sufficient matches have been
> found.
> [...]
Tim Brody
2011-Aug-11 15:29 UTC
[Xapian-discuss] what is the fastest way to fetch results which are sorted by timestamp ?
On Thu, 2011-08-11 at 12:17 +0100, Richard Boulton wrote:
> On 11 August 2011 11:18, Henry C. <henka at cityweb.co.za> wrote:
> > It's a real pity xapian-compact doesn't have a --sort-by-value argument to
> > perform post-indexing basic sorting of some kind.
> >
> > Is something like this even possible (by that I mean a change to
> > xapian-compact code)?
>
> It's not really possible to do sorting (or other reordering of docids)
> during the process that xapian-compact performs; it's working at a
> lower level than that, stitching chunks of postlists together without
> actually interpreting their contents.
>
> Other than that being implemented, to sort the database you really
> need to work at pretty much the level of the xapian database API; ie,
> implement something more like the copydatabase tool, which copies
> documents in the new order. I've written code (in python) to sort
> databases using this method in the past - which worked ok for a few
> million documents, but isn't particularly efficient. I don't have the
> rights to distribute that code, but it was pretty simple. If I
> remember correctly, it pulled the values to sort by into a numpy
> array, and used one of numpy's functions to produce a mapping from the
> old docid to the new docid, and then just ran through the old database
> reading documents and writing them to the new database in the correct
> position.

Out of curiosity, if you left a gap between every docid, will Xapian maintain an efficient index if you re-insert documents? e.g.

a. 10 - 2010-05-01
b. 20 - 2010-06-01
c. 30 - 2010-07-01

Then at a later date you re-index c. as 2010-05-15 by giving it an intermediate docid:

a. 10 - 2010-05-01
c. 15 - 2010-05-15
b. 20 - 2010-06-01

So maintaining a sorted index becomes an exercise in defragmenting rather than building an entire new DB whenever a document's ranking increases?

An annoying problem only solvable by massive duplication (although in my inexpert view ASC/DESC should be doable on a single index?).

Cheers,
Tim.
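The two ideas in this message, a mapping from old docid to new docid and gaps left for later re-insertion, can be sketched together in a few lines of plain Python. This is not Richard's code, which is unavailable and reportedly used numpy; Python's built-in sorted() is swapped in here, and the function names, the stride of 10 and the sample data are all assumptions for illustration.

```python
def build_docid_map(sort_values, stride=10):
    """Map old docid -> new docid so that new docids increase with the sort value.

    sort_values maps each old docid to its sort key (here an ISO date string).
    Ties are broken by old docid so the result is deterministic.
    """
    ordered = sorted(sort_values, key=lambda docid: (sort_values[docid], docid))
    return {old: (rank + 1) * stride for rank, old in enumerate(ordered)}


def docid_between(lower, upper):
    """Pick an unused docid strictly between two neighbours.

    Returns None when the gap is exhausted, which is the point at which
    the index would need "defragmenting" (renumbering).
    """
    if upper - lower < 2:
        return None
    return (lower + upper) // 2


# Old database: docids in arbitrary order relative to the dates.
dates = {1: "2010-06-01", 2: "2010-05-01", 3: "2010-07-01"}
docid_map = build_docid_map(dates)

# A document dated 2010-05-15 belongs between new docids 10 and 20.
reinserted = docid_between(10, 20)

print(docid_map)
print(reinserted)
```

With Xapian itself, the copy step would then read each document from the old database and write it with WritableDatabase.replace_document(new_docid, document). Whether the resulting sparse docid space stays efficient is exactly the open question raised above, so treat the gap scheme as an idea to measure rather than a recommendation.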