Hi,

I have had reports that Recoll has become unbearably slow in some instances. After inquiry, this happens with Xapian 1.4 only, and the part which does not work any more is the snippets extraction.

Recoll builds snippets by partially reconstructing documents out of index contents. For this, after determining a set of document term positions to be displayed (around the hopefully interesting hits), it walks the document term list and, for each term, walks its position list looking for matches with the target positions (there is no other way that I know of to determine the term at a given position).

This always used to be very fast with Xapian 1.2. I do understand that it is a very intensive operation, but performance was never an issue for displaying a typical screen of 8-15 document abstracts. With Xapian 1.4 this operation has become unbearably slow in some cases, especially when a document has many terms. The specific operation which has become slow is opening many term position lists, each quite short.

In a quite typical example, the abstract generation time has gone from 100 ms to 90 s (SECONDS), on a cold index. Users don't like it, they think that the application is dead, and this is what has triggered the reports.

The TL;DR is that Recoll is unusable with Xapian 1.4.

I don't know why I had not seen it earlier, probably because I always work with warm indexes; this is an I/O issue.

Any idea how I can work around this?

J.F. Dockes
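[For concreteness, the reconstruction loop described above looks roughly like this against the public Xapian API. This is a minimal sketch; the function and variable names are invented for illustration and this is not Recoll's actual code.]

    // Sketch: find which term occurs at each wanted position, by
    // walking the document term list and each term's position list.
    #include <xapian.h>
    #include <map>
    #include <set>
    #include <string>

    std::map<Xapian::termpos, std::string>
    terms_at_positions(const Xapian::Database& db, Xapian::docid did,
                       const std::set<Xapian::termpos>& wanted)
    {
        std::map<Xapian::termpos, std::string> found;
        // Walk the document's term list...
        for (auto t = db.termlist_begin(did); t != db.termlist_end(did); ++t) {
            // ...and, for each term, open and walk its (typically short)
            // position list, looking for matches with the target positions.
            for (auto p = db.positionlist_begin(did, *t);
                 p != db.positionlist_end(did, *t); ++p) {
                if (wanted.count(*p))
                    found[*p] = *t;
            }
        }
        return found;
    }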
On Thu, Dec 07, 2017 at 10:29:09AM +0100, Jean-Francois Dockes wrote:
> Recoll builds snippets by partially reconstructing documents out of
> index contents.
> [...]
> The specific operation which has become slow is opening many term
> position lists, each quite short.

The difference will actually be chert vs glass, rather than 1.2 vs 1.4
as such (glass is the new backend in 1.4 and now the default).

This is a consequence of the change to the ordering within the position
list table. In chert, it was keyed on (documentid, term), so all the
position lists for a document were together - good spatial locality for
what you are doing. In glass, it is keyed on (term, documentid), so all
the position lists for a term are now together. This gives good spatial
locality for queries: a phrase search wants the positional data for a
small number of terms in (potentially) many documents, and the more
documents positional data is wanted for, the better the locality of
access. And indeed this change delivered a big improvement for
previously very slow phrase-search cases.

I'm not sure there's much we can do about this directly - changing the
order back would undo the speed-up to slow phrases. And while recreating
documents like this is a neat idea, making positional data faster for
how it's intended to be used seems more critical.

> In a quite typical example, the abstract generation time has gone from
> 100 ms to 90 s (SECONDS), on a cold index. Users don't like it, they
> think that the application is dead, and this is what has triggered the
> reports.
>
> The TL;DR is that Recoll is unusable with Xapian 1.4.
>
> I don't know why I had not seen it earlier, probably because I always
> work with warm indexes; this is an I/O issue.
>
> Any idea how I can work around this?

Some options:

* Explicitly create chert databases. This isn't a good long-term option
  (chert is already removed in git master), but it would ease the
  immediate user pain while you work on a better solution.

* Store the extracted text (e.g. in the document data, which will be
  compressed using zlib for you). That's more data to store, but a
  benefit is that you get capitalisation and punctuation. You can
  reasonably limit the amount stored, as the chances of finding a better
  snippet tail off for large documents.

* Store the extracted text, but compressed using the document's term
  list as a dictionary - could be with an existing algorithm which
  supports a dictionary (e.g. zlib) or a custom algorithm. Potentially
  gives better compression than the previous option, though large
  documents will benefit less. I'd suggest a quick prototype to see if
  it's worth the effort.

* If you're feeding the reconstructed text into something to dynamically
  select a snippet, then that could potentially be driven from a lazy
  merging of the positional data (e.g. a min-heap of PositionIterator;
  see the sketch after this message). If the snippet selector terminates
  once it has found a decent snippet, this would avoid having to decode
  all the positional data, but it would still have to read it to find
  the first position of every term, so it doesn't really address the
  data locality issue. It also would have to handle positional data for
  a lot of terms in parallel, and the larger working set may not fit in
  the CPU cache.
* Re-extract the text for documents you display, possibly with a caching
  layer if you have a high search load (I doubt a single-user search
  like recoll would need one). If you have some slow extractors (e.g.
  OCR) then you could store the text from those - perhaps deciding based
  on how long the extraction took at index time, with a user-tunable
  threshold trading extra disk usage against query-time speed. An added
  benefit is that you get to show the current version of the document,
  rather than the version that was last indexed. This seems like a good
  option for recoll.

Cheers,
    Olly
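[The lazy-merge option above could look something like the following. This is a minimal sketch against the public Xapian API; HeapEntry and stream_positions are invented names, not code from Xapian or Recoll.]

    // Sketch: stream (position, term) pairs in increasing position
    // order via a min-heap of PositionIterators, so a snippet selector
    // can stop early instead of decoding all positional data.
    #include <xapian.h>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <vector>

    struct HeapEntry {
        Xapian::termpos pos;          // current position of this iterator
        std::string term;
        Xapian::PositionIterator it;
    };

    struct ByPos {
        bool operator()(const HeapEntry& a, const HeapEntry& b) const {
            return a.pos > b.pos;     // inverted comparison -> min-heap
        }
    };

    void stream_positions(const Xapian::Database& db, Xapian::docid did)
    {
        std::priority_queue<HeapEntry, std::vector<HeapEntry>, ByPos> heap;

        // Priming the heap still opens every term's position list, so
        // the cold-cache locality problem remains; only full decoding
        // is avoided.
        for (auto t = db.termlist_begin(did); t != db.termlist_end(did); ++t) {
            auto p = db.positionlist_begin(did, *t);
            if (p != db.positionlist_end(did, *t))
                heap.push(HeapEntry{*p, *t, p});
        }

        while (!heap.empty()) {
            HeapEntry e = heap.top();
            heap.pop();
            // A real snippet selector would consume (e.pos, e.term)
            // here and break out once it has a good enough snippet.
            std::cout << e.pos << '\t' << e.term << '\n';
            if (++e.it != db.positionlist_end(did, e.term))
                heap.push(HeapEntry{*e.it, e.term, e.it});
        }
    }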
Olly Betts writes:
> On Thu, Dec 07, 2017 at 10:29:09AM +0100, Jean-Francois Dockes wrote:
> > Recoll builds snippets by partially reconstructing documents out of
> > index contents.
> > [...]
> > The specific operation which has become slow is opening many term
> > position lists, each quite short.
>
> The difference will actually be chert vs glass, rather than 1.2 vs 1.4
> as such (glass is the new backend in 1.4 and now the default).

I had sort of guessed this far :)

> [... physical ordering of the position list table changed ...]
>
> I'm not sure there's much we can do about this directly - changing the
> order back would undo the speed-up to slow phrases. And while
> recreating documents like this is a neat idea, making positional data
> faster for how it's intended to be used seems more critical.
>
> > Any idea how I can work around this?
>
> Some options:
>
> * Explicitly create chert databases. This isn't a good long-term option
>   (chert is already removed in git master), but it would ease the
>   immediate user pain while you work on a better solution.

This is the only really short-term solution: any other is weeks or
months away. Is the "stub database" feature the appropriate way to
create chert databases with Xapian 1.4?

Another possibility for me would be to decide that chert is good enough
and appropriate for Recoll, and bundle it together with the appropriate
Xapian parts.

Thanks for the other suggestions. I will think about how I could make
them work.

- Storing extracted text: users are already complaining about index
  sizes. Still, this looks like the most promising approach, and I
  already have code to more or less do it.

- Re-extracting at query time: the extraction time is not easily
  predictable: it depends not only on the document type, but also on its
  location (possibly inside an archive) and its exact nature (pdf
  extraction times vary widely, for example). Recoll needs to generate
  8-10 snippet sets to display a result page, and needing a significant
  fraction of a second to a few seconds to extract the text of a single
  document is nothing extraordinary, so this would result in very slow
  result display times (even for a 'single-user search'). A cache would
  not be very helpful for single/few-user usage.

Cheers,
jf

> * Store the extracted text (e.g. in the document data, which will be
>   compressed using zlib for you). That's more data to store, but a
>   benefit is that you get capitalisation and punctuation. You can
>   reasonably limit the amount stored, as the chances of finding a
>   better snippet tail off for large documents.
>
> * Store the extracted text, but compressed using the document's term
>   list as a dictionary - could be with an existing algorithm which
>   supports a dictionary (e.g. zlib) or a custom algorithm. Potentially
>   gives better compression than the previous option, though large
>   documents will benefit less. I'd suggest a quick prototype to see if
>   it's worth the effort.
>
> * If you're feeding the reconstructed text into something to
>   dynamically select a snippet, then that could potentially be driven
>   from a lazy merging of the positional data (e.g. a min-heap of
>   PositionIterator). If the snippet selector terminates once it has
>   found a decent snippet, this would avoid having to decode all the
>   positional data, but it would still have to read it to find the
>   first position of every term, so it doesn't really address the data
>   locality issue.
>   It also would have to handle positional data for a lot of terms in
>   parallel, and the larger working set may not fit in the CPU cache.
>
> * Re-extract the text for documents you display, possibly with a
>   caching layer if you have a high search load (I doubt a single-user
>   search like recoll would need one). If you have some slow extractors
>   (e.g. OCR) then you could store the text from those - perhaps
>   deciding based on how long the extraction took at index time, with a
>   user-tunable threshold trading extra disk usage against query-time
>   speed. An added benefit is that you get to show the current version
>   of the document, rather than the version that was last indexed. This
>   seems like a good option for recoll.
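[For reference on the chert option and the stub-database question above: with Xapian 1.4 a chert database can be created by passing the DB_BACKEND_CHERT flag at creation time; a stub database is a separate feature, a text file naming an existing database for query-time opens. A minimal sketch, with illustrative paths:]

    // Sketch: force the chert backend under Xapian 1.4 (which still
    // ships chert), then write a stub file naming the result.
    #include <xapian.h>
    #include <fstream>

    int main()
    {
        // DB_BACKEND_CHERT overrides the glass default when creating.
        Xapian::WritableDatabase db("/path/to/index",
                                    Xapian::DB_CREATE_OR_OPEN |
                                    Xapian::DB_BACKEND_CHERT);

        // A stub database is just a text file whose lines name real
        // databases; opening the stub with Xapian::Database then works
        // transparently, whatever the backend.
        std::ofstream stub("/path/to/index.stub");
        stub << "auto /path/to/index\n";
        return 0;
    }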