Olly Betts writes:
> On Thu, Dec 07, 2017 at 10:29:09AM +0100, Jean-Francois Dockes wrote:
> > Recoll builds snippets by partially reconstructing documents out of
> > index contents.
> > [...]
> > The specific operation which has become slow is opening many term
> > position lists, each quite short.
>
> The difference will actually be chert vs glass, rather than 1.2 vs 1.4
> as such (glass is the new backend in 1.4 and now the default).

I had sort of guessed this far :)

> [... physical ordering of the position list table changed ...]
>
> I'm not sure there's much we can do about this directly - changing the
> order back would undo the speed-up to slow phrases. And while
> recreating documents like this is a neat idea, making positional data
> faster for how it's intended to be used seems more critical.
>
> > Any idea how I can work around this?
>
> Some options:
>
> * Explicitly create chert databases. This isn't a good long-term
>   option (chert is already removed in git master) but would ease the
>   immediate user pain while you work on a better solution.

This is the only really short-term solution: any other is weeks or
months away. Is the "stub database" feature the appropriate way to
create Chert databases with Xapian 1.4?

Another possibility for me would be to decide that Chert is good
enough and appropriate for Recoll, and bundle it together with the
appropriate Xapian parts.

Thanks for the other suggestions. I will think about how I could make
them work.

- Storing extracted text: users are already complaining about index
  sizes. Still, this looks like the most promising approach, and I
  already have code to more or less do it.

- Re-extracting at query time: the extraction time is not easily
  predictable: it depends not only on the document type, but also on
  the document's location (possibly inside an archive) and its exact
  nature (PDF extraction times vary widely, for example). Recoll needs
  to generate 8-10 snippet sets to display a result page. Needing a
  significant fraction of a second to a few seconds for text
  extraction from a single document is nothing extraordinary, and
  would result in very slow result display times (even for a
  'single-user search'). A cache would not be very helpful for
  single/few-user usage.

Cheers,
jf

> * Store the extracted text (e.g. in the document data, which will be
>   compressed using zlib for you). That's more data to store, but a
>   benefit is you get capitalisation and punctuation. You can
>   reasonably limit the amount stored, as the chances of finding a
>   better snippet will tend to tail off for large documents.
>
> * Store the extracted text, but compressed using the document's term
>   list as a dictionary - could be with an existing algorithm which
>   supports a dictionary (e.g. zlib) or a custom algorithm.
>   Potentially gives better compression than the first option, though
>   large documents will benefit less. I'd suggest a quick prototype to
>   see if it's worth the effort.
>
> * If you're feeding the reconstructed text into something to
>   dynamically select a snippet, then that could potentially be driven
>   from a lazy merging of the positional data (e.g. a min heap of
>   PositionIterator). If the snippet selector terminates once it's
>   found a decent snippet then this would avoid having to decode all
>   the positional data, but it would have to read it to find the first
>   position of every term, so it doesn't really address the data
>   locality issue. It also would have to handle positional data for a
>   lot of terms in parallel, and the larger working set may not fit in
>   the CPU cache.
>
> * Re-extract the text for documents you display, possibly with a
>   caching layer if you have a high search load (I doubt a single-user
>   search like recoll would need one). If you have some slow
>   extractors (e.g. OCR) then you could store text from those (perhaps
>   just store based on how long the extraction took at index time, and
>   users can tune that based on how much they care about extra disk
>   usage). An added benefit is that you get to show the current
>   version of the document, rather than the version that was last
>   indexed. This seems like a good option for recoll.
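As a rough illustration of the lazy-merge option above: the sketch
below walks the position lists of all query terms for one document in
a single ascending pass, pulling from each PositionIterator only on
demand via a min-heap, so a snippet selector consuming the stream can
stop early without decoding every list. Entry and merged_positions
are made-up names, and each term is assumed to actually occur in the
document.

#include <xapian.h>

#include <functional>
#include <queue>
#include <string>
#include <vector>

// One pending stream in the merge: the next position of one term.
struct Entry {
    Xapian::termpos pos;          // next position of this term
    size_t term_idx;              // index into the terms vector
    Xapian::PositionIterator it;  // current place in the position list
    Xapian::PositionIterator end; // end of the position list
    bool operator>(const Entry& o) const { return pos > o.pos; }
};

// Stream the positions of all given terms for one document in
// ascending order.  A snippet selector fed from this loop can break
// out as soon as it has found a good enough window, leaving the rest
// of the positional data undecoded.
void merged_positions(const Xapian::Database& db, Xapian::docid did,
                      const std::vector<std::string>& terms)
{
    std::priority_queue<Entry, std::vector<Entry>,
                        std::greater<Entry> > heap;
    for (size_t i = 0; i < terms.size(); ++i) {
        Xapian::PositionIterator it = db.positionlist_begin(did, terms[i]);
        Xapian::PositionIterator end = db.positionlist_end(did, terms[i]);
        if (it != end)
            heap.push(Entry{*it, i, it, end});
    }
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        // terms[e.term_idx] occurs at word position e.pos: feed this
        // to the snippet selector here, and stop once it's satisfied.
        if (++e.it != e.end) {
            e.pos = *e.it;
            heap.push(e);
        }
    }
}

The early exit saves decoding work, but, as the quoted message notes,
every list still has to be opened once to seed the heap, so the data
locality issue of the new table ordering remains.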
On Fri, Dec 08, 2017 at 11:08:00AM +0100, Jean-Francois Dockes wrote:
> This is the only really short-term solution: any other is weeks or
> months away. Is the "stub database" feature the appropriate way to
> create Chert databases with Xapian 1.4?

With 1.4 you can pass Xapian::DB_BACKEND_CHERT in the flags when
constructing the WritableDatabase object.

I noticed recently that this doesn't quite work as advertised in the
case when the database already exists but is not of the specified
type. It's meant to just open the database in that case (and ignore
the backend hint), but it actually seems to create a new database with
the specified backend in the same directory. I'll fix that, but
obviously that won't help with existing releases. You can try with
Xapian::DB_OPEN first, then Xapian::DB_BACKEND_CHERT if that fails,
though that's slightly racy. Not sure there's a better workaround
though.

> Another possibility for me would be to decide that Chert is good
> enough and appropriate for Recoll, and bundle it together with the
> appropriate Xapian parts.

That wouldn't be popular with distros packaging recoll - they'll want
to use their existing Xapian packages instead of a bundled code copy,
e.g. see:

https://wiki.debian.org/UpstreamGuide#No_inclusion_of_third_party_code
https://fedoraproject.org/wiki/Bundled_Libraries
https://wiki.gentoo.org/wiki/Why_not_bundle_dependencies#When_code_is_bundled.3F

It also means you wouldn't benefit from improvements in new Xapian
releases, and would end up having to maintain the old version you
picked yourself.

Cheers,
    Olly
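A minimal sketch of the two-step fallback Olly describes, using the
1.4 flag constants he names (open_or_create_chert is just an
illustrative helper name, and catching DatabaseOpeningError for the
"doesn't exist yet" case is an assumption on my part):

#include <xapian.h>

#include <string>

// Open an existing index whatever its backend; only when that fails,
// create a new one forcing the chert backend.
Xapian::WritableDatabase open_or_create_chert(const std::string& path)
{
    try {
        return Xapian::WritableDatabase(path, Xapian::DB_OPEN);
    } catch (const Xapian::DatabaseOpeningError&) {
        // Racy, as noted above: another process could create the
        // database between these two constructor calls, in which
        // case DB_CREATE will refuse to touch it and throw.
        return Xapian::WritableDatabase(
            path, Xapian::DB_CREATE | Xapian::DB_BACKEND_CHERT);
    }
}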
Olly Betts writes:
> On Fri, Dec 08, 2017 at 11:08:00AM +0100, Jean-Francois Dockes wrote:
> > This is the only really short-term solution: any other is weeks or
> > months away. Is the "stub database" feature the appropriate way to
> > create Chert databases with Xapian 1.4?
>
> With 1.4 you can pass Xapian::DB_BACKEND_CHERT in the flags when
> constructing the WritableDatabase object.
>
> I noticed recently that this doesn't quite work as advertised in the
> case when the database already exists but is not of the specified
> type. It's meant to just open the database in that case (and ignore
> the backend hint), but it actually seems to create a new database
> with the specified backend in the same directory. I'll fix that, but
> obviously that won't help with existing releases. You can try with
> Xapian::DB_OPEN first, then Xapian::DB_BACKEND_CHERT if that fails,
> though that's slightly racy. Not sure there's a better workaround
> though.

I hadn't noticed that Xapian now had these db creation flags, so I am
using a stub file for creating a new index, and it seems to work fine.

> > Another possibility for me would be to decide that Chert is good
> > enough and appropriate for Recoll, and bundle it together with the
> > appropriate Xapian parts.
>
> That wouldn't be popular with distros packaging recoll - they'll want
> to use their existing Xapian packages instead of a bundled code copy,
> e.g. see:
>
> https://wiki.debian.org/UpstreamGuide#No_inclusion_of_third_party_code
> https://fedoraproject.org/wiki/Bundled_Libraries
> https://wiki.gentoo.org/wiki/Why_not_bundle_dependencies#When_code_is_bundled.3F
>
> It also means you wouldn't benefit from improvements in new Xapian
> releases, and would end up having to maintain the old version you
> picked yourself.

Thanks for the links about bundling!

Yes, distribution policies and maintenance are definitely problems for
this approach, which remains a possibility if it proves too hard or
too onerous (in terms of index size or query times) to do things
otherwise. Recoll already has a Fedora exemption for bundling code
from an old IMAP server package, and I think that I can make a
reasonable case for another exemption. I'd strip down the Xapian
source to keep only the backend and associated code (no need for the
query parser, Unicode, etc.), and probably link statically. The static
link part will be a return to what I did when Xapian itself was not
universally packaged :)

At worst, not being in the distributions is not the end of the world.
I am also not too worried about maintenance: the old index format has
worked largely flawlessly for quite some time, so it will mainly be a
question of fixing compiler compatibility issues from time to time.

The fact that I can consider doing this is a tribute to Xapian code
quality, by the way. Also, it might put me in a position to do
something about my old wish for Xapian query interruptibility...

Still, it is a last-resort option, no doubt. The priority is exploring
the impact of storing the document texts.

Cheers,
jf
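For reference, the stub file mentioned here is just a small text file
whose single line names a backend and the location of the real
database, along these lines (paths are hypothetical):

chert /home/me/.recoll/xapiandb.chert

Opening the stub path as a writable database then routes everything to
the chert database it names. A sketch, on the assumption that the
DB_CREATE_OR_OPEN action creates the underlying chert index on first
use, which matches the "seems to work fine" report above:

Xapian::WritableDatabase db("/home/me/.recoll/xapiandb",
        Xapian::DB_BACKEND_STUB | Xapian::DB_CREATE_OR_OPEN);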