On Sat, Mar 25, 2017 at 06:36:25PM -0500, Ryan Cross wrote:
> After upgrades my stack is now:
>
> Python 2.7
> Django 1.8
> Haystack 2.6.0
> Xapian 1.4.3 (latest xapian-haystack backend with some modifications)
>
> Using the same rebuild command as below but with --batch-size=50000
>
> The issue has now become one of performance. I am indexing 2.2 million
> documents. Using delve I can see that performance starts off at about
> 100,000 records an hour. This is consistent with the roughly 24 hour
> rebuild time I was experiencing with Xapian 1.2.21 (chert). However,
> after 75 hours of build time, the index is about 75% complete and
> records are processing at a rate of 10,000/hr. The index is 51GB in
> size, 30GB of which is position.glass.

One of the big differences between chert and glass is that glass stores
positional data in a different order, such that phrase searches are much
more I/O efficient. The downside is that this means extra work at index
time, and more data to batch up in memory. There's a thread discussing
this here:

https://lists.xapian.org/pipermail/xapian-discuss/2016-April/009368.html

> Here is a one minute strace summary
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  63.97    1.272902          13    100240           pread
>  33.71    0.670733          14     48175           pwrite

A one minute sample is hard to extrapolate from, as the indexing process
currently goes through phases of flushing changes, so whichever phase the
one minute is from isn't going to be representative.

But from the information you give, my guess is that the extra memory
used for batching up changes is pushing you over an I/O cliff, and
you would get better throughput by reducing the batch size (assuming
the "batch size" you specify maps to XAPIAN_FLUSH_THRESHOLD or something
equivalent). That's especially likely if you tuned that batch size for
chert.

There are longer-term plans to rework the batching and flush process,
which should improve matters a lot (and hopefully remove the need to
manually tweak such settings). I'm hoping that will land in the next
release series, so you could consider sticking with chert for 1.4.x,
assuming the problematic phrase-search cases aren't an issue for you.
There are various other improvements between chert and glass (better
tracking of free space, less on-disk overhead) which you'd lose out
on, though.

Cheers,
    Olly
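For anyone wanting to experiment with the batch size without going
through Haystack, here is a minimal sketch of controlling flushing from
the Python bindings directly. XAPIAN_FLUSH_THRESHOLD and
WritableDatabase.commit() are standard Xapian 1.4 API; the index path,
the threshold value, and the 'records' iterable are illustrative
placeholders, not details taken from this thread.

    import os
    import xapian

    # XAPIAN_FLUSH_THRESHOLD caps how many document changes are
    # buffered in memory before an automatic flush (default 10000).
    # It must be set before the database is opened.
    os.environ["XAPIAN_FLUSH_THRESHOLD"] = "10000"  # placeholder value

    db = xapian.WritableDatabase("indexdir", xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))

    # 'records' stands in for whatever yields the documents to index.
    records = ["first document text", "second document text"]

    for i, text in enumerate(records):
        doc = xapian.Document()
        doc.set_data(text)
        termgen.set_document(doc)
        termgen.index_text(text)
        db.add_document(doc)
        # Alternatively, control the batch by hand: commit() flushes
        # the buffered changes, so a smaller interval means smaller
        # batches (and less memory pressure at flush time).
        if (i + 1) % 10000 == 0:
            db.commit()

    db.commit()  # flush the final partial batch

Lowering either the environment variable or the explicit commit
interval trades some best-case speed for avoiding the I/O cliff Olly
describes above.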
On Sun, 2 Apr 2017, at 20:29, Olly Betts wrote:
> [...]
>
> There are longer-term plans to rework the batching and flush process,
> which should improve matters a lot (and hopefully remove the need to
> manually tweak such settings). [...]

The trick that FastMail/Cyrus IMAPd uses, of batching to smaller
indexes and then compacting a few of them together at once, may be
interesting as well. I have no idea how it performs on really massive
indexes though, because we index per user.

Bron.

--
  Bron Gondwana
  brong at fastmail.fm
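A rough sketch of the shard-and-merge pattern Bron describes, again
using the stock Python bindings. Database.compact() is real API
(available since Xapian 1.3.4); the shard paths are hypothetical, and
how documents get split across shards is left out.

    import xapian

    # Hypothetical shard directories, each produced by a separate
    # indexing pass over a slice of the document set.
    shards = ["shard.0", "shard.1", "shard.2"]

    # Open the shards together as one virtual database...
    combined = xapian.Database()
    for path in shards:
        combined.add_database(xapian.Database(path))

    # ...and merge them into a single compacted index. The shards are
    # only read here, so new shards can keep being written elsewhere.
    combined.compact("merged.index")

The same merge can also be done from the command line with the
xapian-compact tool, passing the shard directories as sources and the
merged index as the destination.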
Thanks for the information on the differences between chert and glass.
This explains the performance / index size changes I’m seeing. For the
time being chert on 1.4.3 is working, and I’ll keep my eye out for new
releases.

Thanks,
Ryan

> On Apr 2, 2017, at 6:29 PM, Olly Betts <olly at survex.com> wrote:
>
> [...]
On Sun, Apr 02, 2017 at 10:40:22PM -0500, Bron Gondwana wrote:
> The trick that FastMail/Cyrus IMAPd uses, of batching to smaller
> indexes and then compacting a few of them together at once, may be
> interesting as well. I have no idea how it performs on really massive
> indexes though, because we index per user.

Yes, that's a good approach for a large take-up, though it may be
harder to implement with a middle layer (xapian-haystack in this case)
than if you're using Xapian directly.

It should scale well - the old gmane search used this approach, and
indexed well over 100 million documents.

Cheers,
    Olly