On Sat, Mar 25, 2017 at 06:36:25PM -0500, Ryan Cross wrote:
> After upgrades my stack is now:
>
> Python 2.7
> Django 1.8
> Haystack 2.6.0
> Xapian 1.4.3 (latest xapian-haystack backend with some modifications)
>
> Using the same rebuild command as below but with --batch-size=50000
>
> The issue has now become one of performance. I am indexing 2.2 million
> documents. Using delve I can see that performance starts off at about
> 100,000 records an hour. This is consistent with the roughly 24 hour
> rebuild time I was experiencing with Xapian 1.2.21 (chert). However,
> after 75 hours of build time, the index is about 75% complete and
> records are processing at a rate of 10,000/hr. The index is 51GB in
> size, 30GB of which is position.glass.

One of the big differences between chert and glass is that glass stores
positional data in a different order, such that phrase searches are much
more I/O efficient. The downside is that this means extra work at index
time, and more data to batch up in memory. There's a thread discussing
this here:

https://lists.xapian.org/pipermail/xapian-discuss/2016-April/009368.html

> Here is a one minute strace summary
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  63.97    1.272902          13    100240           pread
>  33.71    0.670733          14     48175           pwrite

A one minute sample is hard to extrapolate from, as the indexing process
currently goes through phases of flushing changes, so whichever phase the
one minute is from isn't going to be representative.

But from the information you give, my guess is that the extra memory
used for batching up changes is pushing you over an I/O cliff, and
you would get better throughput by reducing the batch size (assuming
the "batch size" you specify maps to XAPIAN_FLUSH_THRESHOLD or something
equivalent). That's especially likely if you tuned that batch size for
chert.

There are longer-term plans to rework the batching and flush process,
which should improve matters a lot (and hopefully remove the need to
manually tweak such settings). I'm hoping that will land in the next
release series, so you could consider sticking with chert for 1.4.x,
assuming the problematic phrase-search cases aren't an issue for you.
There are various other improvements between chert and glass (better
tracking of free space, less on-disk overhead) which you'd lose out
on, though.

Cheers,
    Olly
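For anyone wanting to experiment with the batch size without going
through Haystack, here is a minimal sketch of controlling flushing from
the Python bindings directly. XAPIAN_FLUSH_THRESHOLD and
WritableDatabase.commit() are standard Xapian 1.4 API; the index path,
the threshold value, and the 'records' iterable are illustrative
placeholders, not details taken from this thread.

    import os
    import xapian

    # XAPIAN_FLUSH_THRESHOLD caps how many document changes are
    # buffered in memory before an automatic flush (default 10000).
    # It must be set before the database is opened.
    os.environ["XAPIAN_FLUSH_THRESHOLD"] = "10000"  # placeholder value

    db = xapian.WritableDatabase("indexdir", xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))

    # 'records' stands in for whatever yields the documents to index.
    records = ["first document text", "second document text"]

    for i, text in enumerate(records):
        doc = xapian.Document()
        doc.set_data(text)
        termgen.set_document(doc)
        termgen.index_text(text)
        db.add_document(doc)
        # Alternatively, control the batch by hand: commit() flushes
        # the buffered changes, so a smaller interval means smaller
        # batches (and less memory pressure at flush time).
        if (i + 1) % 10000 == 0:
            db.commit()

    db.commit()  # flush the final partial batch

Lowering either the environment variable or the explicit commit
interval trades some best-case speed for avoiding the I/O cliff Olly
describes above.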
On Sun, 2 Apr 2017, at 20:29, Olly Betts wrote:
> [...]
>
> There are longer-term plans to rework the batching and flush process,
> which should improve matters a lot (and hopefully remove the need to
> manually tweak such settings). [...]

The trick that FastMail/Cyrus IMAPd uses, of batching to smaller
indexes and then compacting a few of them together at once, may be
interesting as well. I have no idea how it performs on really massive
indexes though, because we index per user.

Bron.

--
  Bron Gondwana
  brong at fastmail.fm
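A rough sketch of the shard-and-merge pattern Bron describes, again
using the stock Python bindings. Database.compact() is real API
(available since Xapian 1.3.4); the shard paths are hypothetical, and
how documents get split across shards is left out.

    import xapian

    # Hypothetical shard directories, each produced by a separate
    # indexing pass over a slice of the document set.
    shards = ["shard.0", "shard.1", "shard.2"]

    # Open the shards together as one virtual database...
    combined = xapian.Database()
    for path in shards:
        combined.add_database(xapian.Database(path))

    # ...and merge them into a single compacted index. The shards are
    # only read here, so new shards can keep being written elsewhere.
    combined.compact("merged.index")

The same merge can also be done from the command line with the
xapian-compact tool, passing the shard directories as sources and the
merged index as the destination.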
Thanks for the information on the differences between chert and glass.
This explains the performance / index size changes I’m seeing. For the
time being chert on 1.4.3 is working, and I’ll keep my eye out for new
releases.

Thanks,
Ryan

> On Apr 2, 2017, at 6:29 PM, Olly Betts <olly at survex.com> wrote:
>
> [...]
On Sun, Apr 02, 2017 at 10:40:22PM -0500, Bron Gondwana wrote:
> The trick that FastMail/Cyrus IMAPd uses, of batching to smaller
> indexes and then compacting a few of them together at once, may be
> interesting as well. I have no idea how it performs on really massive
> indexes though, because we index per user.

Yes, that's a good approach for a large take-up, though it may be
harder to implement with a middle layer (xapian-haystack in this case)
than if you're using Xapian directly.

It should scale well - the old gmane search used this approach, and
indexed well over 100 million documents.

Cheers,
    Olly