Hi Olly,

Thanks for the detailed response. I hadn’t realized there was a new
xapian haystack backend. I’m going to try that, but I have some upgrades
to do first: Django 1.8, etc.

Thanks,
Ryan

> On Feb 28, 2017, at 3:40 PM, Olly Betts <olly at survex.com> wrote:
>
> On Mon, Feb 27, 2017 at 10:29:46AM -0800, Ryan Cross wrote:
>> I am trying to rebuild an index of 2+ million documents and have not
>> been successful. I am running:
>>
>> Python 2.7
>> Django 1.7
>> Haystack 2.1.1
>> Xapian 1.2.21
>>
>> The index rebuild command I’m using is:
>>
>> django-admin.py rebuild_index --noinput --batch-size=100000
>>
>> The rebuild completes but an immediate xapian-check returns this error:
> [...]
>> Trying the latest stable version, Xapian 1.4.3, it fails during the rebuild:
>>
>> All documents removed.
>> Indexing 2233651 messages
>> Traceback (most recent call last):
>> …
>>   File "/a/mailarch/current/haystack/management/commands/update_index.py", line 221, in handle_label
>>     self.update_backend(label, using)
>>   File "/a/mailarch/current/haystack/management/commands/update_index.py", line 266, in update_backend
>>     do_update(backend, index, qs, start, end, total, self.verbosity)
>>   File "/a/mailarch/current/haystack/management/commands/update_index.py", line 89, in do_update
>>     backend.update(index, current_qs)
>>   File "/a/mailarch/current/haystack/backends/xapian_backend.py", line 286, in update
>>     database.close()
>
> What's the version of xapian-haystack? There's not a database.close() anywhere
> near line 286 in git master:
>
> https://github.com/notanumber/xapian-haystack/blob/master/xapian_backend.py#L286
>
>> xapian.DatabaseCorruptError: Expected block 615203 to be level 0, not 1
>> docdata:
>> blocksize=8K items=380000 firstunused=21983 revision=38 levels=2 root=21410
>
> Is that the full output of xapian-check?
>
>> Any suggestions for how I could get more information to troubleshoot this
>> failure would be greatly appreciated.
>
> Is the data to reproduce this something you can make available?
>
> I'd stick with Xapian 1.4.3 for trying to narrow this down (if it's a Xapian
> bug we can backport the fix once identified).
>
> The error message means that a block which was expected to be at the leaf level
> was actually marked as being one level above, which suggests either there's an
> obscure bug in the backend code which only manifests in rare circumstances, or
> something is corrupting data (could be in memory or on disk).
>
> Since this happens with both 1.2.x and 1.4.x, I would tend to suspect it's
> something external (rather than a bug in Xapian), as the default backends in 1.2
> and 1.4 have some significant differences. It's certainly possible it's a
> Xapian bug, but if so I would expect we'd be seeing other reports, though maybe
> we've actually had one or two and thought them due to #675, which was fixed in
> 1.2.21 (however nobody's yet said "no, still seeing that"):
>
> https://trac.xapian.org/ticket/675
>
> You could look at block 615203 of docdata.glass to see what it looks like -
> that might offer clues:
>
> xxd -g1 -seek $((615203*8192)) -len 8192 docdata.glass
>
> It'd also be good to eliminate possible system issues - e.g. check the disk is
> healthy (check the SMART status, run fsck on it), run a RAM test (distros often
> provide a way to run memtest86+ or similar from the boot menu).
>
> Cheers,
> Olly
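[Editor's note: the disk checks suggested above might look something like the following sketch; the device names /dev/sda and /dev/sda1 are assumptions, so substitute whatever actually holds the database.]

```shell
# Query the drive's SMART health summary (requires smartmontools;
# /dev/sda is an assumed device name).
smartctl -H /dev/sda

# Check the filesystem in read-only mode (-n makes no changes); run this
# against the unmounted partition holding the database, e.g. from a
# rescue environment.  /dev/sda1 is an assumed partition name.
fsck -n /dev/sda1
```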
Hi Olly,

After upgrades my stack is now:

Python 2.7
Django 1.8
Haystack 2.6.0
Xapian 1.4.3 (latest xapian haystack backend with some modifications)

Using the same rebuild command as below, but with --batch-size=50000.

The issue has now become one of performance. I am indexing 2.2 million
documents. Using delve I can see that performance starts off at about
100,000 records an hour. This is consistent with the roughly 24 hour
rebuild time I was experiencing with Xapian 1.2.21 (chert). However,
after 75 hours of build time, the index is about 75% complete and
records are processing at a rate of 10,000/hr. The index is 51GB in
size, 30GB of which is position.glass.

Here is a one minute strace summary:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.97    1.272902          13    100240           pread
 33.71    0.670733          14     48175           pwrite
  0.57    0.011253           8      1484           read
  0.45    0.008938           6      1524           fstat
  0.36    0.007098           6      1270           lseek
  0.25    0.004988          20       254           open
  0.18    0.003544          14       254           recvfrom
  0.11    0.002148           8       254           sendto
  0.10    0.002056           8       254           close
  0.10    0.001949           8       254           poll
  0.07    0.001429          11       127           munmap
  0.06    0.001111           9       127           mmap
  0.04    0.000802           6       127       127 ioctl
  0.04    0.000773           6       127           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    1.989724                154471       127 total

These are documents with term counts in the 10s to low 100s range.

Is there a way I can tune for better performance?

Thanks,
Ryan

> On Mar 2, 2017, at 4:48 PM, Ryan Cross <rcross at amsl.com> wrote:
>
> Hi Olly,
>
> Thanks for the detailed response. I hadn’t realized there was a new
> xapian haystack backend. I’m going to try that but I have some upgrades
> to do first. Django 1.8, etc.
>
> Thanks,
> Ryan
>
>> On Feb 28, 2017, at 3:40 PM, Olly Betts <olly at survex.com> wrote:
>>
>> On Mon, Feb 27, 2017 at 10:29:46AM -0800, Ryan Cross wrote:
>>> I am trying to rebuild an index of 2+ million documents and have not
>>> been successful.
>>> I am running:
>>>
>>> Python 2.7
>>> Django 1.7
>>> Haystack 2.1.1
>>> Xapian 1.2.21
>>>
>>> The index rebuild command I’m using is:
>>>
>>> django-admin.py rebuild_index --noinput --batch-size=100000
>>>
>>> The rebuild completes but an immediate xapian-check returns this error:
>> [...]
>>> Trying the latest stable version, Xapian 1.4.3, it fails during the rebuild:
>>>
>>> All documents removed.
>>> Indexing 2233651 messages
>>> Traceback (most recent call last):
>>> …
>>>   File "/a/mailarch/current/haystack/management/commands/update_index.py", line 221, in handle_label
>>>     self.update_backend(label, using)
>>>   File "/a/mailarch/current/haystack/management/commands/update_index.py", line 266, in update_backend
>>>     do_update(backend, index, qs, start, end, total, self.verbosity)
>>>   File "/a/mailarch/current/haystack/management/commands/update_index.py", line 89, in do_update
>>>     backend.update(index, current_qs)
>>>   File "/a/mailarch/current/haystack/backends/xapian_backend.py", line 286, in update
>>>     database.close()
>>
>> What's the version of xapian-haystack? There's not a database.close() anywhere
>> near line 286 in git master:
>>
>> https://github.com/notanumber/xapian-haystack/blob/master/xapian_backend.py#L286
>>
>>> xapian.DatabaseCorruptError: Expected block 615203 to be level 0, not 1
>>> docdata:
>>> blocksize=8K items=380000 firstunused=21983 revision=38 levels=2 root=21410
>>
>> Is that the full output of xapian-check?
>>
>>> Any suggestions for how I could get more information to troubleshoot this
>>> failure would be greatly appreciated.
>>
>> Is the data to reproduce this something you can make available?
>>
>> I'd stick with Xapian 1.4.3 for trying to narrow this down (if it's a Xapian
>> bug we can backport the fix once identified).
>>
>> The error message means that a block which was expected to be at the leaf level
>> was actually marked as being one level above, which suggests either there's an
>> obscure bug in the backend code which only manifests in rare circumstances, or
>> something is corrupting data (could be in memory or on disk).
>>
>> Since this happens with both 1.2.x and 1.4.x, I would tend to suspect it's
>> something external (rather than a bug in Xapian), as the default backends in 1.2
>> and 1.4 have some significant differences. It's certainly possible it's a
>> Xapian bug, but if so I would expect we'd be seeing other reports, though maybe
>> we've actually had one or two and thought them due to #675, which was fixed in
>> 1.2.21 (however nobody's yet said "no, still seeing that"):
>>
>> https://trac.xapian.org/ticket/675
>>
>> You could look at block 615203 of docdata.glass to see what it looks like -
>> that might offer clues:
>>
>> xxd -g1 -seek $((615203*8192)) -len 8192 docdata.glass
>>
>> It'd also be good to eliminate possible system issues - e.g. check the disk is
>> healthy (check the SMART status, run fsck on it), run a RAM test (distros often
>> provide a way to run memtest86+ or similar from the boot menu).
>>
>> Cheers,
>> Olly
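[Editor's note: a per-syscall summary table like the one Ryan posted above can be gathered with strace's -c (count) mode; this is a sketch, and "rebuild_index" is an assumed pattern for locating the indexing process.]

```shell
# Attach to the running indexer for one minute; on detach, strace -c
# prints a summary table of syscall counts, times, and errors like the
# one quoted in this thread.  pgrep -f matches against the full command
# line ("rebuild_index" is an assumption about the process name).
timeout 60 strace -c -p "$(pgrep -f rebuild_index)"
```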
On Sat, Mar 25, 2017 at 06:36:25PM -0500, Ryan Cross wrote:
> After upgrades my stack is now:
>
> Python 2.7
> Django 1.8
> Haystack 2.6.0
> Xapian 1.4.3 (latest xapian haystack backend with some modifications)
>
> Using the same rebuild command as below but with --batch-size=50000
>
> The issue has now become one of performance. I am indexing 2.2 million
> documents. Using delve I can see that performance starts off at about
> 100,000 records an hour. This is consistent with the roughly 24 hour
> rebuild time I was experiencing with Xapian 1.2.21 (chert). However,
> after 75 hours of build time, the index is about 75% complete and
> records are processing at a rate of 10,000/hr. The index is 51GB in
> size, 30GB of which is position.glass.

One of the big differences between chert and glass is that glass stores
positional data in a different order such that phrase searches are much
more I/O efficient. The downside is that this means extra work at index
time, and more data to batch up in memory. There's a thread discussing
this here:

https://lists.xapian.org/pipermail/xapian-discuss/2016-April/009368.html

> Here is a one minute strace summary:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  63.97    1.272902          13    100240           pread
>  33.71    0.670733          14     48175           pwrite

A one minute sample is hard to extrapolate from, as the indexing process
currently goes through phases of flushing changes, so whichever phase the
one minute is from isn't going to be representative.

But from the information you give, my guess is that the extra memory used
for batching up changes is pushing you over an I/O cliff, and you would
get better throughput by reducing the batch size (assuming the "batch
size" you specify maps to XAPIAN_FLUSH_THRESHOLD or something
equivalent). That's especially likely if you tuned that batch size for
chert.
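[Editor's note: a sketch of lowering the flush threshold via the environment when running the rebuild. XAPIAN_FLUSH_THRESHOLD is a real Xapian environment variable, but the value 10000 is illustrative, and whether haystack's --batch-size interacts with it is an assumption to verify.]

```shell
# Xapian flushes batched index changes every XAPIAN_FLUSH_THRESHOLD
# documents (the default is 10000).  A smaller value means less data
# held in memory between flushes, at the cost of more frequent commits.
XAPIAN_FLUSH_THRESHOLD=10000 \
    django-admin.py rebuild_index --noinput --batch-size=10000
```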
There are some longer term plans to rework the batching and flush
process, which should improve matters a lot (and hopefully remove the
need for manually tweaking such settings). I'm hoping that will land in
the next release series, so you could consider sticking with chert for
1.4.x, assuming the problematic phrase search cases aren't an issue for
you. There are various other improvements between chert and glass
(better tracking of free space, less on-disk overhead) which you'd lose
out on, though.

Cheers,
Olly