EJ Johnson
2007-Jul-07 18:05 UTC
[Xapian-discuss] Python bindings not freeing memory during indexing
Hi list,

I'm new to Xapian (great stuff!!) but am running into a problem that I haven't seen explicitly mentioned on the list before.

I'm using the Python bindings for Xapian 1.0.1 on Ubuntu Dapper 6.06 LTS, using xapian.org as my repository. My hardware is an HP DL385 G2: two dual-core AMD Opterons with 8G RAM.

I'm trying to index a good chunk of documents and have a Python indexer iterating through the docs and adding them to the DB. I get up to about 45,000 docs and it croaks. Sometimes it throws some malloc error, and the last time it just segfaulted. Essentially, the indexer process continues to use more and more RAM until it dies. It really only makes it up to about 3G of RAM before dying, and it never hits swap.

I tried tweaking XAPIAN_FLUSH_THRESHOLD, but that doesn't seem to matter. I've tried using transactions, and even called flush() after every 1000 docs, to no avail. I've also tried destroying my DB handle to see if that would free up memory, by setting it to None and then re-opening the DB every 1000 docs.

I've finally found a workaround: a wrapper script calls my indexer as a separate process for each batch of 1000 docs. That seems to have solved my memory consumption problem, and it appears I'll finally be able to index my entire data set.

Here's a snippet of a log that was tracking my DB/memory usage; it was essentially the same for every failed attempt to work around the memory consumption. The first line is from calling "du -sh" on my DB directory, the second line is a snippet from "top" (showing 1.3g RAM for the process, using 16.2% of RAM, running for 16 minutes), and the other lines are from delve.

    ==================================================================
    Space on disk: 403M    xapdb
    26611 ej.johns  17   0 1338m 1.3g 3728 D   97 16.2  16:38.51 ticketloader.py
    Number of documents: 40000
    Highest doc number: 40000
    Average doc length: 630.6937
    ==================================================================
    Space on disk: 407M    xapdb
    26611 ej.johns  21   0 1347m 1.3g 3728 R  101 16.3  16:45.63 ticketloader.py
    Number of documents: 41000
    Highest doc number: 41000
    Average doc length: 630.53604878
    ==================================================================
    Space on disk: 409M    xapdb
    26611 ej.johns  16   0 1357m 1.3g 3728 S   28 16.5  16:52.09 ticketloader.py
    Number of documents: 41000
    Highest doc number: 41000
    Average doc length: 630.53604878
    ==================================================================

This next snippet shows the same output from a run where I use my wrapper script to call out to the indexer in a separate process:
    ==================================================================
    Space on disk: 409M    xapdb
    29376 ej.johns  15   0 29000  23m 3676 S   36  0.3   0:03.51 ticketloader.py
    Number of documents: 40203
    Highest doc number: 40203
    Average doc length: 698.684103176
    ==================================================================
    Space on disk: 411M    xapdb
    29389 ej.johns  16   0 23076  17m 3676 S   22  0.2   0:01.27 ticketloader.py
    Number of documents: 40625
    Highest doc number: 40625
    Average doc length: 697.493636923
    ==================================================================
    Space on disk: 413M    xapdb
    29405 ej.johns  16   0 19304  13m 3676 S   28  0.2   0:00.51 ticketloader.py
    Number of documents: 40953
    Highest doc number: 40953
    Average doc length: 697.187190194
    ==================================================================
    Space on disk: 414M    xapdb
    29421 ej.johns  16   0 21328  15m 3672 S   18  0.2   0:00.73 ticketloader.py
    Number of documents: 41177
    Highest doc number: 41177
    Average doc length: 696.544842995
    ==================================================================

So, you can see that the number of docs, disk space, doc length, etc. are basically the same. The only difference is the amount of memory consumed during a single run versus individual runs in separate processes.

My next step is to recompile Xapian and the Python bindings from source (1.0.2 is out now) and see if that helps. Any other thoughts or suggestions are greatly appreciated!

Thanks in advance,
Eric
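P.S. In case the shape of the code matters, here's a stripped-down sketch of the indexing loop and the process-per-batch workaround. The paths, helper names, and document source are placeholders, not my actual script:

    import subprocess
    import sys
    import xapian

    DBPATH = "xapdb"   # placeholder path
    BATCH = 1000

    def index_batch(texts):
        # Open (or create) the database and index one batch of documents.
        db = xapian.WritableDatabase(DBPATH, xapian.DB_CREATE_OR_OPEN)
        tg = xapian.TermGenerator()
        tg.set_stemmer(xapian.Stem("english"))
        for text in texts:
            doc = xapian.Document()
            doc.set_data(text)
            tg.set_document(doc)
            tg.index_text(text)
            db.add_document(doc)
        db.flush()  # commit pending changes (flush() in the 1.0.x API)

    def run_in_batches(total):
        # The workaround: one fresh process per batch, so the OS reclaims
        # everything when each batch exits. ticketloader.py would call
        # index_batch() for the range given on its command line.
        for start in range(0, total, BATCH):
            subprocess.call([sys.executable, "ticketloader.py",
                             str(start), str(BATCH)])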
Richard Boulton
2007-Jul-07 18:24 UTC
[Xapian-discuss] Python bindings not freeing memory during indexing
EJ Johnson wrote:
> Hi list,
>
> I'm new to Xapian (great stuff!!) but am running into a problem that I
> haven't seen explicitly mentioned on the list before.
>
> I'm using the Python bindings for Xapian 1.0.1 on Ubuntu Dapper 6.06 LTS
> using xapian.org as my repository. My hardware is an HP DL385 G2, two
> dual-core AMD Opterons with 8G RAM.
>
> I'm trying to index a good chunk of documents and have a Python indexer
> iterating through the docs and adding them to the DB. I get up to about
> 45,000 docs and it croaks. Sometimes it throws some malloc error and the
> last time it just segfaulted. Essentially, the indexer process continues
> to use more and more RAM until it dies. It really only makes it up to
> about 3G of RAM before dying and it never hits swap.

That's odd: I'd expect it to be able to get up past the amount of physical memory before being killed off. Have you been able to determine why it dies? i.e., is there an OOM killer running, or is it due to an internal error?

I've indexed some fairly large datasets with Xapian 1.0.1 using the Python bindings (around 20GB databases), with no problems like this.

Which version of Python are you using? I wonder if the problem could be Python, rather than Xapian: it's fairly easy to fail to delete objects in Python, and if there was a memory leak there, that could be the cause of the problem. Or maybe something in Python itself is leaking.

If you want to send your indexing script, I'll take a brief look at it and see if there's anything obviously wrong (probably send it direct to me, since the list won't accept attachments). I've also got a copy of valgrind set up to run Python programs, so if you send me a couple of documents of sample data, I can try that out.

> So, you can see that the number of docs, disk space, doc length, etc.
> are basically the same.

Well, actually the average document length is quite significantly smaller in the first set of log entries; something odd is definitely going on there.

> My next step is to recompile Xapian and the Python bindings from source
> (1.0.2 is out now) and see if that helps. Any other thoughts or
> suggestions are greatly appreciated!

There are updated packages available in the xapian.org/debian repository, too, if you want to try those.

-- 
Richard
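P.S. One cheap way to tell whether the leak is on the Python side is to count live objects from the xapian module as you index. A rough, Linux-specific sketch (call it every 1000 documents or so from the indexing loop):

    import gc

    def report_memory(tag):
        # Resident set size from /proc (Linux-specific).
        rss = "?"
        for line in open("/proc/self/status"):
            if line.startswith("VmRSS:"):
                rss = " ".join(line.split()[1:3])
                break
        # Heuristic: count live objects whose type comes from the
        # xapian extension module.
        live = sum(1 for o in gc.get_objects()
                   if getattr(type(o), "__module__", "") == "xapian")
        print "%s: rss=%s, live xapian objects=%d" % (tag, rss, live)

If the object count grows along with RSS, something is holding references on the Python side; if RSS grows while the count stays flat, the allocation is happening down in the C++ library.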
Richard Boulton
2007-Jul-08 10:41 UTC
[Xapian-discuss] Python bindings not freeing memory during indexing
EJ Johnson wrote:
> Thanks for taking the time to review my indexer Richard. I don't have
> an easy way to send you sample documents, but can describe them.

<snip>

> So, I guess the only thing you can really do (since you can't really
> test the code) is to tell me if I'm doing something stupid with the API.

I've looked through the code, and it all looks perfectly sensible to me.

> I'm using Python 2.4.3 that ships by default on Ubuntu 6.06 LTS server
> edition (dapper).

That should be fine.

> As far as the errors, the first set of errors I was seeing was a failed
> call to st95_malloc (or something like that) that seemed to be thrown
> from the SWIG bindings. I'd have to re-run my older code and let it run
> for 120 minutes to regenerate the error.

Have you re-tried with the 1.0.2 release?

The best way forward, I think, is to re-run the test with the workarounds removed so that the error occurs (if it still does with 1.0.2), and to keep the full output so we can pick through it.

Looking into why there are different document sizes with and without your workaround would be useful, too. You could do this by comparing some random documents, using the delve tool, in the indexes built with and without the workaround.

One other question - is the process you're running the indexer in using multiple (Python) threads? There was a bug in 1.0.1 which could have caused corruption in this case - this is fixed in 1.0.2.

-- 
Richard
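P.S. The same comparison can also be done from the Python bindings instead of delve. A quick sketch (the database paths are made up, and the docids only line up if both runs added documents in the same order):

    import xapian

    def compare_doc(path_single, path_wrapper, docid):
        # Open both indexes read-only and compare the same document id.
        for name, path in (("single-run", path_single),
                           ("wrapper", path_wrapper)):
            db = xapian.Database(path)
            doc = db.get_document(docid)
            terms = [(t.term, t.wdf) for t in doc.termlist()]
            print "%s: doclength=%s, %d distinct terms" % (
                name, db.get_doclength(docid), len(terms))

    compare_doc("xapdb-single", "xapdb-wrapper", 12345)

If the term counts differ for the same source document, that would narrow down where the extra length is coming from.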