EJ Johnson
2007-Jul-07 18:05 UTC
[Xapian-discuss] Python bindings not freeing memory during indexing
Hi list,

I'm new to Xapian (great stuff!!) but am running into a problem that I haven't seen explicitly mentioned on the list before.

I'm using the Python bindings for Xapian 1.0.1 on Ubuntu Dapper 6.06 LTS, using xapian.org as my repository. My hardware is an HP DL385 G2: two dual-core AMD Opterons with 8G RAM.

I'm trying to index a good chunk of documents and have a Python indexer iterating through the docs and adding them to the DB. I get up to about 45,000 docs and it croaks. Sometimes it throws some malloc error, and the last time it just segfaulted. Essentially, the indexer process continues to use more and more RAM until it dies. It really only makes it up to about 3G of RAM before dying, and it never hits swap.

I tried tweaking XAPIAN_FLUSH_THRESHOLD, but that doesn't seem to matter. I've tried using transactions, and even called flush() after every 1000 docs, to no avail. I've also tried destroying my DB handle to see if that would free up memory, by setting it to None and then re-opening the DB every 1000 docs.

I've finally found a workaround: a wrapper script calls my indexer as a separate process for each batch of 1000 docs. That seems to have solved my memory consumption problem, and it appears I'll finally be able to index my entire data set.

Here's a snippet of a log that was tracking my DB/memory usage; it was essentially the same for every failed attempt to work around the memory consumption. The first line is from calling "du -sh" on my DB directory, the second line is a snippet from "top" (showing 1.3g RAM for the process, using 16.2% of RAM, running for 16 minutes), and the other lines are from delve.

    ==================================================================
    Space on disk: 403M    xapdb
    26611 ej.johns  17   0 1338m 1.3g 3728 D   97 16.2  16:38.51 ticketloader.py
    Number of documents: 40000
    Highest doc number: 40000
    Average doc length: 630.6937
    ==================================================================
    Space on disk: 407M    xapdb
    26611 ej.johns  21   0 1347m 1.3g 3728 R  101 16.3  16:45.63 ticketloader.py
    Number of documents: 41000
    Highest doc number: 41000
    Average doc length: 630.53604878
    ==================================================================
    Space on disk: 409M    xapdb
    26611 ej.johns  16   0 1357m 1.3g 3728 S   28 16.5  16:52.09 ticketloader.py
    Number of documents: 41000
    Highest doc number: 41000
    Average doc length: 630.53604878
    ==================================================================

This next snippet shows the same output from a run where I use my wrapper script to call out to the indexer in a separate process:
    ==================================================================
    Space on disk: 409M    xapdb
    29376 ej.johns  15   0 29000  23m 3676 S   36  0.3   0:03.51 ticketloader.py
    Number of documents: 40203
    Highest doc number: 40203
    Average doc length: 698.684103176
    ==================================================================
    Space on disk: 411M    xapdb
    29389 ej.johns  16   0 23076  17m 3676 S   22  0.2   0:01.27 ticketloader.py
    Number of documents: 40625
    Highest doc number: 40625
    Average doc length: 697.493636923
    ==================================================================
    Space on disk: 413M    xapdb
    29405 ej.johns  16   0 19304  13m 3676 S   28  0.2   0:00.51 ticketloader.py
    Number of documents: 40953
    Highest doc number: 40953
    Average doc length: 697.187190194
    ==================================================================
    Space on disk: 414M    xapdb
    29421 ej.johns  16   0 21328  15m 3672 S   18  0.2   0:00.73 ticketloader.py
    Number of documents: 41177
    Highest doc number: 41177
    Average doc length: 696.544842995
    ==================================================================

So, you can see that the number of docs, disk space, doc length, etc. are basically the same. The only difference is the amount of memory consumed during a single run versus individual runs in separate processes.

My next step is to recompile Xapian and the Python bindings from source (1.0.2 is out now) and see if that helps. Any other thoughts or suggestions are greatly appreciated!

Thanks in advance,
Eric
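P.S. In case the shape of the code matters, here's a stripped-down sketch of the indexing loop and the process-per-batch workaround. The paths, helper names, and document source are placeholders, not my actual script:

    import subprocess
    import sys
    import xapian

    DBPATH = "xapdb"   # placeholder path
    BATCH = 1000

    def index_batch(texts):
        # Open (or create) the database and index one batch of documents.
        db = xapian.WritableDatabase(DBPATH, xapian.DB_CREATE_OR_OPEN)
        tg = xapian.TermGenerator()
        tg.set_stemmer(xapian.Stem("english"))
        for text in texts:
            doc = xapian.Document()
            doc.set_data(text)
            tg.set_document(doc)
            tg.index_text(text)
            db.add_document(doc)
        db.flush()  # commit pending changes (flush() in the 1.0.x API)

    def run_in_batches(total):
        # The workaround: one fresh process per batch, so the OS reclaims
        # everything when each batch exits. ticketloader.py would call
        # index_batch() for the range given on its command line.
        for start in range(0, total, BATCH):
            subprocess.call([sys.executable, "ticketloader.py",
                             str(start), str(BATCH)])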
Richard Boulton
2007-Jul-07 18:24 UTC
[Xapian-discuss] Python bindings not freeing memory during indexing
EJ Johnson wrote:
> Hi list,
>
> I'm new to Xapian (great stuff!!) but am running into a problem that I
> haven't seen explicitly mentioned on the list before.
>
> I'm using the Python bindings for Xapian 1.0.1 on Ubuntu Dapper 6.06 LTS
> using xapian.org as my repository. My hardware is an HP DL385 G2, two
> dual-core AMD Opterons with 8G RAM.
>
> I'm trying to index a good chunk of documents and have a Python indexer
> iterating through the docs and adding them to the DB. I get up to about
> 45,000 docs and it croaks. Sometimes it throws some malloc error and the
> last time it just segfaulted. Essentially, the indexer process continues
> to use more and more RAM until it dies. It really only makes it up to
> about 3G of RAM before dying and it never hits swap.

That's odd: I'd expect it to be able to get up past the amount of physical memory before being killed off. Have you been able to determine why it dies? i.e., is there an OOM killer running, or is it due to an internal error?

I've indexed some fairly large datasets with Xapian 1.0.1 using the Python bindings (around 20GB databases), with no problems like this.

Which version of Python are you using? I wonder if the problem could be Python, rather than Xapian: it's fairly easy to fail to delete objects in Python, and if there was a memory leak there, that could be the cause of the problem. Or maybe something in Python itself is leaking.

If you want to send your indexing script, I'll take a brief look at it and see if there's anything obviously wrong (probably send it direct to me, since the list won't accept attachments). I've also got a copy of valgrind set up to run Python programs, so if you send me a couple of documents of sample data, I can try that out.

> So, you can see that the number of docs, disk space, doc length, etc.
> are basically the same.

Well, actually the average document length is quite significantly smaller in the first set of log entries; something odd is definitely going on there.

> My next step is to recompile Xapian and the Python bindings from source
> (1.0.2 is out now) and see if that helps. Any other thoughts or
> suggestions are greatly appreciated!

There are updated packages available in the xapian.org/debian repository, too, if you want to try those.

-- 
Richard
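P.S. One cheap way to tell whether the leak is on the Python side is to count live objects from the xapian module as you index. A rough, Linux-specific sketch (call it every 1000 documents or so from the indexing loop):

    import gc

    def report_memory(tag):
        # Resident set size from /proc (Linux-specific).
        rss = "?"
        for line in open("/proc/self/status"):
            if line.startswith("VmRSS:"):
                rss = " ".join(line.split()[1:3])
                break
        # Heuristic: count live objects whose type comes from the
        # xapian extension module.
        live = sum(1 for o in gc.get_objects()
                   if getattr(type(o), "__module__", "") == "xapian")
        print "%s: rss=%s, live xapian objects=%d" % (tag, rss, live)

If the object count grows along with RSS, something is holding references on the Python side; if RSS grows while the count stays flat, the allocation is happening down in the C++ library.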
Richard Boulton
2007-Jul-08 10:41 UTC
[Xapian-discuss] Python bindings not freeing memory during indexing
EJ Johnson wrote:
> Thanks for taking the time to review my indexer Richard. I don't have
> an easy way to send you sample documents, but can describe them.

<snip>

> So, I guess the only thing you can really do (since you can't really
> test the code) is to tell me if I'm doing something stupid with the API.

I've looked through the code, and it all looks perfectly sensible to me.

> I'm using Python 2.4.3 that ships by default on Ubuntu 6.06 LTS server
> edition (dapper).

That should be fine.

> As far as the errors, the first set of errors I was seeing was a failed
> call to st95_malloc (or something like that) that seemed to be thrown
> from the SWIG bindings. I'd have to re-run my older code and let it run
> for 120 minutes to regenerate the error.

Have you re-tried with the 1.0.2 release?

The best way forward, I think, is to re-run the test with the workarounds removed so that the error occurs (if it still does with 1.0.2), and to keep the full output so we can pick through it.

Looking into why there are different document sizes with and without your workaround would be useful, too. You could do this by comparing some random documents, using the delve tool, in the indexes built with and without the workaround.

One other question - is the process you're running the indexer in using multiple (Python) threads? There was a bug in 1.0.1 which could have caused corruption in this case - this is fixed in 1.0.2.

-- 
Richard
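P.S. The same comparison can also be done from the Python bindings instead of delve. A quick sketch (the database paths are made up, and the docids only line up if both runs added documents in the same order):

    import xapian

    def compare_doc(path_single, path_wrapper, docid):
        # Open both indexes read-only and compare the same document id.
        for name, path in (("single-run", path_single),
                           ("wrapper", path_wrapper)):
            db = xapian.Database(path)
            doc = db.get_document(docid)
            terms = [(t.term, t.wdf) for t in doc.termlist()]
            print "%s: doclength=%s, %d distinct terms" % (
                name, db.get_doclength(docid), len(terms))

    compare_doc("xapdb-single", "xapdb-wrapper", 12345)

If the term counts differ for the same source document, that would narrow down where the extra length is coming from.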