jarrod roberson
2006-Jul-16 01:24 UTC
[Xapian-discuss] Strange behavior with the Python bindings and threads
I have a program that walks the file system indexing files. There are three steps:

1. open the file and read the info I need to index.
2. create a Xapian document, populate it, and add or replace as needed.
3. update the file with the docid and write it back out to disk.

Doing this serially is averaging 10ms per file on my test server. At that rate it is going to take me ~9 days to index the 75.5 million files! The test server is a QUAD processor Linux machine with 6GB of RAM and a RAID5 array of 5 high-speed disks; the filesystem is Reiser.

So I decided to try and thread the process, since it is COMPLETELY I/O bound:

1. the first thread walks the files and inserts them into a queue (using the Python queue.Queue()).
2. a second thread "gets" from the first queue, creates the document, adds/replaces the doc in the index, updates the docid in the file, and puts it in another queue for closure.
3. a third thread "gets" from the second queue (the already-indexed files queue) and then closes the file (which writes the updates to the disk).

On my 17" Powerbook G4 the threaded version works great: it queues up the files, indexes them, and closes them in a completely async pattern. Each thread reports batches of hits when it runs, and it sped up the processing by about 33%.

When I moved this code to my test server, everything ran SERIALLY. I coded all the put() and get() calls with a 3 second timeout, and it basically puts the file in the first queue, times out, the second thread indexes it, times out, the third thread closes it and then times out, and then the next file runs.

The only thing we can think of is the SWIG Python bindings aren't releasing the GIL correctly or something?

Any ideas? A 30% speed up on ~9 days is significant!
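For readers following along, the three-thread layout described above might look roughly like the sketch below. It is not the poster's code: the names `walker`, `indexer`, `closer`, `run_pipeline`, and the `index_file` callback are all hypothetical, the Xapian add/replace step is replaced by a caller-supplied placeholder, and blocking `get()` calls with a sentinel are used instead of the 3 second timeouts mentioned in the post. It uses the Python 3 module name `queue` (the module was called `Queue` in Python 2).

```python
import queue
import threading

SENTINEL = None  # pushed through each queue to signal "no more work"


def walker(paths, to_index):
    """Stage 1: feed file paths into the first queue."""
    for path in paths:
        to_index.put(path)
    to_index.put(SENTINEL)


def indexer(to_index, to_close, index_file):
    """Stage 2: index each path, then hand it on to the closing stage."""
    while True:
        path = to_index.get()
        if path is SENTINEL:
            to_close.put(SENTINEL)  # propagate shutdown downstream
            break
        # index_file is a placeholder for the Xapian add/replace step;
        # it is assumed to return the docid for the path.
        docid = index_file(path)
        to_close.put((path, docid))


def closer(to_close, done):
    """Stage 3: stand-in for writing the docid back and closing the file."""
    while True:
        item = to_close.get()
        if item is SENTINEL:
            break
        done.append(item)


def run_pipeline(paths, index_file, maxsize=1000):
    """Run the three stages on separate threads and return the results."""
    to_index = queue.Queue(maxsize)  # bounded so the walker cannot run far ahead
    to_close = queue.Queue(maxsize)
    done = []
    threads = [
        threading.Thread(target=walker, args=(paths, to_index)),
        threading.Thread(target=indexer, args=(to_index, to_close, index_file)),
        threading.Thread(target=closer, args=(to_close, done)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Note that in CPython the stages only overlap while the indexing call releases the GIL (for example during blocking I/O); if the extension module holds it for the whole call, the other threads cannot run Python code in the meantime.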
Olly Betts
2006-Jul-16 01:53 UTC
[Xapian-discuss] Strange behavior with the Python bindings and threads
On Sat, Jul 15, 2006 at 08:23:49PM -0400, jarrod roberson wrote:
> The only thing we can think of is the SWIG Python bindings aren't releasing
> the GIL correctly or something?

I don't know enough about them to say. You could look at the generated code (the C++ is python/xapian_wrap.cc and the Python is in python/modern/ for Python 2.2 and later) or ask on the SWIG mailing list.

> Any ideas? A 30% speed up on ~9 days is significant!

If you're looking for speedups, did you set XAPIAN_FLUSH_THRESHOLD in the environment? With 6GB of RAM you should be able to set it a lot higher than the default of 10000 (try a million or maybe more). If you're I/O bound, the biggest speedups will come from simply reducing the amount of I/O!

Cheers,
Olly
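As an illustration of the XAPIAN_FLUSH_THRESHOLD suggestion, one way to set the variable from Python is to build an environment for the indexing process before it starts. This is only a sketch: the helper name `indexer_env` is hypothetical, and the value of one million is simply the starting point suggested above, not a tuned figure.

```python
import os


def indexer_env(flush_threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set.

    base_env defaults to the current process environment; the original
    mapping is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Environment values must be strings.
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    return env
```

The returned mapping could then be passed as the `env` argument of `subprocess.run` when launching the indexer script, so the variable is in place before the database is opened.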