On 14/09/2018 at 09:30, Jean-Francois Dockes wrote:
> Hi,
>
> You may be interested in how Recoll does it:
>
> https://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html
>
> A few things in the document are slightly obsolete (esp. the last
> paragraph: recollindex now does use vfork()), but it's overall quite close
> to how the current indexer works.
>
> jfd
>

Thanks for your answer. Briefly, it's No:

> The Xapian library index updating code is not designed for
> multi-threading and must stay protected from multiple accesses.

Just for evaluation purposes, could you provide me with some links to the
code showing how Recoll parallelizes "Data extraction and Conversion" and
"Term generation"?

Thanks in advance, best regards

--
Franco Martelli

Franco Martelli writes:
> On 14/09/2018 at 09:30, Jean-Francois Dockes wrote:
> > Hi,
> >
> > You may be interested in how Recoll does it:
> >
> > https://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html
> >
> > A few things in the document are slightly obsolete (esp. the last
> > paragraph: recollindex now does use vfork()), but it's overall quite close
> > to how the current indexer works.
> >
> > jfd
> >
> Thanks for your answer. Briefly, it's No:
>
> > The Xapian library index updating code is not designed for
> > multi-threading and must stay protected from multiple accesses.

Yes, obviously the Xapian part stays single-threaded.

> Just for evaluation purposes, could you provide me with some links to the
> code showing how Recoll parallelizes "Data extraction and Conversion" and
> "Term generation"?

The code repository is here:
https://opensourceprojects.eu/p/recoll1/code/

Or else download a tar release from here:
https://www.lesbonscomptes.com/recoll/download.html

The extraction code is mostly under the "internfile" directory. Look in
index/fsindexer.cpp and rcldb/rcldb.cpp for the job queues.

Get in touch with me directly if you have questions; this is not really
Xapian-related (once you've realized that the db work will stay
single-threaded).

jfd

On Fri, Sep 14, 2018 at 06:09:02PM +0200, Franco Martelli wrote:
> Thanks for your answer. Briefly, it's No:
>
> On 14/09/2018 at 09:30, Jean-Francois Dockes wrote:
> > The Xapian library index updating code is not designed for
> > multi-threading and must stay protected from multiple accesses.

That's perhaps slightly misleading - it's really the data that needs
protecting, not the code as such.

Xapian doesn't attempt to protect any of its objects from concurrent
access, but you can use different objects of the same class
concurrently.

So for example you can have two different databases open in two different
threads and index to them concurrently. You can then either just search
those databases together or use Xapian::Database::compact() (or the
xapian-compact command line tool) to merge the databases once built.

Or if your document processing stage is relatively slow and you're using
Xapian 1.4.6 or newer, you can prepare Xapian::Document objects in worker
threads and then use the new support for C++11 move semantics to cleanly
and efficiently hand them to a single indexer thread to add to the
database. Before 1.4.6 you'd have had to use some sort of locking to do
that safely.

Cheers,
    Olly

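As a rough illustration of the second approach Olly describes (worker threads prepare documents, a single indexer thread adds them), here is a minimal sketch using only the C++ standard library. Everything in it is an assumption made for illustration: `Doc` is a hypothetical stand-in for Xapian::Document so that the example builds without Xapian installed, and `DocQueue`, `prepare()` and `index_all()` are made-up helper names, not part of Xapian or Recoll. In a real indexer, the marked line is where the single thread would call add_document() on its Xapian::WritableDatabase.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for Xapian::Document (assumption, not the real class).
struct Doc {
    std::string data;
    std::vector<std::string> terms;
};

// Hypothetical hand-off queue: many producers, one consumer.
class DocQueue {
public:
    void push(Doc&& d) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(d));  // moved in, no deep copy
        }
        cv_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until a document is available; returns false once the
    // queue has been closed and fully drained.
    bool pop(Doc& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Doc> q_;
    bool closed_ = false;
};

// Stands in for the slow "extraction and term generation" stage:
// splits the text on single spaces.
Doc prepare(const std::string& text) {
    Doc d;
    d.data = text;
    std::string word;
    for (char c : text) {
        if (c == ' ') {
            if (!word.empty()) d.terms.push_back(word);
            word.clear();
        } else {
            word += c;
        }
    }
    if (!word.empty()) d.terms.push_back(word);
    return d;
}

// Prepares documents in n_workers threads and "indexes" them in one
// thread. Returns the number of documents the indexer thread received.
std::size_t index_all(const std::vector<std::string>& inputs, unsigned n_workers) {
    DocQueue queue;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&, w] {
            // Each worker takes every n_workers-th input.
            for (std::size_t i = w; i < inputs.size(); i += n_workers)
                queue.push(prepare(inputs[i]));
        });
    }
    std::size_t added = 0;
    std::thread indexer([&] {
        Doc d;
        while (queue.pop(d)) {
            // Only this thread would touch the database, e.g. by
            // calling add_document() on a Xapian::WritableDatabase.
            ++added;
        }
    });
    for (auto& t : workers) t.join();
    queue.close();   // no more producers: let the indexer drain and exit
    indexer.join();
    return added;
}
```

Calling `index_all(inputs, 4)` on a vector of input strings would run four preparation threads while keeping all database work in one thread. The locking here protects only the queue itself; the documents are owned by exactly one thread at a time.
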
On 21/09/2018 at 08:03, Olly Betts wrote:
>
> You can then either just search those databases together or use
> Xapian::Database::compact() (or the xapian-compact command line tool) to
> merge the databases once built.
>

Just my 2 cents: a commit() to the database is needed before compact(),
otherwise every attempt to search fails.

This is on Debian GNU/Linux 9.5 with
Xapian: libxapian30:amd64 1.4.3-2+deb9u1

Thanks for your answer, best regards

--
Franco Martelli