Jean-Francois Dockes
2024-Mar-15 19:15 UTC
Using multiple temporary indexes during updates
Hi,

I have been playing at converting the index update stage of the Recoll indexer to use multiple temporary indexes and a final merge.

This yields an improvement factor of almost 3 (on my quad-core CPU), for the total indexing time for "easy" files like HTML pages. This is nice (!) and I wanted to share my admiration for the "compact()" method.

If someone is interested in a bit more detail:
https://www.recoll.org/pages/idxthreads/threadingRecoll.html#_the_xapian_bottleneck_and_how_it_was_resolved_thanks_to_xapian

Cheers,
jf
On Fri, Mar 15, 2024 at 08:15:55PM +0100, Jean-Francois Dockes wrote:
> I have been playing at converting the index update stage of the Recoll indexer to use
> multiple temporary indexes and a final merge.
>
> This yields an improvement factor of almost 3 (on my quad-core CPU), for the total
> indexing time for "easy" files like HTML pages. This is nice (!) and I wanted to share my
> admiration for the "compact()" method.
>
> If someone is interested in a bit more detail:
> https://www.recoll.org/pages/idxthreads/threadingRecoll.html#_the_xapian_bottleneck_and_how_it_was_resolved_thanks_to_xapian

Nice write-up!

It'd be helpful to note the Xapian version you're using for such benchmarking as the results are likely to evolve over time.

Also are you using Xapian::DBCOMPACT_MULTIPASS? The linked page doesn't seem to say. In theory it should be faster when merging many databases, but Tom Mortimer reported he found it slower. That was a long time ago, but I've never managed to get around to profiling to see what was going on or if it was even still the case (probably makes most sense to do at the same time as implementing https://trac.xapian.org/ticket/444 ).

Incidentally, for the "fork() on a large process is slow" bit at the end, posix_spawn() may help assuming it's flexible enough to do what you want. The glibc implementation calls "clone(2) with CLONE_VM and CLONE_VFORK flags".

Cheers,
Olly
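As an illustration of the posix_spawn() suggestion, the following is a minimal sketch of starting a helper command without calling fork() directly. The function name `spawn_and_wait` is hypothetical, and the sketch passes no file actions or spawn attributes; whether posix_spawn() is flexible enough for what Recoll needs (redirections, signal setup and so on) is not something this example tries to answer.

```cpp
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <cerrno>
#include <string>
#include <vector>

extern char** environ;

// Hypothetical helper: run args[0] (looked up via PATH) with the given
// arguments, wait for it, and return its exit status. Returns -1 if the
// process could not be started or did not exit normally.
int spawn_and_wait(const std::vector<std::string>& args) {
    if (args.empty()) return -1;

    // posix_spawnp() wants a NULL-terminated array of char*.
    std::vector<char*> argv;
    for (const auto& arg : args) {
        argv.push_back(const_cast<char*>(arg.c_str()));
    }
    argv.push_back(nullptr);

    pid_t pid;
    // No file actions and no attributes: the child inherits the parent's
    // file descriptors and environment.
    int rc = posix_spawnp(&pid, argv[0], nullptr, nullptr, argv.data(),
                          environ);
    if (rc != 0) return -1;

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;  // retry only on interruption
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `spawn_and_wait({"/bin/sh", "-c", "exit 3"})` should return 3 on a system with a standard /bin/sh.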