Since a couple of weeks we are experiencing occasional segmentation faults within Xapian 1.5. We can't reproduce the crashes, but we have strong hints that they are due to memory corruption. We have narrowed down our root cause analysis to phrase searches on multi-databases that fail on reading the `hint` field in the`QueryOptimiser`class [1]. We'd appreciate any hints on how to fix this. I've written up our findings and solution attempts below. Should we post this on trac? Our findings so far ===============In a core dump we see that calling the `open_nearby_postlist` function on the `hint` variable [2] falls of a cliff, resulting in a segfault: (gdb) bt 2 #0 0x000000000001eaa1 in ?? () #1 0x00007fa19d09231f in LocalSubMatch::open_post_list (this=0x13527d0, term=..., wqf=1, factor=1, need_positions=<optimized out>, in_synonym=<optimized out>, qopt=0x7ffe66370940, lazy_weight=false) at matcher/localsubmatch.cc:289 the line at localsubmatch.cc:289 is pl = hint->open_nearby_postlist(term); Unfortunately, the compiler had optimised a lot of debugging information away. Still, it's clear where the system crashed. Following a similar crash and re-running the query we could not reproduce the crash, but valgrind catched a read-after-free on the `hint` field (full valgrind log attached): ==2265126== Invalid read of size 8 ==2265126== at 0x9A6B313: LocalSubMatch::open_post_list(std::string const&, unsigned int, double, bool, bool, QueryOptimiser*, bool) (localsubmatch.cc:289) which got free'd in this codepath: ==2265126== Address 0xb3fdfd0 is 0 bytes inside a block of size 216 free'd ==2265126== at 0x4C2A360: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==2265126== by 0x99C0B75: operator() (queryinternal.cc:65) ==2265126== by 0x99C0B75: for_each<__gnu_cxx::__normal_iterator<Xapian::PostingIterator::Internal**, std::vector<Xapian::PostingIterator::Internal*> >, delete_ptr<Xapian::PostingIterator::Internal> > (stl_algo.h:3755) ==2265126== by 0x99C0B75: Xapian::Internal::Context::reset() (queryinternal.cc:155) ==2265126== by 0x99C481F: Xapian::Internal::QueryWindowed::postlist_windowed(Xapian::Query::op, Xapian::Internal::AndContext&, QueryOptimiser*, double) const (queryinternal.cc:1668) which is triggered by this condition in queryinternal.cc#L1665 ([2]); if (!qopt->db.has_positions()) { // No positions in this subdatabase so this matches nothing, // which means the whole andcontext matches nothing. ctx.reset(); return; } How to fix this ===========Here is what I believe is happening: We are using subdatabases (all glass) for caching recent database additions, and some of these do not contain positional information. This makes the internal query code reset the AND-context, which in effect frees its postlist. One of the postlist entries is still pointed at by the `hint` field of QueryOptimiser from a previous submatch, and the next call to `get_hint_postlist` in queryoptimiser.h#L106 references the bogus address. A fix to avoid this is simple: just reset the QueryOptimiser hint field when free'ing the context postlist. I've written a one-liner for this here [3]. But I'm not yet convinced that's all there is: could the hint field be used already somewhere else? Should we probably keep it and make QueryOptimiser take ownership? [1] https://github.com/xapian/xapian/blob/master/xapian-core/matcher/queryoptimiser.h#L51 [2] https://github.com/xapian/xapian/blob/master/xapian-core/api/queryinternal.cc#L1665 [3] https://github.com/rsto/xapian/commit/3e7d65b25eef00347f5c764af5ff4d770433ac9b [4] https://github.com/xapian/xapian/blob/master/xapian-core/matcher/queryoptimiser.h#L106 Cheers, Robert -------------- next part -------------- A non-text attachment was scrubbed... Name: valgrind.log Type: application/octet-stream Size: 5361 bytes Desc: not available URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20170731/a0dd55ba/attachment.obj>
On Mon, Jul 31, 2017 at 09:24:29AM +0200, Robert Stepanek wrote:> Since a couple of weeks we are experiencing occasional segmentation > faults within Xapian 1.5. We can't reproduce the crashes, but we have > strong hints that they are due to memory corruption. We have narrowed > down our root cause analysis to phrase searches on multi-databases that > fail on reading the `hint` field in the`QueryOptimiser`class [1]. > > We'd appreciate any hints on how to fix this. I've written up our > findings and solution attempts below. Should we post this on trac?Yes, it'd be good to have a ticket to track this.> In a core dump we see that calling the `open_nearby_postlist` function > on the `hint` variable [2] falls of a cliff, resulting in a segfault:This mechanism is mainly there to speed up wildcard expansion - in this case we look up a slew of terms in sorted order that will be close to each other, so we cache the PostList object from the previous term and use that as a hint for where to start looking to find the next term. Although it's mainly for wildcards, this mechanism is actually in place for all PostList lookups that happen when a query is run, as other situations can benefit too.> Here is what I believe is happening: > > We are using subdatabases (all glass) for caching recent database > additions, and some of these do not contain positional information. This > makes the internal query code reset the AND-context, which in effect > frees its postlist. One of the postlist entries is still pointed at by > the `hint` field of QueryOptimiser from a previous submatch, and the > next call to `get_hint_postlist` in queryoptimiser.h#L106 references the > bogus address.That sounds very plausible.> A fix to avoid this is simple: just reset the QueryOptimiser hint field > when free'ing the context postlist. I've written a one-liner for this > here [3].At first impression this seems a bit crude as we don't know that the hint is affected, but in practical terms I suspect it's rare that it matters, and it is at least a simple approach.> But I'm not yet convinced that's all there is: could the hint > field be used already somewhere else?I'll have to remind myself how this works in more detail to say for sure if it could be used elsewhere, but I suspect it can't be.> Should we probably keep it and make QueryOptimiser take ownership?That's exactly how I fixed a similar earlier issue (see commit 2299e1d21e39f1295c81833ccd5037f746f4744a). We should probably address both issues consistently. If setting the hint to NULL works, we should probably evaluate that approach as it avoids the overhead of tracking ownership and of checking if we're about to delete the current hint. The code that's causing the problem was added in commit 7a4d4f5f597aba1c625014220856fe5bc786bcc0 which was backported for 1.4.3, so I'd expect this also affects 1.4.3. Cheers, Olly
Olly, thanks for your feedback. On Mon, Jul 31, 2017, at 23:29, Olly Betts wrote:> On Mon, Jul 31, 2017 at 09:24:29AM +0200, Robert Stepanek wrote: > > We'd appreciate any hints on how to fix this. I've written up our > > findings and solution attempts below. Should we post this on trac? > > Yes, it'd be good to have a ticket to track this.I've created ticket #752 (https://trac.xapian.org/ticket/752).> > Should we probably keep it and make QueryOptimiser take ownership? > > That's exactly how I fixed a similar earlier issue (see commit > 2299e1d21e39f1295c81833ccd5037f746f4744a). > > We should probably address both issues consistently. If setting the hint > to NULL works, we should probably evaluate that approach as it avoids > the overhead of tracking ownership and of checking if we're about to > delete the current hint.We are testing the patch to set hint to NULL currently. I'll wait this week to see if we experience any crashers related to it. I am not deep enough into the query optimiser code to know which approach is better in term of performance. If taking ownership is preferred I could make the context resetter aware which postlist member to skip during the free(). Cheers, Robert