Olly Betts writes: > On Wed, Jan 04, 2017 at 07:29:58AM +0100, Jean-Francois Dockes wrote: > > Olly Betts writes: > > > The ticket has a patch which attempts to handle the OR case (which seems > > > to be the part you actually care about) but this suffers from issues with > > > object lifetimes which get a bit involved in the details. Since there > > > wasn't a working patch when we got to making the hard decisions about > > > which tickets to bump to get 1.4.0 out, and since addressing this > > > shouldn't require ABI changes, it got bumped. > > > > Thank you for this answer. > > > > I need to choose between three approaches: > > > > - Implement support at the application level. > > - Shift back to 1.2 > > - Just wait for 1.4.x > > Or help fix up the patch in the ticket? Yep. But I earnestly believe that I'm not up to the task of fiddling with Xapian internals. You may remember that I gave it a try quite a long time ago, (it was the very same issue actually), and that, if I remember well, my change did not quite do what it was supposed to do... > > I'd rather go back to 1.2 than used a patched 1.4 by the way. > > Once we have a working patch, it should be mergable into 1.4.x (I can't > see why any ABI changes would be needed) so using a patched 1.4 > shouldn't be an issue. My phrase was unclear: explanation: I could use a patched 1.4 on Windows where libxapian is bundled with recoll, but I was thinking ahead to a situation where I'd have a 1.2/1.4 choice on Linux, where bundling a patched 1.4 would not be acceptable. In the latter case, I'd rather use 1.2 because of the NEAR issue. > > This all depends on your expected schedule (I guess that this would have > > been a better term than 'plan', which is indeed described in the ticket). I > > am not asking for anything beyond information here. Do you have any idea of > > the very approximate time when the change might be implemented ? > > I had another poke at the patch and have a reworked version which solves the > object lifetime issue and works for some simple tests. Can you try it out > and see if it works for you? > > https://trac.xapian.org/ticket/508#comment:13 > > There are two limitations: > > * Only OP_OR subqueries are handled. I think supporting these would be a > useful step forward by itself, and AIUI it's all you actually need. Yes, my need arises from stem or synonym expansions occurring inside a NEAR query. This happens without the user doing anything special, so it's a problem when it causes an error. Recoll also supports multi-word synonyms which could potentially generate PHRASE subqueries inside NEAR queries, but this understandably already did not work with 1.2, so the multi-word expansions are only used when proximity is not involved (by the way, proximity of phrases does make sense in this case, if there is a wishlist somewhere, but it's admittedly not an issue that most users will be concerned with...). > * Currently the OP_OR subqueries can only have two subqueries of their own. > Lifting this restriction needs a bit of work on the new > OrPositionList class > - the old patch used a series of pairwise OrPositionList objects, but the > new patch needs a single one instead - the class needs reworking to handle > that. > > So I think the second limitation needs addressing, and of course any bugs > resolving. I am not sure that I completely understand this paragraph, but, anyway, although I have a bit of trouble reading my own code, I think that recoll will only add flat OP_OR queries as subqueries of the NEAR one. I tested the patch and it does seem to answer my selfish needs... > I can't promise anything re schedule, but hopefully we can sort this out > fairly soon. At least the solution for what's missing now is fairly clear - > we probably want to put the sub-positionlists into a min heap. See, you lost me with the last phrase, and that's why it's better that I don't get into Xapian-core internals :) Anyway it's good enough to know that a patch exists which will hopefully make its way into 1.4.x, meaning that I have no need to work on a bad application-level solution. Thanks ! Cheers, jf
On Thu, Jan 12, 2017 at 07:53:21PM +0100, Jean-Francois Dockes wrote:> Olly Betts writes: > > On Wed, Jan 04, 2017 at 07:29:58AM +0100, Jean-Francois Dockes wrote: > > > I'd rather go back to 1.2 than used a patched 1.4 by the way. > > > > Once we have a working patch, it should be mergable into 1.4.x (I can't > > see why any ABI changes would be needed) so using a patched 1.4 > > shouldn't be an issue. > > My phrase was unclear: explanation: I could use a patched 1.4 on Windows > where libxapian is bundled with recoll, but I was thinking ahead to a > situation where I'd have a 1.2/1.4 choice on Linux, where bundling a > patched 1.4 would not be acceptable. In the latter case, I'd rather use 1.2 > because of the NEAR issue.I follow, but my point was that once we have a patch we're confident works we can merge it in for the next 1.4.x release - so you shouldn't need to consider this as an option anyway.> Recoll also supports multi-word synonyms which could potentially generate > PHRASE subqueries inside NEAR queries, but this understandably already did > not work with 1.2, so the multi-word expansions are only used when proximity > is not involved (by the way, proximity of phrases does make sense in this > case, if there is a wishlist somewhere, but it's admittedly not an issue > that most users will be concerned with...).Another case for https://trac.xapian.org/ticket/508 I think.> > * Currently the OP_OR subqueries can only have two subqueries of their own. > > Lifting this restriction needs a bit of work on the new > > OrPositionList class > > - the old patch used a series of pairwise OrPositionList objects, but the > > new patch needs a single one instead - the class needs reworking to handle > > that. > > > > So I think the second limitation needs addressing, and of course any bugs > > resolving. > > I am not sure that I completely understand this paragraph, but, anyway, > although I have a bit of trouble reading my own code, I think that recoll > will only add flat OP_OR queries as subqueries of the NEAR one. I tested > the patch and it does seem to answer my selfish needs...The code I pushed before wouldn't handle an OR of more than two things, so you couldn't do a 3+-way stem expansion: (text OR texts) NEAR (search OR searches OR searched OR searching) But I've just pushed an update which will handle this.> > I can't promise anything re schedule, but hopefully we can sort this out > > fairly soon. At least the solution for what's missing now is fairly clear - > > we probably want to put the sub-positionlists into a min heap. > > See, you lost me with the last phrase, and that's why it's better that I > don't get into Xapian-core internals :)A heap is a datastructure which is good for merging ordered lists, and a min-heap just means that the tip of the heap is the smallest entry (a max-heap is probably more common). But I tried a heap and having looked at how things work in practice I concluded the heap really only benefits advancing to the next position, whereas the common operation is skipping to at least position N. In practice the cost of maintaining the heap cancels out the savings, so I've pushed a simpler approach to the branch: https://github.com/ojwb/xapian/tree/orpositionlist Can you give that some real-world testing? Cheers, Olly
Olly Betts writes: > On Thu, Jan 12, 2017 at 07:53:21PM +0100, Jean-Francois Dockes wrote: > > > Recoll also supports multi-word synonyms which could potentially > > generate PHRASE subqueries inside NEAR queries, but this > > understandably already did not work with 1.2, so the multi-word > > expansions are only used when proximity is not involved (by the way, > > proximity of phrases does make sense in this case, if there is a > > wishlist somewhere, but it's admittedly not an issue that most users > > will be concerned with...). > > Another case for https://trac.xapian.org/ticket/508 I think. The ticket only lists OP_OR as subqueries, but this case would be a bit different, because one of the OR subqueries would actually be a phrase: filesystem NEAR otherterm may be transformed by synonym expansion into: (filesystem OR (file PHRASE system)) NEAR otherterm > > > * Currently the OP_OR subqueries can only have two subqueries of > > > their own. Lifting this restriction needs a bit of work on the > > > new OrPositionList class > > > - the old patch used a series of pairwise OrPositionList > > > objects, but the > > > new patch needs a single one instead - the class needs reworking > > > to handle that. > > > > > > So I think the second limitation needs addressing, and of course > > > any bugs resolving. > > > > I am not sure that I completely understand this paragraph, but, anyway, > > although I have a bit of trouble reading my own code, I think that recoll > > will only add flat OP_OR queries as subqueries of the NEAR one. I tested > > the patch and it does seem to answer my selfish needs... > > The code I pushed before wouldn't handle an OR of more than two things, > so you couldn't do a 3+-way stem expansion: > > (text OR texts) NEAR (search OR searches OR searched OR searching) > > But I've just pushed an update which will handle this. Ok, I hadn't even noticed the limitation. Dit it silently truncated the OR list ? I did not have a formal test case for this, I just saw that the error message was gone, and that the results appeared reasonable. > > > I can't promise anything re schedule, but hopefully we can sort > > > this out fairly soon. At least the solution for what's missing now > > > is fairly clear - we probably want to put the sub-positionlists > > > into a min heap. > > > > See, you lost me with the last phrase, and that's why it's better that I > > don't get into Xapian-core internals :) > > A heap is a datastructure which is good for merging ordered lists, and > a min-heap just means that the tip of the heap is the smallest entry > (a max-heap is probably more common). Thanks for the explanation. > But I tried a heap and having looked at how things work in practice I > concluded the heap really only benefits advancing to the next position, > whereas the common operation is skipping to at least position N. In > practice the cost of maintaining the heap cancels out the savings, so > I've pushed a simpler approach to the branch: > > https://github.com/ojwb/xapian/tree/orpositionlist > > Can you give that some real-world testing? It's not real real-world, but I built a contrived set of files on pairs of stemmed words (all 2-word combinations of floor/floors/floored/flooring), and the last commit of the branch works fine, finding all docs for recollish "floor floor"p, which yields a Xapian request of: ((floors OR flooring OR floored OR floor) NEAR 12 (floors OR flooring OR floored OR floor)) But, actually, so does the previous version (commit 389dfb319a66), which explains why I had not understood what the limitation was. Both versions also work fine with "floor floor floor"p: (floors OR flooring OR floored OR floor) NEAR 13 (floors OR flooring OR floored OR floor) NEAR 13 (floors OR flooring OR floored OR floor) So: me happy but confused... Cheers, jf