Olly Betts
2005-Oct-18 06:07 UTC
[Xapian-devel] Re: [Xapian-commits] 6355: trunk/xapian-applications/omega/ trunk/xapian-applications/omega/docs/
On Fri, Jul 29, 2005 at 10:08:13AM +0100, james wrote:
> SVN root: svn://svn.xapian.org/xapian
> Changes by: james
> Revision: 6355
> Date: 2005-07-29 10:08:13 +0100 (Fri, 29 Jul 2005)
>
> Log message (6 lines):
> omindex.cc: add --preserve-nonduplicates / -p option to not delete any
> documents that aren't updated, in replace duplicates mode (so that
> multiple runs of omindex on different subsites don't stomp on each
> other).

This fix seems to be avoiding the real issue, so it's less than ideal, I feel.

Looking at the code, what it's really doing is turning off half of
"skip_duplicates" - the bit at the end of the run where we delete any
documents we've not seen (on the assumption that they've been deleted
from the document tree since the previous index run).

(Although I notice it still creates and updates the bitmap we use to
track deleted documents, but that's easy enough to fix...)

The half of "skip_duplicates" which it leaves enabled is the code to
replace documents which have the same URL (rather than not updating
them, as "skip_duplicates" does).

The motivation for this option is as described in the log message above,
and this is a genuine problem with my deleted document removal code.
But if I have multiple subsites, deleted documents should still get
removed from the index, which is why I don't think this is the right
approach.

Arranging to delete the right documents might not be too hard. All
documents for a particular subsite are indexed by the same H and P term
combination, so we can just check each deletion candidate against those
two postlists (hurrah for skip_to!). That should be pretty efficient.

The only problem I can see is that if indexroot is specified, we also
need to check each remaining deletion candidate against that, which I
think means we have to look in the document data for each one. Ick,
that's probably going to be slow. Or can anyone see a way around this
issue?
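[Editor's note: a minimal standalone sketch of the postlist-intersection idea described above. This is not omindex's actual code; Xapian's real API is C++ (`PostingIterator::skip_to()` advances a posting list to the first docid >= a target). Here each postlist is modelled as a sorted Python list of docids, and the docid values are made up for illustration.]

```python
# Sketch: keep only the deletion candidates that belong to a given
# subsite, i.e. those appearing in BOTH the H (hostname) term postlist
# and the P (path) term postlist. skip_to() mimics the semantics of
# Xapian's PostingIterator::skip_to().
import bisect

def skip_to(postlist, pos, target):
    """Advance pos to the first entry in postlist with docid >= target."""
    return bisect.bisect_left(postlist, target, pos)

def subsite_candidates(candidates, h_postlist, p_postlist):
    """Return deletion candidates present in both postlists."""
    out = []
    hi = pi = 0
    for docid in sorted(candidates):
        hi = skip_to(h_postlist, hi, docid)
        pi = skip_to(p_postlist, pi, docid)
        in_h = hi < len(h_postlist) and h_postlist[hi] == docid
        in_p = pi < len(p_postlist) and p_postlist[pi] == docid
        if in_h and in_p:
            out.append(docid)
    return out

# Only docs 4 and 9 carry both the H term and the P term.
print(subsite_candidates({2, 4, 7, 9}, [1, 4, 6, 9, 12], [4, 5, 9, 10]))
# → [4, 9]
```

Because both postlists are walked monotonically with skip_to, each is scanned at most once regardless of how many candidates there are, which is why the check should be cheap in practice.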
We could just outlaw such partial updates, but that's probably
unreasonable. Perhaps disabling deletion in that case would do for now.
At least it's a more unusual situation, and it doesn't need a special
switch.

The other approach I can see is to move to having a configuration file
which describes what the index should contain. Then omindex would be
able to process all subsites in one pass, and so the "updated" map would
be correct. It also has the benefit that removing a whole subsite works.
However, updates of single subsites, or sections of subsites, still look
like they'd be awkward, so this doesn't seem to address the hard part of
the problem.

Actually, I also wonder if even skip_duplicates should really be
disabling the deletion. It would be easy and pretty cheap to look up the
document id for each skipped document and flag it as "updated" so it
didn't get deleted... I think the reason it currently doesn't is just an
oversight on my part.

Thoughts? It would be good to sort this out for 0.9.3, which I'm
starting to think about.

Cheers,
    Olly
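[Editor's note: a toy sketch of the "flag skipped documents as updated" suggestion above, again not omindex's actual code. The index is modelled as a plain URL-to-docid dict and the "updated" map as a set; all names and data are illustrative.]

```python
# Sketch: in skip_duplicates mode, look up the docid of each skipped
# (already-indexed) document and record it in the "updated" map, so the
# end-of-run deletion pass only removes documents that really vanished
# from the document tree.
def index_run(index, seen_urls, updated):
    for url in seen_urls:
        if url in index:
            # Duplicate: skip reindexing, but still flag it as present.
            updated.add(index[url])
        else:
            docid = max(index.values(), default=0) + 1
            index[url] = docid
            updated.add(docid)
    # Deletion pass: anything not flagged as updated was deleted on disk.
    for url, docid in list(index.items()):
        if docid not in updated:
            del index[url]

index = {"/a.html": 1, "/b.html": 2, "/gone.html": 3}
updated = set()
index_run(index, ["/a.html", "/b.html", "/new.html"], updated)
print(sorted(index))  # → ['/a.html', '/b.html', '/new.html']
```

With the `updated.add(index[url])` line removed (the behaviour being questioned), the skipped documents would also be swept away by the deletion pass despite still existing.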
James Aylett
2005-Oct-18 10:47 UTC
[Xapian-devel] Re: [Xapian-commits] 6355: trunk/xapian-applications/omega/ trunk/xapian-applications/omega/docs/
On Tue, Oct 18, 2005 at 07:07:34AM +0100, Olly Betts wrote:
> > omindex.cc: add --preserve-nonduplicates / -p option to not delete any
> > documents that aren't updated, in replace duplicates mode (so that
> > multiple runs of omindex on different subsites don't stomp on each
> > other).
>
> This fix seems to be avoiding the real issue, so it's less than ideal I
> feel.

I think the real issue is that omindex is trying to model two different
ways of looking at the world, one simple (without subsites) and one more
complex but very, very specific (with subsites). It tends to bite new
users, and it requires quite different bits of code to handle the
different options - yet it's all smushed together in the hope that it'll
all be okay, basically because of people gradually adding features
(first me, to support something I needed, then you for something else).
Currently omindex embodies TIMTOWTDI, which probably isn't ideal for an
out-of-the-box basic search system.

> Looking at the code, what it's really doing is turning off half of
> "skip_duplicates" - the bit at the end of the run where we delete
> any documents we've not seen (on the assumption that they've been
> deleted from the document tree since the previous index run).
>
> (Although I notice it still creates and updates the bitmap we use to
> track deleted documents, but that's easy enough to fix...)

I just needed a quick fix for someone :-)

[The new code will not delete old documents in subsites]

> Arranging to delete the right documents might not be too hard. All
> documents for a particular subsite are indexed by the same H and P term
> combination so we can just check each deletion candidate against those
> two postlists (hurrah for skip_to!) That should be pretty efficient.

Yeah, I was just lazy.

> The only problem I can see is that if indexroot is specified, we also
> need to check each remaining deletion candidate against that, which I
> think means we have to look in the document data for each one. Ick,
> that's probably going to be slow. Or can anyone see a way around
> this issue?

Changing the way omindex works to drop indexroot and make it all a lot
more obvious? This is a serious suggestion, by the way - I'm pretty sure
we can come up with a better model for omindex that doesn't confuse the
hell out of people when they first meet it.

> We could just outlaw such partial updates, but that's probably
> unreasonable. Perhaps disabling deletion in that case would do for now.
> At least it's a more unusual situation and it doesn't need a special
> switch.

Providing we document it clearly, that would probably be fine.

> The other approach I can see is to move to having a configuration file
> which describes what the index should contain. Then omindex would be
> able to process all subsites in one pass, and so the "updated" map would
> be correct. It also has the benefit that removing a whole subsite
> works. However updates of single subsites, or sections of subsites still
> look like they'd be awkward, so this doesn't seem to address the hard
> part of the problem.

Thinking quickly, but how about an omindex that uses a config file which
lists the subsites, where they are on disk and so forth, *but* can also
work in "simple" mode, without subsites and without a configuration file
at all? I think that means the awkwardness sits entirely within the
code, and deals with our new user problem (which is that omindex isn't
the easiest thing to drive; indeed I tend to have to read the code to
remind myself of fiddly details).

J

--
/--------------------------------------------------------------------------\
  James Aylett                                                    xapian.org
  james at tartarus.org                                uncertaintydivision.org