John Pye
2007-Jul-12 09:48 UTC
[Xapian-discuss] omega: omindex behaviour with duplicate files
Hi all

I need a little clarification with regard to Omega's behaviour with
'duplicate' files when running 'omindex'.

How is a duplicate recognised? Is it simply by file path? How is an
unmodified file detected, if at all?

I would like to set up a Subversion post-commit hook to update my index.
If possible I would like to just update the index with the newly
committed files. What is the most efficient way to do this? Is it
something that has already been implemented by others?

Secondly, is there any way that the verbosity of the omindex output can
be reduced? I would like it if there were a '--quiet' option that only
output information about files that were actually being reindexed.

I would like to set up this post-commit hook so that documents deleted
from the repository are correctly removed from the index. At present my
post-commit hook script works by brute force, and looks like this:

    #!/bin/sh
    cd /data/omegadocs && svn up
    omindex -d ignore --db /var/lib/omega/data/default --url /svn/ /data/omegadocs

If there are any tips for improving this, it would be much appreciated.

Cheers
JP

--
John Pye
Department of Mechanical and Manufacturing Engineering
University of New South Wales, Sydney, Australia
http://pye.dyndns.org/
James Aylett
2007-Jul-12 11:29 UTC
[Xapian-discuss] omega: omindex behaviour with duplicate files
On Thu, Jul 12, 2007 at 06:48:39PM +1000, John Pye wrote:

> I need a little clarification with regard to Omega's behaviour with
> 'duplicate' files when running 'omindex'.
>
> How is a duplicate recognised? Is it simply by file path? How is an
> unmodified file detected, if at all?

It's done by constructed URL path. You could use the calculated MD5
hash to do modification detection, but it doesn't right now.

> I would like to set up subversion post-commit hook to update my index.
> If possible I would like to just update the index with the newly
> committed files. What is the most efficient way to do this? Is it
> something that has already been implemented by others?

Right now this can't be done using omindex. I *think* I posted a
potential patch a while back (or possibly just how to write the code)
so that you could provide a filename instead of a directory to
omindex. If you combine that with the -p switch, you can reindex a
single file at a time.

> Secondly, is there any way that the verbosity of the omindex output can
> be reduced? I would like it if there were a '--quiet' option that only
> output information about files that were actually being reindexed.

That's a good idea, but there's no way of doing it without changing
the code right now. If you can identify which messages you think
should be eliminated in --quiet mode, I can make the changes for you.

> I would like to set up this post-commit hook so that documents deleted
> from the repository are correctly removed from the index. At present my
> post-commit hook script works by brute force, and looks like this:
>
> #!/bin/sh
> cd /data/omegadocs && svn up
> omindex -d ignore --db /var/lib/omega/data/default --url /svn/
> /data/omegadocs
>
> If there are any tips for improving this, it would be much appreciated.

I'd recommend using scriptindex for this, which can delete a single
document (or several documents) more efficiently. However, you do have
to be able to generate the unique U-term that omindex uses, which is
based on the constructed URL. It only gets fiddly if the URL is long;
delve(1) will help you construct them in the shorter cases, if you
can't read the omindex C++ source to find out the details.

J

--
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
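
As a rough illustration of the U-term point above, here is a minimal
sketch of how a post-commit hook might work out which terms correspond
to deleted files. It rests on several assumptions that are not
confirmed in this thread: that the term for a short URL is simply "U"
followed by the constructed URL, that repository paths map directly
onto the /svn/ URL prefix used in John's script, and that the paths
contain no characters needing escaping. The name deleted_uterms is
made up for this example. Long URLs are handled differently by omindex
and are not covered here, so check the results against delve before
relying on them.

```shell
#!/bin/sh
# Sketch only: deleted_uterms is a hypothetical helper, not part of Omega.
# It reads `svnlook changed` output on stdin and prints one candidate
# U-term per deleted file, formed as "U" + URL prefix + repository path.
# Assumption: short URLs with no characters that need escaping.
deleted_uterms() {
    url_prefix=$1
    while IFS= read -r line; do
        case $line in
            "D   "*)
                path=${line#"D   "}
                case $path in
                    */) ;;  # deleted directory: svnlook lists only the directory itself
                    *)  printf 'U%s%s\n' "$url_prefix" "$path" ;;
                esac
                ;;
        esac
    done
}

# Intended use inside a post-commit hook, where $1 is the repository
# path and $2 is the new revision number:
#   svnlook changed -r "$2" "$1" | deleted_uterms /svn/
```

The terms printed could then be passed to whatever deletion mechanism
is used. Files under a deleted directory are not listed individually by
svnlook, so they would need separate handling.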