Proposed changes to omindex

Currently Available Items
=========================

1) Have the Q prefix contain the 16-byte MD5 of the full file name used for document lookup during indexing.

2) Add the document's last modified time to the value table (ID 0). This would allow incremental indexing based on the timestamp and also sorting by date in omega (SORT=0).
   a. Currently I store the timestamp as a 10-byte string (left zero padded UNIX time string), i.e. 0969492426.
   b. However, for maximum space savings it could be stored as a 4-byte string in big-endian format, with a get/set utility function to handle the conversion if necessary.

3) Add the document's MD5 to the value table as a 16-byte string (the binary representation of the digest) (ID 1). This could be used as a secondary check for incremental indexing (i.e. if the file was touched but not changed, don't replace it) and also to collapse duplicates (COLLAPSE=1). The md5 source code is from the GNU textutils-2.1 package.

4) For files that require command line utility processing (i.e. pdftotext) I have added a --copylocal option. This allows the file to be digested while being copied to the local drive; the command line utility then processes the local file, saving multiple reads across the network. If we want to expand this it could be used to build a local cache/backup/repository. For my use I was thinking of putting the files under source control (svn), but that is another discussion thread.

5) I would also recommend storing the full filename in the document data, e.g. file=/mnt/vol1/www/sample.html. I have a purge utility that cleans out documents that are no longer found on the file system using this information. FYI: I am currently migrating to a MySQL metadata repository that will move information like this out of the search index; it also preserves metadata on complete index rebuilds and allows users to add additional information that may not be contained in the actual document.

Future Items
============

6) Stream indexer. Instead of reading the entire file into memory, process it line by line. This should make indexing large files more efficient.

7) Clean up the FIXMEs in the mime type handlers, i.e. "// FIXME: run pdfinfo once and parse the output ourselves." I would use pcre to extract the desired text.

8) Change the way stemmed terms are added to the database. Remove the R prefix from raw terms and only write stemmed terms to the DB if they differ from the original term, prefixing them with Z?. If stemming was set to none this would reduce the current term tables (termlist, postlist, and position) by about 50%. The query parser would have to be modified to use the same rules.

Let me know if you are interested in including any of these changes in Xapian.

Thanks,
Trink
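As a rough illustration only (not the actual patch), points 1-3 might look something like the following when building a document for indexing. md5_binary() is a hypothetical helper standing in for whatever digest routine ends up being used, and the 4-byte packing corresponds to option 2b:

    // Hedged sketch of points 1-3, not the proposed code itself.
    #include <xapian.h>
    #include <ctime>
    #include <string>

    // Hypothetical helper (not part of Xapian or omindex): returns the raw
    // 16-byte MD5 digest of its argument.
    std::string md5_binary(const std::string &data);

    void add_lookup_fields(Xapian::Document &doc,
                           const std::string &path,
                           time_t last_mod,
                           const std::string &file_contents)
    {
        // 1) Unique ID term: Q prefix + MD5 of the full file name.
        doc.add_term("Q" + md5_binary(path));

        // 2b) Last modified time packed as a 4-byte big-endian string (value 0).
        std::string t(4, '\0');
        t[0] = char((last_mod >> 24) & 0xff);
        t[1] = char((last_mod >> 16) & 0xff);
        t[2] = char((last_mod >> 8) & 0xff);
        t[3] = char(last_mod & 0xff);
        doc.add_value(0, t);

        // 3) Raw 16-byte MD5 of the document contents (value 1), usable for
        //    change detection and for COLLAPSE=1.
        doc.add_value(1, md5_binary(file_contents));
    }

Unpacking the timestamp on the search side just reverses the byte shifts.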
Michael Trinkala wrote:
> 4) For files that require command line utility processing (i.e. pdftotext) I have added a
> --copylocal option. This allows the file to be digested while being copied to the local drive and
> then the command line utility processes the local file saving multiple reads across the network.
> If we want to expand this it could be used to build a local cache/backup/repository. For my use I
> was thinking of putting the files under source control (svn) but that is another discussion
> thread.

I already have a cache_dir option in my omega.conf and successfully use it in omindex for recursive local zip/rar/msg/pst "virtual directories", last_mod checked. MSVC not supported, sorry. I'll clean it up and post it here.

Your idea to cache the output of costly extractors, like xls2csv and pdftotext, also seems promising. But with the last_mod check implemented it is not really needed IMHO.

--
Reini Urban  http://phpwiki.org/  http://murbreak.at/
http://helsinki.at/  http://spacemovie.mur.at/
On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:

> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.

I don't think this is generally useful, for reasons previously given: omega/omindex are really targeted at indexing and searching web sites, where the URI is the identifier. A filename used to provide a representation of that resource isn't at all interesting to omega, and is only partly interesting to omindex (i.e. there are other ways of doing it).

omindex is pretty limited in any case, and if you're doing anything funky you'll be using scriptindex or your own indexer. Within that, how you generate Q-terms and manage your documents is of course entirely up to you.

> 4) For files that require command line utility processing
> (i.e. pdftotext) I have added a --copylocal option. This allows the
> file to be digested while being copied to the local drive and then
> the command line utility processes the local file saving multiple
> reads across the network. If we want to expand this it could be used
> to build a local cache/backup/repository. For my use I was thinking
> of putting the files under source control (svn) but that is another
> discussion thread.

This is neat. I agree that for anything more complex it's not actually going to solve all the requirements, but for remote files it can work. (Although any decent network fs has built-in caching, and in any case you could rely on the OS buffers - if you open() first, then dup() the filedes, then use fdopen() to turn it into a FILE* - twice - there's very little reason you'll have to hit the network twice, even on a lame net fs; see the sketch after this message. Do you have any timing data on how much this improves things for you?)

> 5) I would also recommend storing the full filename in the document
> data. file=/mnt/vol1/www/sample.html. I have a purge utility that
> cleans out documents that are no longer found on the file system
> using this information. FYI: I am currently migrating to a MySQL
> metadata repository that will move information like this out of the
> search index; it also preserves metadata on complete index rebuilds
> and allows users to add additional information that may not be
> contained in the actual document.

omindex has its own mechanism for purging documents that no longer exist. Again, the separation of logical URI from physical storage pushes me in the direction of not wanting this in omindex. One idea I've talked to someone about is separating omindex into something that drives scriptindex, which in theory would allow you to use the file spider in omindex with whatever indexing strategy you wanted.

Speaking of metadata, what I'd really like is a Xapian-indexable RDF store. I doubt anyone else wants one of those though :-)

> 8) Change the way stemmed terms are added to the database. Remove
> the R prefix from raw terms and only write stemmed terms to the DB
> if they differ from the original term, prefixing them with Z?. If
> stemming was set to none this would reduce the current term tables
> (termlist, postlist, and position) by about 50%. The query parser
> would have to be modified to use the same rules.

Currently, you only get dual terms if the initial letter is a capital.
On a sample database I have here of an old blog, I have:

  24535 terms in total
   8157 R-terms
   1718 other prefixed terms

So we'd get a saving of 33% by dropping R-terms when stemming; however we'd then lose much if not all of that saving (which I can't calculate without passing over the original data again) by having to put stemmed versions back in again, whether an R-term would have been generated or not.

Mind you, a *very* quick test suggests that on some of my data, no more than 25% of words actually stem to something different. I suspect this is because there are lots of short words in everyday English. So there could be some saving here.

If you're not using stemming, and are content to force everything into lowercase (modulo the excitement that causes with Unicode), dropping R-terms seems a good strategy. I'd certainly favour having a way of running the query parser that didn't need R-terms, and then perhaps a way of driving omindex/scriptindex to not generate them in the first place. It's a pretty easy change, in index_text.cc:index_text().

I think this all comes down to whether you think stemming is a good default or not. If you're more concerned about stemmed forms, you want them to be obvious and probably unprefixed. (It's certainly easier to debug this way.)

> Let me know if you are interested in including any of these changes
> in Xapian.

I think the best thing is to wait until Olly's back and has a chance to digest all these and comment on them himself. It's really up to him what goes in anyway :-)

James

--
/--------------------------------------------------------------------------\
  James Aylett                                                    xapian.org
  james at tartarus.org                                uncertaintydivision.org
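For reference, a minimal sketch (assuming POSIX I/O; not code from omindex) of the single-open, double-read idea James mentions above for --copylocal: the file is opened once, read once to digest it and once to feed the filter, and the second pass normally comes out of the OS page cache rather than going back over the network:

    // Sketch only: open the file once, read it twice via two stdio streams.
    // dup()ed descriptors share a file offset, so rewind before the second pass.
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    bool read_twice(const char *path)
    {
        int fd = open(path, O_RDONLY);       // one open on the network fs
        if (fd < 0) return false;

        FILE *first = fdopen(dup(fd), "r");  // first pass (e.g. MD5 digest)
        // ... read *first to EOF ...
        fclose(first);

        lseek(fd, 0, SEEK_SET);              // shared offset: rewind for pass two
        FILE *second = fdopen(fd, "r");      // second pass (e.g. feed the filter)
        // ... read *second ...
        fclose(second);                      // also closes fd
        return true;
    }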
On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:

> Proposed changes to omindex

One suggestion before I go into details - even if some of these patches may not be things we'd want to include in the mainstream releases right now, they may still be of interest to some other users. So I'd encourage you to offer them for download, or just post them here if they aren't too big. The same goes for other people with patches they're happy to share.

> Currently Available Items
> =========================
>
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.

There are two issues here really. The first is whether the unique id should be based on the file path or the URL. Currently omindex uses the URL, but the file path could be used instead. The main difference I can see is that it would allow the URL mappings to be changed without a reindex (providing the omega CGI applied the mappings at search time), but I'm not sure how useful that really is - I can't remember the last time I reconfigured the URL to file mappings on any webserver I maintain. On the flip side, currently you can move the physical locations of files around and change the URL mappings in the web server so the URLs remain the same, and omindex won't have to reindex a thing. That actually seems a more likely scenario to me (though again I can't remember the last time I've actually done this). As James has said, we've discussed this before and ended up staying with how things are, mostly because there didn't seem to be any particular advantage to changing.

The other issue is that terms have an upper length limit, so you need a way to cope with overly long URLs/file paths when building UID terms. Currently we only hash if the URL is over about 240 characters. The problem with always using only a hash is that you can get collisions even with modest length URLs/paths. While this might seem a bit of a theoretical risk, there are "only" 256^16 MD5 sums, but more than 255^240 file names which would easily fit in a term - that's more than 10^539 file names per MD5 sum, so really a very large number of possible collisions! Even if you only consider filenames including alphanumerics, "_", "-", and "/", it's still more than 10^395 file names per MD5 sum.

> 2) Add the document's last modified time to the value table (ID 0).
> This would allow incremental
> indexing based on the timestamp and also sorting by date in omega (SORT=0)
> a. Currently I store the timestamp as a 10 byte string (left zero
> padded UNIX time string) i.e. 0969492426
> b. However, for maximum space savings it could be stored as a 4 byte
> string in big endian format with a get/set utility function to handle
> the conversion if necessary.

I think this would be very useful. I tend to think storing the number in 4 bytes (or perhaps 5, to take us past 2038...) is worth the effort, since you have to convert the number when storing and retrieving it as a string anyway. The functions needed are available already (on Unix at least) as htonl and ntohl.

> 3) Add the document's MD5 to the value table as a 16 byte string
> (binary representation of the digest) (ID 1). This could be used as a
> secondary check for incremental indexing (i.e. if the file was touched
> but not changed don't replace it) and also to collapse duplicates
> (COLLAPSE=1).
> The md5 source code is from the GNU textutils-2.1 package.

I think this would be useful too.
It'd be marginally better to use a non-GPL md5 implementation (we're trying to eliminate unrelicensable GPL code from the core library, but it'd be nice to be able to relicense Omega too). A quick Google reveals at least a couple of candidates, though I've not looked at either in any detail:

  http://sourceforge.net/projects/libmd5-rfc/   zlib/libpng License (BSD-ish)
  http://www.fourmilab.ch/md5/                  public domain

But unless the md5 API is complex, I imagine it'd be easy enough to drop one of these in instead at a later date. The GNU version should be very well tested at least, whereas the above implementations may be less so.

> 4) For files that require command line utility processing (i.e.
> pdftotext) I have added a --copylocal option. This allows the
> file to be digested while being copied to the local drive and then the
> command line utility processes the local file saving multiple reads
> across the network.

Have you actually benchmarked this? A decent OS should cache the file's contents and avoid the multiple reads across the network, so this could end up being slower than just reading the remote file twice (because the file needs to be written and flushed to local disk before the filter program gets run). If it really does help, it seems a useful addition.

> If we want to expand this it could be used to
> build a local cache/backup/repository. For my use I was thinking of
> putting the files under source control (svn) but that is another
> discussion thread.

I think backup and source control are really outside the scope of omindex, unless I misunderstand what you're suggesting here.

> 5) I would also recommend storing the full filename in the document
> data. file=/mnt/vol1/www/sample.html. I have a purge utility that
> cleans out documents that are no longer found on the file system using
> this information.

As James says, we have a different approach to purging removed files during indexing which doesn't require this field. I don't object strongly to adding it if it's actually useful though.

> FYI: I am currently migrating to a MySQL metadata repository that will
> move information like this out of the search index; it also preserves
> metadata on complete index rebuilds and allows users to add additional
> information that may not be contained in the actual document.

There's certainly something to be said for keeping information useful for (re)indexing but not for searching in a separate place. The downside is that it's hard to flush the Xapian index and the metadata store atomically, so you need a robust strategy to cope with indexing being interrupted when the two aren't in sync.

> Future Items
> ============
> 6) Stream indexer. Instead of reading the entire file into memory,
> process it line by line. This should make indexing large files
> more efficient.

Line-by-line isn't much better - it's not unusual to find long HTML documents which are all on one line (e.g. those produced on an old Mac where the end of line character is different, or those generated by a script). But some sort of chunked reading isn't a bad idea. The HTML parser currently relies on indefinite lookahead, which may be awkward to do while dealing with chunks, but that can probably be fixed without changing how HTML documents parse in cases which actually matter.

> 7) Clean up the FIXMEs in mime type handlers i.e. // FIXME: run
> pdfinfo once and parse the output ourselves. I would use pcre to
> extract the desired text.

Even PCRE is really overkill, as you're looking for a constant string in every case.
It's just sheer laziness that I didn't do it right to start with. Sorry.

> 8) Change the way stemmed terms are added to the database. Remove the
> R prefix from raw terms and only write stemmed terms to the DB if they
> differ from the original term, prefixing them with Z?.

"Z?" doesn't match our existing conventions for prefixes, but the choice of prefix is just cosmetic.

This would mean that a search for "words which stem to 'foo'" would become foo OR Z?foo, which will be slower and give less accurate statistics (though there'll probably be some speed gain from reduced VM file cache pressure in many cases).

So are you suggesting we should generate the non-stemmed terms for every word? Currently R terms are only generated for capitalised words, which is really done to allow searches for proper nouns without problems caused by stemming. However, this feature is sometimes problematic itself - people type capitalised words in queries without knowing about the feature, and sometimes the results returned aren't great.

> If stemming
> was set to none this would reduce the current term tables (termlist,
> postlist, and position) by about 50%. The query parser would
> have to be modified to use the same rules.

Is it really as much as 50% for all of them? We only generate R terms for capitalised words, so this surprises me.

I've actually been thinking of reworking how we handle indexing of stemmed and unstemmed forms myself. No firm conclusions, but I've been wondering about indexing all words unstemmed with positional information and all stemmed forms without. This would mean that we could still support phrase searching as we currently implement it, and NEAR for unstemmed words. A capitalised word in a query could search for an unstemmed form, and a non-capitalised word for a stemmed form. Also, stemming could be turned off at query time. This would save slightly more space in the position table than your approach, but not as much in the termlist or postlist tables. Perhaps some combination of our ideas would work. I think I need to mull it over more.

Cheers,
    Olly
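To illustrate the constant-string idea Olly suggests for point 7, a hedged sketch of pulling one field out of pdfinfo's output without PCRE; the field labels assumed here ("Title:", "Author:") are simply the ones pdfinfo normally prints:

    // Sketch only: scan pdfinfo output for a line starting with a fixed label
    // and return the rest of that line with leading whitespace stripped.
    #include <sstream>
    #include <string>

    std::string get_pdfinfo_field(const std::string &output,
                                  const std::string &field)
    {
        std::istringstream in(output);
        std::string line;
        while (std::getline(in, line)) {
            if (line.compare(0, field.size(), field) == 0) {
                std::string::size_type i = field.size();
                while (i < line.size() && (line[i] == ' ' || line[i] == '\t'))
                    ++i;
                return line.substr(i);
            }
        }
        return std::string();
    }

    // e.g. get_pdfinfo_field(pdfinfo_output, "Title:")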
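And, purely as an illustration of the indexing scheme Olly floats at the end (combined with Michael's point 8 suggestion of only adding a stemmed form when it differs; the "Z" prefix is just a placeholder, since the prefix choice is cosmetic): unstemmed terms carry positional information, stemmed forms are added without positions.

    // Sketch, not an omindex patch: index unstemmed words with positions and
    // stemmed forms (when different) without positions.
    #include <xapian.h>
    #include <string>
    #include <vector>

    void index_words(Xapian::Document &doc,
                     const std::vector<std::string> &words,
                     const Xapian::Stem &stemmer)
    {
        Xapian::termpos pos = 0;
        for (std::vector<std::string>::const_iterator i = words.begin();
             i != words.end(); ++i) {
            doc.add_posting(*i, ++pos);      // unstemmed, with position
            std::string stemmed = stemmer(*i);
            if (stemmed != *i)
                doc.add_term("Z" + stemmed); // stemmed, no positional data
        }
    }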