Proposed changes to omindex

Currently Available Items
=========================

1) Have the Q prefix contain the 16-byte MD5 of the full file name used for document lookup during indexing.

2) Add the document's last modified time to the value table (ID 0). This would allow incremental indexing based on the timestamp and also sorting by date in omega (SORT=0).
   a. Currently I store the timestamp as a 10-byte string (left zero padded UNIX time string), i.e. 0969492426.
   b. However, for maximum space savings it could be stored as a 4-byte string in big-endian format, with a get/set utility function to handle the conversion if necessary.

3) Add the document's MD5 to the value table as a 16-byte string (the binary representation of the digest) (ID 1). This could be used as a secondary check for incremental indexing (i.e. if the file was touched but not changed, don't replace it) and also to collapse duplicates (COLLAPSE=1). The md5 source code is from the GNU textutils-2.1 package.

4) For files that require command line utility processing (i.e. pdftotext) I have added a --copylocal option. This allows the file to be digested while being copied to the local drive; the command line utility then processes the local file, saving multiple reads across the network. If we want to expand this it could be used to build a local cache/backup/repository. For my use I was thinking of putting the files under source control (svn), but that is another discussion thread.

5) I would also recommend storing the full filename in the document data, e.g. file=/mnt/vol1/www/sample.html. I have a purge utility that cleans out documents that are no longer found on the file system using this information. FYI: I am currently migrating to a MySQL metadata repository that will move information like this out of the search index; it also preserves metadata on complete index rebuilds and allows users to add additional information that may not be contained in the actual document.

Future Items
============

6) Stream indexer. Instead of reading the entire file into memory, process it line by line. This should make indexing large files more efficient.

7) Clean up the FIXMEs in the mime type handlers, i.e. "// FIXME: run pdfinfo once and parse the output ourselves." I would use pcre to extract the desired text.

8) Change the way stemmed terms are added to the database. Remove the R prefix from raw terms and only write stemmed terms to the DB if they differ from the original term, prefixing them with Z?. If stemming was set to none this would reduce the current term tables (termlist, postlist, and position) by about 50%. The query parser would have to be modified to use the same rules.

Let me know if you are interested in including any of these changes in Xapian.

Thanks,
Trink
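As a rough illustration only (not the actual patch), points 1-3 might look something like the following when building a document for indexing. md5_binary() is a hypothetical helper standing in for whatever digest routine ends up being used, and the 4-byte packing corresponds to option 2b:

    // Hedged sketch of points 1-3, not the proposed code itself.
    #include <xapian.h>
    #include <ctime>
    #include <string>

    // Hypothetical helper (not part of Xapian or omindex): returns the raw
    // 16-byte MD5 digest of its argument.
    std::string md5_binary(const std::string &data);

    void add_lookup_fields(Xapian::Document &doc,
                           const std::string &path,
                           time_t last_mod,
                           const std::string &file_contents)
    {
        // 1) Unique ID term: Q prefix + MD5 of the full file name.
        doc.add_term("Q" + md5_binary(path));

        // 2b) Last modified time packed as a 4-byte big-endian string (value 0).
        std::string t(4, '\0');
        t[0] = char((last_mod >> 24) & 0xff);
        t[1] = char((last_mod >> 16) & 0xff);
        t[2] = char((last_mod >> 8) & 0xff);
        t[3] = char(last_mod & 0xff);
        doc.add_value(0, t);

        // 3) Raw 16-byte MD5 of the document contents (value 1), usable for
        //    change detection and for COLLAPSE=1.
        doc.add_value(1, md5_binary(file_contents));
    }

Unpacking the timestamp on the search side just reverses the byte shifts.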
Michael Trinkala wrote:
> 4) For files that require command line utility processing (i.e. pdftotext) I have added a
> --copylocal option. This allows the file to be digested while being copied to the local drive and
> then the command line utility processes the local file saving multiple reads across the network.
> If we want to expand this it could be used to build a local cache/backup/repository. For my use I
> was thinking of putting the files under source control (svn) but that is another discussion
> thread.

I already have a cache_dir option in my omega.conf and successfully use it in omindex for recursive local zip/rar/msg/pst "virtual directories", last_mod checked. MSVC not supported, sorry. I'll clean it up and post it here.

Your idea to cache the output of costly extractors, like xls2csv and pdftotext, also seems promising. But with the last_mod check implemented it is not really needed IMHO.

--
Reini Urban  http://phpwiki.org/  http://murbreak.at/
http://helsinki.at/  http://spacemovie.mur.at/
On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:

> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.

I don't think this is generally useful, for reasons previously given: omega/omindex are really targeted at indexing and searching web sites, where the URI is the identifier. A filename used to provide a representation of that resource isn't at all interesting to omega, and is only partly interesting to omindex (i.e. there are other ways of doing it).

omindex is pretty limited in any case, and if you're doing anything funky you'll be using scriptindex or your own indexer. Within that, how you generate Q-terms and manage your documents is of course entirely up to you.

> 4) For files that require command line utility processing
> (i.e. pdftotext) I have added a --copylocal option. This allows the
> file to be digested while being copied to the local drive and then
> the command line utility processes the local file saving multiple
> reads across the network. If we want to expand this it could be used
> to build a local cache/backup/repository. For my use I was thinking
> of putting the files under source control (svn) but that is another
> discussion thread.

This is neat. I agree that for anything more complex it's not actually going to solve all the requirements, but for remote files it can work. (Although any decent network fs has built-in caching, and in any case you could rely on the OS buffers - if you open() first, then dup() the filedes, then use fdopen() to turn it into a FILE* - twice - there's very little reason you'll have to hit the network twice, even on a lame net fs; see the sketch after this message. Do you have any timing data on how much this improves things for you?)

> 5) I would also recommend storing the full filename in the document
> data. file=/mnt/vol1/www/sample.html. I have a purge utility that
> cleans out documents that are no longer found on the file system
> using this information. FYI: I am currently migrating to a MySQL
> metadata repository that will move information like this out of the
> search index; it also preserves metadata on complete index rebuilds
> and allows users to add additional information that may not be
> contained in the actual document.

omindex has its own mechanism for purging documents that no longer exist. Again, the separation of logical URI from physical storage pushes me in the direction of not wanting this in omindex. One idea I've talked to someone about is separating omindex into something that drives scriptindex, which in theory would allow you to use the file spider in omindex with whatever indexing strategy you wanted.

Speaking of metadata, what I'd really like is a Xapian-indexable RDF store. I doubt anyone else wants one of those though :-)

> 8) Change the way stemmed terms are added to the database. Remove
> the R prefix from raw terms and only write stemmed terms to the DB
> if they differ from the original term, prefixing them with Z?. If
> stemming was set to none this would reduce the current term tables
> (termlist, postlist, and position) by about 50%. The query parser
> would have to be modified to use the same rules.

Currently, you only get dual terms if the initial letter is a capital.
On a sample database I have here of an old blog, I have:

  24535 terms in total
   8157 R-terms
   1718 other prefixed terms

So we'd get a saving of 33% by dropping R-terms when stemming; however we'd then lose much if not all of that saving (which I can't calculate without passing over the original data again) by having to put stemmed versions back in again, whether an R-term would have been generated or not.

Mind you, a *very* quick test suggests that on some of my data, no more than 25% of words actually stem to something different. I suspect this is because there are lots of short words in everyday English. So there could be some saving here.

If you're not using stemming, and are content to force everything into lowercase (modulo the excitement that causes with Unicode), dropping R-terms seems a good strategy. I'd certainly favour having a way of running the query parser that didn't need R-terms, and then perhaps a way of driving omindex/scriptindex to not generate them in the first place. It's a pretty easy change, in index_text.cc:index_text().

I think this all comes down to whether you think stemming is a good default or not. If you're more concerned about stemmed forms, you want them to be obvious and probably unprefixed. (It's certainly easier to debug this way.)

> Let me know if you are interested in including any of these changes
> in Xapian.

I think the best thing is to wait until Olly's back and has a chance to digest all these and comment on them himself. It's really up to him what goes in anyway :-)

James

--
/--------------------------------------------------------------------------\
  James Aylett                                                    xapian.org
  james at tartarus.org                                uncertaintydivision.org
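For reference, a minimal sketch (assuming POSIX I/O; not code from omindex) of the single-open, double-read idea James mentions above for --copylocal: the file is opened once, read once to digest it and once to feed the filter, and the second pass normally comes out of the OS page cache rather than going back over the network:

    // Sketch only: open the file once, read it twice via two stdio streams.
    // dup()ed descriptors share a file offset, so rewind before the second pass.
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    bool read_twice(const char *path)
    {
        int fd = open(path, O_RDONLY);       // one open on the network fs
        if (fd < 0) return false;

        FILE *first = fdopen(dup(fd), "r");  // first pass (e.g. MD5 digest)
        // ... read *first to EOF ...
        fclose(first);

        lseek(fd, 0, SEEK_SET);              // shared offset: rewind for pass two
        FILE *second = fdopen(fd, "r");      // second pass (e.g. feed the filter)
        // ... read *second ...
        fclose(second);                      // also closes fd
        return true;
    }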
On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:

> Proposed changes to omindex

One suggestion before I go into details - even if some of these patches may not be things we'd want to include in the mainstream releases right now, they may still be of interest to some other users. So I'd encourage you to offer them for download, or just post them here if they aren't too big. The same goes for other people with patches they're happy to share.

> Currently Available Items
> =========================
>
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.

There are two issues here really. The first is whether the unique id should be based on the file path or the URL. Currently omindex uses the URL, but the file path could be used instead. The main difference I can see is that it would allow the URL mappings to be changed without a reindex (providing the omega CGI applied the mappings at search time), but I'm not sure how useful that really is - I can't remember the last time I reconfigured the URL to file mappings on any webserver I maintain. On the flip side, currently you can move the physical locations of files around and change the URL mappings in the web server so the URLs remain the same, and omindex won't have to reindex a thing. That actually seems a more likely scenario to me (though again I can't remember the last time I've actually done this). As James has said, we've discussed this before and ended up staying with how things are, mostly because there didn't seem to be any particular advantage to changing.

The other issue is that terms have an upper length limit, so you need a way to cope with overly long URLs/file paths when building UID terms. Currently we only hash if the URL is over about 240 characters. The problem with always using only a hash is that you can get collisions even with modest length URLs/paths. While this might seem a bit of a theoretical risk, there are "only" 256^16 MD5 sums, but more than 255^240 file names which would easily fit in a term - that's more than 10^539 file names per MD5 sum, so really a very large number of possible collisions! Even if you only consider filenames including alphanumerics, "_", "-", and "/", it's still more than 10^395 file names per MD5 sum.

> 2) Add the document's last modified time to the value table (ID 0).
> This would allow incremental
> indexing based on the timestamp and also sorting by date in omega (SORT=0)
> a. Currently I store the timestamp as a 10 byte string (left zero
> padded UNIX time string) i.e. 0969492426
> b. However, for maximum space savings it could be stored as a 4 byte
> string in big endian format with a get/set utility function to handle
> the conversion if necessary.

I think this would be very useful. I tend to think storing the number in 4 bytes (or perhaps 5, to take us past 2038...) is worth the effort, since you have to convert the number when storing and retrieving it as a string anyway. The functions needed are available already (on Unix at least) as htonl and ntohl.

> 3) Add the document's MD5 to the value table as a 16 byte string
> (binary representation of the digest) (ID 1). This could be used as a
> secondary check for incremental indexing (i.e. if the file was touched
> but not changed don't replace it) and also to collapse duplicates
> (COLLAPSE=1).
> The md5 source code is from the GNU textutils-2.1 package.

I think this would be useful too.
It'd be marginally better to use a non-GPL md5 implementation (we're trying to eliminate unrelicensable GPL code from the core library, but it'd be nice to be able to relicense Omega too). A quick Google reveals at least a couple of candidates, though I've not looked at either in any detail:

  http://sourceforge.net/projects/libmd5-rfc/   zlib/libpng License (BSD-ish)
  http://www.fourmilab.ch/md5/                  public domain

But unless the md5 API is complex, I imagine it'd be easy enough to drop one of these in instead at a later date. The GNU version should be very well tested at least, whereas the above implementations may be less so.

> 4) For files that require command line utility processing (i.e.
> pdftotext) I have added a --copylocal option. This allows the
> file to be digested while being copied to the local drive and then the
> command line utility processes the local file saving multiple reads
> across the network.

Have you actually benchmarked this? A decent OS should cache the file's contents and avoid the multiple reads across the network, so this could end up being slower than just reading the remote file twice (because the file needs to be written and flushed to local disk before the filter program gets run). If it really does help, it seems a useful addition.

> If we want to expand this it could be used to
> build a local cache/backup/repository. For my use I was thinking of
> putting the files under source control (svn) but that is another
> discussion thread.

I think backup and source control are really outside the scope of omindex, unless I misunderstand what you're suggesting here.

> 5) I would also recommend storing the full filename in the document
> data. file=/mnt/vol1/www/sample.html. I have a purge utility that
> cleans out documents that are no longer found on the file system using
> this information.

As James says, we have a different approach to purging removed files during indexing which doesn't require this field. I don't object strongly to adding it if it's actually useful though.

> FYI: I am currently migrating to a MySQL metadata repository that will
> move information like this out of the search index; it also preserves
> metadata on complete index rebuilds and allows users to add additional
> information that may not be contained in the actual document.

There's certainly something to be said for keeping information useful for (re)indexing but not for searching in a separate place. The downside is that it's hard to flush the Xapian index and the metadata store atomically, so you need a robust strategy to cope with indexing being interrupted when the two aren't in sync.

> Future Items
> ============
> 6) Stream indexer. Instead of reading the entire file into memory,
> process it line by line. This should make indexing large files
> more efficient.

Line-by-line isn't much better - it's not unusual to find long HTML documents which are all on one line (e.g. those produced on an old Mac where the end of line character is different, or those generated by a script). But some sort of chunked reading isn't a bad idea. The HTML parser currently relies on indefinite lookahead, which may be awkward to do while dealing with chunks, but that can probably be fixed without changing how HTML documents parse in cases which actually matter.

> 7) Clean up the FIXMEs in mime type handlers i.e. // FIXME: run
> pdfinfo once and parse the output ourselves. I would use pcre to
> extract the desired text.

Even PCRE is really overkill, as you're looking for a constant string in every case.
It's just sheer laziness that I didn't do it right to start with. Sorry.

> 8) Change the way stemmed terms are added to the database. Remove the
> R prefix from raw terms and only write stemmed terms to the DB if they
> differ from the original term, prefixing them with Z?.

"Z?" doesn't match our existing conventions for prefixes, but the choice of prefix is just cosmetic.

This would mean that a search for "words which stem to 'foo'" would become foo OR Z?foo, which will be slower and give less accurate statistics (though there'll probably be some speed gain from reduced VM file cache pressure in many cases).

So are you suggesting we should generate the non-stemmed terms for every word? Currently R terms are only generated for capitalised words, which is really done to allow searches for proper nouns without problems caused by stemming. However, this feature is sometimes problematic itself - people type capitalised words in queries without knowing about the feature, and sometimes the results returned aren't great.

> If stemming
> was set to none this would reduce the current term tables (termlist,
> postlist, and position) by about 50%. The query parser would
> have to be modified to use the same rules.

Is it really as much as 50% for all of them? We only generate R terms for capitalised words, so this surprises me.

I've actually been thinking of reworking how we handle indexing of stemmed and unstemmed forms myself. No firm conclusions, but I've been wondering about indexing all words unstemmed with positional information and all stemmed forms without. This would mean that we could still support phrase searching as we currently implement it, and NEAR for unstemmed words. A capitalised word in a query could search for an unstemmed form, and a non-capitalised word for a stemmed form. Also, stemming could be turned off at query time. This would save slightly more space in the position table than your approach, but not as much in the termlist or postlist tables. Perhaps some combination of our ideas would work. I think I need to mull it over more.

Cheers,
    Olly
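To illustrate the constant-string idea Olly suggests for point 7, a hedged sketch of pulling one field out of pdfinfo's output without PCRE; the field labels assumed here ("Title:", "Author:") are simply the ones pdfinfo normally prints:

    // Sketch only: scan pdfinfo output for a line starting with a fixed label
    // and return the rest of that line with leading whitespace stripped.
    #include <sstream>
    #include <string>

    std::string get_pdfinfo_field(const std::string &output,
                                  const std::string &field)
    {
        std::istringstream in(output);
        std::string line;
        while (std::getline(in, line)) {
            if (line.compare(0, field.size(), field) == 0) {
                std::string::size_type i = field.size();
                while (i < line.size() && (line[i] == ' ' || line[i] == '\t'))
                    ++i;
                return line.substr(i);
            }
        }
        return std::string();
    }

    // e.g. get_pdfinfo_field(pdfinfo_output, "Title:")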
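And, purely as an illustration of the indexing scheme Olly floats at the end (combined with Michael's point 8 suggestion of only adding a stemmed form when it differs; the "Z" prefix is just a placeholder, since the prefix choice is cosmetic): unstemmed terms carry positional information, stemmed forms are added without positions.

    // Sketch, not an omindex patch: index unstemmed words with positions and
    // stemmed forms (when different) without positions.
    #include <xapian.h>
    #include <string>
    #include <vector>

    void index_words(Xapian::Document &doc,
                     const std::vector<std::string> &words,
                     const Xapian::Stem &stemmer)
    {
        Xapian::termpos pos = 0;
        for (std::vector<std::string>::const_iterator i = words.begin();
             i != words.end(); ++i) {
            doc.add_posting(*i, ++pos);      // unstemmed, with position
            std::string stemmed = stemmer(*i);
            if (stemmed != *i)
                doc.add_term("Z" + stemmed); // stemmed, no positional data
        }
    }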