Tony Lambiris
2009-Jan-19 06:26 UTC
[Xapian-discuss] Is this a correct method of indexing?
I'm kind of new to Xapian and search in general, but I am in the process of working with Xapian to index documents, and I am becoming a little confused by all the functions, since from a top-level view many of them appear to accomplish much the same thing.

What I am trying to do now is basically index a document, but I want to add more weight to the document title. After multiple tries with all the various functions (e.g. add_term, add_posting, etc.), this is what I ended up with:

    doc.add_term(doc_title, 100);

The idea being that if the query matches the exact title, I want to really rank it high. After that I use index_text_without_positions to index the entire document, as I won't be using any phrase or NEAR queries, and I also read that this method takes up less space.

Does it appear I am doing everything correctly? I don't know if it's overkill to index the entire document or not, or if there are any preferred methods. I had toyed with the idea of indexing only the first paragraph of the document, but I wanted to keep the input method totally unobtrusive when it came to the format of the text. All I care about is the title (or file name) and the contents, but I don't know if this is the best approach: the database grows quite large and indexing slows down dramatically.

Thanks in advance for your time.
On Mon, Jan 19, 2009 at 01:26:40AM -0500, Tony Lambiris wrote:
> I'm kind of new to Xapian and search in general, but I am in the
> process of working with Xapian to index documents and I am becoming a
> little confused as to all the functions, as from a top-level appear to
> accomplish much of the same thing.

Did you read this?

http://trac.xapian.org/wiki/FAQ/TermGenerator

> What I am trying to do now, is basically index a document but I want
> to add more weight to the document title. After multiple tries with
> all the various functions (ie: add_term, add_posting, etc), this is
> what I ended up with:
> doc.add_term(doc_title, 100);

Um, that indexes the title as a single term, which I don't think is what you want. You'd have to write your own custom query parsing code for such terms to be used when querying, and there's a limit on the length of terms, so this would fail for long titles.

> The idea being that if the query matches the exact title, I want to
> really rank it high. After that I use index_text_without_positions to
> index the entire document as I won't be using any phrase or NEAR
> queries, and I also read this method takes up less space.

Yes, it saves having to store data about the positions of terms in documents, which can be quite large.

> I don't know if it's over-kill to index the entire document or not, or
> if there are any preferred methods.

It's only overkill if you don't need to be able to search the whole document.

> the database grows quite large and indexing slows down dramatically.

How large is "quite large"? The FAQ discusses what sort of database size you should expect.

You can usually speed up indexing using XAPIAN_FLUSH_THRESHOLD, which is currently set rather conservatively by default:

http://xapian.org/docs/apidoc/html/classXapian_1_1WritableDatabase.html#d0077acafa9485c97b73b8726c375732

Cheers,
Olly
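For reference, XAPIAN_FLUSH_THRESHOLD is an environment variable, so it can be set in the shell that launches the indexer or from inside the indexing process itself. A minimal Python sketch of the second option follows. It assumes the variable is read when the writable database is opened, so it has to be set before that point. The helper name set_flush_threshold and the value 100000 are illustrative only and do not come from this thread.

```python
import os


def set_flush_threshold(documents: int) -> None:
    """Hypothetical helper: request a flush after this many changed documents."""
    if documents <= 0:
        raise ValueError("threshold must be a positive number of documents")
    # Environment variables are strings, so store the count as text.
    os.environ["XAPIAN_FLUSH_THRESHOLD"] = str(documents)


# Assumption: call this before the writable database is opened.
set_flush_threshold(100000)
```

A larger threshold means more changes are buffered in memory between flushes, so the right value depends on how much RAM the indexer can spare.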
On Mon 19/01/09 8:26 AM , "Tony Lambiris" tonylambiris at gmail.com sent:
> I don't know if it's
> over-kill to index the entire document or not, or if there are any
> preferred methods. I had toyed with the idea of indexing only the
> first paragraph of the document, but I wanted to keep the input method
> totally unobtrusive when it came to the format of the text. All I care
> about is the title (or file name) and the contents, but I don't know
> if this is the best approach.... the database grows quite large and
> indexing slows down dramatically.

I suppose it depends on the intended purpose of your search app. For typical search engine apps, it's common to index documents up to a specific size (e.g., 100KB) only.

I suggest you keep it simple. Xapian reminds me of UNIX: there are so many ways of doing things that it can be daunting initially. Use a simple TermGenerator, e.g. with Perl:

    my $index = Search::Xapian::WritableDatabase->new( $index_path, DB_CREATE_OR_OVERWRITE );
    my $doc_text = substr($text, 0, $max);
    my $tit_weight = 100;
    my $tg = Search::Xapian::TermGenerator->new;
    # ...set_stemmer, set_stopper, set_document...

    # index the body text, letting index_text() take care of
    # all the term details.
    $tg->index_text($doc_text);

    # simulate a 'field' with a prefix, boosting its weight
    # (the second argument is the wdf increment).
    # this way you can search for [title:bob] if you want to.
    # either way, if there's a hit in this title, it'll score a bit higher.
    $tg->index_text($doc_title, $tit_weight, 'XTITLE');
    ...
    $index->add_document($doc);

The nice thing about using this simple approach is that it's easier to understand what the hell is going on initially, and you can always expand on it, getting all dirty 'n sexy as needed.

Cheers
Henry
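To make the difference between the two approaches in this thread concrete, here is a toy Python model of the terms each one produces. It is not Xapian code and not how Xapian is implemented: the functions single_term and per_word_terms are hypothetical stand-ins for add_term(title, 100) and index_text(text, wdf_inc, prefix), and the model ignores stemming, stopwords, and positions. The sample title and body are made up for illustration.

```python
import re
from collections import Counter


def single_term(title: str, wdf_inc: int) -> Counter:
    """Toy model of add_term(title, wdf_inc): the whole title is ONE term."""
    return Counter({title: wdf_inc})


def per_word_terms(text: str, wdf_inc: int = 1, prefix: str = "") -> Counter:
    """Toy model of index_text(text, wdf_inc, prefix): one term per word."""
    terms = Counter()
    for word in re.findall(r"\w+", text.lower()):
        # Each word becomes its own (optionally prefixed) term.
        terms[prefix + word] += wdf_inc
    return terms


title = "Indexing With Xapian"
body = "Notes on indexing documents with Xapian."

# Approach from the original question: one long term for the title.
as_single = single_term(title, 100)

# Approach from the reply: body words plus boosted, prefixed title words.
as_words = per_word_terms(body) + per_word_terms(title, 100, "XTITLE")

for term, wdf in sorted(as_words.items()):
    print(f"{term}\t{wdf}")
```

In the single-term case a query only matches if it produces exactly the string "Indexing With Xapian" as one term, which is why custom query parsing would be needed. In the per-word case each title word is searchable on its own under the XTITLE prefix, and no single term grows with the length of the title.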