Henry
2008-Nov-18 12:35 UTC
[Xapian-discuss] add_posting(): term position significance - line or offset?
Greets,

WRT add_posting() and the term's position: presumably it's best to use the actual offset in the source as the position, rather than the line number containing the term, right?

I take it this may result in more accurate phrase searching, and better general search results since term items' proximity would increase their score. Correct?

Thanks
Henry
Richard Boulton
2008-Nov-18 16:38 UTC
[Xapian-discuss] add_posting(): term position significance - line or offset?
Henry wrote:
> Greets,
>
> WRT add_posting() and the term's position: presumably it's best to
> use the actual offset in the source as the position, rather than the
> line number containing the term, right?

The usual use is to store the "word number" at which a word appears, and this is probably what you want. However, you could store the line number if you wanted: phrase searches (with a window of phrase-size) would then match when the words were fairly spread out (ie, up to one per line).

I recommend using word number, anyway, unless you have a very odd situation I've not thought of.

> I take it this may result in more accurate phrase searching, and
> better general search results since term items' proximity would
> increase their score.

Note that Xapian currently doesn't modify the weight of a phrase based on how close together the terms are - phrase searches either match a phrase (in which case the weight is the sum of the weights of the constituent terms), or don't match the phrase (in which case the phrase contributes no weight, and the document won't be returned (unless other parts of the query match it)).

This is something that could be improved, but we haven't had the time (or motivation) to fix it yet...

--
Richard
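To make the word-number versus line-number trade-off concrete, here is a minimal pure-Python sketch. It is a toy model, not Xapian's implementation: the helper names (`word_positions`, `line_positions`, `phrase_matches`) are invented for illustration, and the phrase check assumes a window equal to the phrase size, so each term must sit exactly one position after the previous one. In a real indexer, each (term, position) pair computed this way would be what gets passed to add_posting().

```python
PUNCT = '.,;:!?"'

def word_positions(text):
    """Map each term to the 1-based word numbers at which it occurs."""
    positions = {}
    for word_no, word in enumerate(text.lower().split(), start=1):
        positions.setdefault(word.strip(PUNCT), []).append(word_no)
    return positions

def line_positions(text):
    """Map each term to the 1-based line numbers on which it occurs."""
    positions = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in line.lower().split():
            plist = positions.setdefault(word.strip(PUNCT), [])
            if line_no not in plist:
                plist.append(line_no)
    return positions

def phrase_matches(positions, terms):
    """Toy phrase check (assumption: window == phrase size).

    Matches when the terms occur, in order, at consecutive positions.
    """
    first, *rest = terms
    for start in positions.get(first, []):
        if all(start + i in positions.get(term, [])
               for i, term in enumerate(rest, start=1)):
            return True
    return False

text = "the candle was lit\nand the stick was made of wood"
phrase = ["candle", "stick"]

# Word numbers: "candle" is word 2 and "stick" is word 7, so no phrase match.
print(phrase_matches(word_positions(text), phrase))   # False
# Line numbers: "candle" is on line 1 and "stick" on line 2, so it matches,
# even though the words are five words apart.
print(phrase_matches(line_positions(text), phrase))   # True
```

With line numbers as positions, the "phrase" matches although the two words are on different lines and far apart, which is the looser behaviour described above.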
Henry
2008-Nov-18 17:18 UTC
[Xapian-discuss] add_posting(): term position significance - line or offset?
> The usual use is to store the "word number" at which a word appears,
> and this is probably what you want. However, you could store the
> line number if you wanted: phrase searches (with a window of
> phrase-size) would then match when the words were fairly spread out
> (ie, up to one per line).
>
> I recommend using word number, anyway, unless you have a very odd
> situation I've not thought of.

Thanks - I hadn't even thought of word number.

> Note that Xapian currently doesn't modify the weight of a phrase
> based on how close together the terms are ...

Sorry, I wasn't very clear: I was thinking in terms of normal non-phrase searches. ie, searching for [ +candle +stick ] in:

    "...the candle stick was made of gold..."

would score higher (because of the proximity of the words, posting weights aside) than:

    "...the boy decided to use a stick made of wood to break the candle..."

where 'stick' and 'candle' are further apart.

Anyway, you've answered my question, thanks!

Regards
Henry
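The proximity idea in the candle/stick example can be sketched in a few lines of Python. This is purely illustrative: per the thread, Xapian does not score this way, and both `min_distance` and `proximity_bonus` (including the `1 / (1 + distance)` formula) are hypothetical names and choices made up for this sketch.

```python
def min_distance(words, term_a, term_b):
    """Smallest gap, in word positions, between any occurrence of the two terms.

    Returns None if either term is absent.
    """
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def proximity_bonus(distance):
    """Hypothetical bonus that shrinks as the terms get further apart."""
    if distance is None:
        return 0.0
    return 1.0 / (1 + distance)

near = "the candle stick was made of gold".split()
far = "the boy decided to use a stick made of wood to break the candle".split()

for words in (near, far):
    d = min_distance(words, "candle", "stick")
    print(d, proximity_bonus(d))
```

In the first sentence the two terms are adjacent (distance 1); in the second they are 7 words apart, so any bonus that decreases with distance would rank the first sentence higher, all other weights being equal.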
Henry
2008-Nov-18 18:29 UTC
[Xapian-discuss] add_posting(): term position significance - line or offset?
> Not currently...
>
> Cheers,
> Olly

Pity - is that an issue which needs to be addressed in search-code only, or indexing and search?

Hmm, based on my admittedly superficial understanding of Xapian so far: if the positional info is available for all term postings, then could the search code not be extended to score higher for terms closer together?

This to my mind would be a rather important aspect of scoring, and one which I'd like to explore with a view to possible sponsored development (small personal purse, so don't get too excited:).

Thoughts?

Henry