Gilles Polart-Donat
2005-Oct-29 11:34 UTC
[Xapian-discuss] How to deal with "zones" on documents
Hello,

I'm a newbie to Xapian, and after installing it (easily, so thanks to the development team) I have some questions.

I want a term to get a different weight in the results depending on which part of a document it comes from. For example, in HTML files, from the tags <title>, <h1>, <h2>, ...

Do I have to do this with separate databases (one per "zone") and run the query against all of them? Is there any documentation on this, or a thread in the mailing list archive, that I could read?

I am testing Xapian for use in the search engine of the collaborative software I work on (Mioga2: http://www.mioga2.org). At the moment we use mifluz from Senga, but it is a dead project.

Best regards,
Gilles Polart-Donat

PS: I'm a French guy with poor English, sorry. ;-)
On Sat, Oct 29, 2005 at 12:40:40PM +0200, Gilles Polart-Donat wrote:
> I want to have a different weight on result for a term, if it comes from
> differents parts of a document.
>
> For example, on HTML files, from tags <title>, <h1>, <h2>, ...

Assuming you know the relative weights you want to apply to the different tags, then just apply extra wdf to terms generated from these fields. You do this by passing a value greater than 1 for the optional third argument to Document::add_posting() (or the optional second argument to Document::add_term()). E.g.

    doc.add_posting(term, pos, wdfinc);
    doc.add_term(term, wdfinc);

So "wdfinc" is the "extra weight factor". Note that it must be an integer.

If you want to be able to tune the factors dynamically, you could index terms from particular tags with particular prefixes (e.g. the term "xapian" in <h1> might be XH1:xapian). Then for each term in the query, you'd produce an OR of the various possible forms with the wqf (within query frequency) set appropriately for each form:

    (xapian OR XH1:xapian OR XH2:xapian OR ...)

The static way is likely to be more efficient, so use that if you can.

Cheers,
Olly
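
As a rough illustration of the two approaches described in the reply, here is a minimal C++ sketch that uses only the standard library, so it models the bookkeeping without calling Xapian itself. The zone names, the weight values, and the helper names `accumulate_wdf` and `expand_term` are all assumptions made up for this example. In real indexing code the accumulation step would be a call to `doc.add_term(term, wdfinc)`, and each expanded form would presumably become a query term carrying the given wqf, combined with OR.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-zone wdf increments. The values are illustrative only.
// They are integers because wdfinc must be an integer.
const std::map<std::string, unsigned> kZoneWdf = {
    {"title", 5}, {"h1", 3}, {"h2", 2}, {"body", 1}};

// Static approach: given (zone, term) pairs from one document, compute the
// total wdf each term ends up with. This stands in for repeated calls to
// doc.add_term(term, wdfinc); unknown zones fall back to an increment of 1.
std::map<std::string, unsigned> accumulate_wdf(
    const std::vector<std::pair<std::string, std::string>>& zone_terms) {
    std::map<std::string, unsigned> wdf;
    for (const auto& zt : zone_terms) {
        auto it = kZoneWdf.find(zt.first);
        unsigned wdfinc = (it != kZoneWdf.end()) ? it->second : 1u;
        wdf[zt.second] += wdfinc;
    }
    return wdf;
}

// Dynamic approach: list the forms of one query term that would be OR-ed
// together, each paired with the wqf it should carry. prefix_wqf maps a
// prefix such as "XH1" to its wqf; the unprefixed form gets wqf 1.
std::vector<std::pair<std::string, unsigned>> expand_term(
    const std::string& term,
    const std::map<std::string, unsigned>& prefix_wqf) {
    std::vector<std::pair<std::string, unsigned>> forms;
    forms.emplace_back(term, 1u);
    for (const auto& pw : prefix_wqf) {
        forms.emplace_back(pw.first + ":" + term, pw.second);
    }
    return forms;
}
```

With the static approach the weights are fixed at index time, so changing them means reindexing. With the prefixed forms the weights live in the query, at the cost of a larger index and a wider OR for every query term.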