Michiel Roding
2004-Dec-17 10:06 UTC
[Xapian-devel] Custom weight factors - pushing the relevancy ranking how we want it
Hi guys (and gals?),

We're using Xapian/Omega for indexing and searching forums. As is usual with forums, how relevant content is to a search is not determined only by the frequency or location of the terms; the date the topic was last modified is important as well.

Another issue we find is that the number of results is so overwhelming that the user is unable to find the right topic for his needs. Combining this with some statistics, we found that a very large share of the queries sent to Omega are the same. Keywords like windows, xp, dvd etc. are very popular. Therefore, we are considering building a "does this topic meet your search?" feature to store which topics are most relevant to the queries, as judged by the users.

Other features could be a lame attempt at PageRank-style relevancy, storing whether a user almost immediately skips a topic (irrelevant), etc.

But this needs to be stored (easy) and processed by Xapian in the sorting.

How could we go about this? Does Xapian somehow support these custom weight factors?

Michiel
James Aylett
2004-Dec-17 10:28 UTC
[Xapian-devel] Custom weight factors - pushing the relevancy ranking how we want it
On Fri, Dec 17, 2004 at 11:06:41AM +0100, Michiel Roding wrote:
> As forums are, the content that is relevant to a search is not just
> determined by the frequency or location of the terms; the date the topic
> has been last modified is important as well.
> Another issue we find is that the amount of results is so overwhelming,
> the user is unable to find the correct topic for his needs. Combining this
> with some statistics, we found that a very large part of the queries to
> Omega are the same. Keywords like windows, xp, dvd etc. are very popular.
> Therefore, we are contemplating to build a "does this topic meet your
> search?" feature to store which topics are most relevant to the queries as
> defined by the users.
> Other features could be a lame attempt at the PageRank relevancy, storing
> if a user almost immediately skips a topic (irrelevant) etc.
>
> But, this needs to be stored (easy) and processed by Xapian in the
> sorting.
>
> How could we go about this? Does Xapian somehow support these custom
> weight factors?

There's currently no way of using document values (pieces of information stored about the document) in the mix to calculate weights - you can write your own Weight scheme, but without access to the docid you can't look up things like this. I don't know enough about the internals of the matcher to know how much of a performance hit adding this kind of support would be.

Two things occur to me. Firstly, you could have a special term which you mix in to your probabilistic term list which means "this is a good topic". So if Xgoodtopic exists once, it means one user liked it. (You could add them automatically if the user doesn't skip, or you could make it explicit.) You might then have some luck with playing with the wdf of that term to boost some documents. The problem is that you'll end up with the corpus frequency of that term being very high, which will downplay its effect on document rank.
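The corpus-frequency effect described above can be illustrated with a little arithmetic. The sketch below uses a textbook BM25-style inverse document frequency, not Xapian's actual weighting code, and the forum size and term counts are made-up numbers chosen only for illustration.

```python
import math


def idf(num_docs: int, docs_with_term: int) -> float:
    """Textbook BM25-style inverse document frequency (non-negative variant).

    This is an illustrative formula, not Xapian's implementation.
    """
    return math.log(1.0 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


NUM_TOPICS = 10_000  # hypothetical forum size

# Hypothetical counts of how many topics carry a given marker term.
common_marker = idf(NUM_TOPICS, 6_000)  # a marker that most topics have acquired
rare_marker = idf(NUM_TOPICS, 50)       # a marker carried by only a few topics

print(f"marker on 6000 of 10000 topics: idf = {common_marker:.3f}")
print(f"marker on   50 of 10000 topics: idf = {rare_marker:.3f}")
```

Under these assumed counts the widely shared marker earns far less weight than the rare one, which is why a single "good topic" term loses influence as more topics acquire it.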
I suppose you could have XT<topic> for every single-term search, so XTwindows, XTdvd and so on. That would keep the corpus frequency down to a more manageable level, perhaps.

The other thing is that you may have luck with trying to automatically segment your top results. Say you grab the first 20, you could then see how similar these results are. One way of doing this that might work (but Olly or Richard will be able to give you a better answer :-) would be to get the ESet for the query with the RSet as each document in the MSet in turn, throwing the terms from the ESet back into the query and seeing which other documents from the original MSet come out of that new query. That should enable you to group related results to some extent, although it will depend on how your topics work.

J

--
James Aylett
xapian.org
james@tartarus.org
uncertaintydivision.org
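The result-grouping idea in James's message can be sketched without Xapian at all. The toy below is not the ESet/RSet API: it stands in for the expansion step by picking, for each seed document, the terms it shares with some but not all of the other results, and then checks which other results contain enough of those terms. The function name, parameters and sample data are all hypothetical.

```python
from collections import Counter


def related_within_results(results, query_terms, num_terms=3, min_shared=2):
    """For each result, list the other results sharing its distinctive terms.

    results: dict mapping a document id to the set of terms in that document.
    "Distinctive" here means shared by at least two results but not by all
    of them; this is an assumption standing in for Xapian's ESet scoring.
    """
    doc_freq = Counter(term for terms in results.values() for term in terms)
    related = {}
    for seed, seed_terms in results.items():
        candidates = sorted(
            (t for t in seed_terms
             if t not in query_terms and 2 <= doc_freq[t] < len(results)),
            key=lambda t: (doc_freq[t], t),  # rarest first, ties alphabetical
        )
        expansion = set(candidates[:num_terms])
        related[seed] = sorted(
            other for other, terms in results.items()
            if other != seed and len(expansion & terms) >= min_shared
        )
    return related


# Hypothetical top results for the query "xp".
top_results = {
    1: {"xp", "windows", "install", "driver"},
    2: {"xp", "windows", "driver", "update"},
    3: {"xp", "extreme", "programming", "agile"},
    4: {"xp", "programming", "agile", "pairing"},
}
groups = related_within_results(top_results, query_terms={"xp"})
print(groups)
```

With real data the expansion terms would come from Xapian's own relevance feedback rather than this crude rarity rule, so the grouping quality here says nothing about what the real approach would achieve.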
Olly Betts
2004-Dec-17 10:37 UTC
[Xapian-devel] Custom weight factors - pushing the relevancy ranking how we want it
On Fri, Dec 17, 2004 at 11:06:41AM +0100, Michiel Roding wrote:
> As forums are, the content that is relevant to a search is not just
> determined by the frequency or location of the terms; the date the topic
> has been last modified is important as well.

The match bias code will probably be useful here. I need to tidy up the UI, which is slowly bubbling up my todo list. But it'll allow a date-dependent extra weight term so more recent topics can get a boost.

> Another issue we find is that the amount of results is so overwhelming,
> the user is unable to find the correct topic for his needs. Combining
> this with some statistics, we found that a very large part of the
> queries to Omega are the same. Keywords like windows, xp, dvd etc. are
> very popular.
> Therefore, we are contemplating to build a "does this topic meet your
> search?" feature to store which topics are most relevant to the queries
> as defined by the users.

One problem with this approach is that different users may want different results for the same query. Some searching for "xp" may want windows xp, others extreme programming. Hopefully you'll end up with a small enough set of favoured results that this won't be too much of a problem though.

And if you built a second Xapian database where each topic is indexed only by terms which people have voted for, then you could use topterms to allow users to narrow in on particular meanings.

> Other features could be a lame attempt at the PageRank relevancy,
> storing if a user almost immediately skips a topic (irrelevant) etc.
>
> But, this needs to be stored (easy) and processed by Xapian in the sorting.
>
> How could we go about this? Does Xapian somehow support these custom
> weight factors?

The match bias code again. Currently it's hardwired to expect a Unix timestamp and to give an exponentially decaying weight from the present. But the concept is that it could be used for all sorts of things, including this.

Cheers,
Olly
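To make the "exponentially decaying weight from the present" idea concrete, here is a minimal sketch of a recency bonus computed from a Unix timestamp. It is not Xapian's match bias code, whose interface is not shown in this thread; the function names and the max_bonus and half_life_days parameters are assumptions made for illustration.

```python
import math
import time


def recency_bonus(last_modified, now=None, max_bonus=1.0, half_life_days=30.0):
    """Extra weight that decays exponentially with the age of a topic.

    last_modified and now are Unix timestamps in seconds. max_bonus and
    half_life_days are hypothetical tuning knobs, not Xapian parameters.
    """
    if now is None:
        now = time.time()
    age_seconds = max(0.0, now - last_modified)  # treat future timestamps as "now"
    half_life_seconds = half_life_days * 86400.0
    return max_bonus * math.exp(-math.log(2.0) * age_seconds / half_life_seconds)


def combined_score(text_weight, last_modified, now=None):
    """Add the recency bonus to an ordinary term-based weight."""
    return text_weight + recency_bonus(last_modified, now)


now = 1_103_279_000  # a fixed reference time so the example is repeatable
fresh = combined_score(2.0, now - 1 * 86400, now)   # modified one day ago
stale = combined_score(2.0, now - 90 * 86400, now)  # modified ninety days ago
print(fresh > stale)
```

An additive bonus like this lets a recent topic overtake an older one with a slightly better text match, while a strongly matching old topic still ranks well; how large max_bonus should be relative to typical term weights would need tuning against real queries.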