Ivan Sutter
2008-Sep-26 16:11 UTC
[Xapian-discuss] Need more explanations about Xapian's expanding
Hi there,

I'm using Xapian with a database containing movies, TV shows etc. (about 35000) and actors (about 35000 too). Indexing and basic search work well, but the results from query expansion are not very relevant.

I'm drawing inspiration from simpleexpand.php5, and I am trying to play with the values passed to get_mset() and get_eset() to find the best suggestions.

First, I get the top 40 results:

    $matches = $enquire->get_mset(0, 40, $rset);

That works well, even if 40 is often more than enough (I mean I often get fewer than 40 results).

Then, I call this part (sorry for the stupid copy-paste):

    // If no relevant docids were given, invent an RSet containing the top 5
    // matches (or all the matches if there are less than 5).
    if ($rset->is_empty()) {
        $c = 20; // so here I've put 20 instead of 5...
        $i = $matches->begin();
        while ($c-- && !$i->equals($matches->end())) {
            $rset->add_document($i->get_docid());
            $i->next();
        }
    }

What seems odd to me is that my $rset is empty at this point, even though it has already been passed to the previous get_mset() call! I must have missed something.

Finally, I get the suggestions:

    $eset = $enquire->get_eset(10, $rset);

As you can see, I haven't mastered all these lines... I would just like some help understanding how these "ratios" (the 40, 20 and 5) affect the result.

Don't worry, I've run tests, but given the amount of data, it's hard to know whether I've found a truly good setting or whether it's just luck! So a "scientific" explanation would be great!

Thanks in advance.
Olly Betts
2008-Sep-30 05:13 UTC
[Xapian-discuss] Need more explanations about Xapian's expanding
On Fri, Sep 26, 2008 at 06:11:40PM +0200, Ivan Sutter wrote:
> First, I get the top 40 results :
> $matches = $enquire->get_mset(0, 40, $rset);
> That works well, even if 40 is often enough (I mean I often get less that 40
> results).
>
> Then, I call this part (sorry for the stupid copy-paste) :
> // If no relevant docids were given, invent an RSet containing the top 5
> // matches (or all the matches if there are less than 5).
> if ($rset->is_empty()) {
> $c = 20; // so here I've put 20 instead of 5...
> $i = $matches->begin();
> while ($c-- && !$i->equals($matches->end())) {
> $rset->add_document($i->get_docid());
> $i->next();
> }
> }
> And in fact that's weird because my $rset is empty but it's called in the
> previous get_mset() ! I've missed something.
>
> Finally, I'm getting the suggestions :
> $eset = $enquire->get_eset(10, $rset);
>
> As you can see, I'm not mastering all these lines ... I just wish some help
> to know how these "ratios" (the 40, 20 and 5) are affecting the result.

Well, "40" is just the size of the MSet you've requested.

"20" is how many documents from the MSet you are adding to the RSet, and "5" is how many the example you start from added. I suspect 20 is too many - you want the RSet to contain genuinely relevant documents. Ideally the user would pick the relevant documents, but you can often get reasonable results by assuming that the top few entries from the MSet are relevant. But the more you add, the more likely that some won't actually be relevant - I would guess that 20 is too high, especially if you are often getting less than 40 results in total.

You could probably look at how the MSet weights vary to pick a cut-off dynamically. I've not done tests, but it seems likely that you don't want to keep adding documents once the weights drop sharply.

I wonder if you meant "10" not "5"? "10" is the number of terms you'd like in the ESet.

> Don't worry, I've run tests, but according to the amount of data, it's hard
> to know if I've find a true good result or if it's just luck !
> So a "scientific" explanation would be grate !

I'm not sure "science" can automatically give you good values for the number of documents to add to the auto-generated RSet and the number of relevant terms to ask for. You probably do want to run some tests to empirically validate the numbers you're using.

Cheers,
Olly