Hi!

Every time I think I've got the Xapian search for MusicBrainz licked, I ask for more feedback and my community finds yet another test case that throws a monkey wrench into my project. And the more I try to understand Xapian's weighting system, the less I really understand it.

Let me ask a specific question -- in my release index (an index of CD titles, essentially) I have a field called type. When the value of this field is "album" I give it a termcount of 100. All other values for this field and all other fields get a termcount of 1.

For the enquire, I use a stock object. I do not define a weighting system, do not tinker with doc order or sort order. When I search for the term "love" in the release title (very common term), the top hits are the ones that contain the word "love" twice. Good.

But, for all the hits that have the word "love" in them once, I would expect to see the releases of type "album" to be near the top. But they are not:

http://musicbrainz.homeip.net/search/textsearch.html?query=love&handlearguments=1&limit=25&type=release&adv=0&offset=0

They make up the *bottom* 3-4 pages of the results, meaning they got ranked BELOW all the non-"album" values:

http://musicbrainz.homeip.net/search/textsearch.html?query=love&handlearguments=1&limit=25&type=release&adv=0&offset=250

I can clearly see that my weighting is having an effect, but it's the opposite effect from what I am expecting. What am I missing here?

Any tips would be appreciated!

--
--ruaok      Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye     --     rob at eorbit.net     --     http://mayhem-chaos.net
On Tue, Jul 01, 2008 at 04:15:53PM -0700, Robert Kaye wrote:
> Let me ask a specific question -- in my release index (an index of CD
> titles, essentially) I have a field called type. When the value of
> this field is "album" I give it a termcount of 100. All other values
> for this field and all other fields get a termcount of 1.

I assume you're using scriptindex, and are turning that scriptindex input field into a term (probably a boolean term). Are you applying the termcount *solely* to the type field? That won't do much here, and other emergent properties are giving you the result you see.

If you're bumping the termcount to 100 for all terms you generate for a Xapian document with type=album, that's a different matter. From your description, I don't think you are doing that; can you confirm one way or the other?

J

--
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org
On Tue, Jul 01, 2008 at 04:15:53PM -0700, Robert Kaye wrote:
> Let me ask a specific question -- in my release index (an index of CD
> titles, essentially) I have a field called type. When the value of
> this field is "album" I give it a termcount of 100. All other values
> for this field and all other fields get a termcount of 1.
>
> For the enquire, I use a stock object. I do not define a weighting
> system, do not tinker with doc order or sort order. When I search for
> the term "love" in the release title (very common term), the top hits
> are the ones that contain the word "love" twice. Good.
>
> But, for all the hits that have the word "love" in them once, I would
> expect to see the releases of type "album" to be near the top. But
> they are not:

Are you adding this type term to queries?

If not, the effect of indexing the type term with those termcounts will be to increase the document length of albums. That will tend to decrease the importance of each occurrence of "love" in the album title, so albums will indeed tend to rank lower.

Perhaps a better approach would be to keep the type term with wdf 1 regardless of the type, and then take your query and adjust it like so:

    Xapian::Query album_boost("XTYPEalbum");
    album_boost = Xapian::Query(Xapian::Query::OP_SCALE_WEIGHT, album_boost, 4.2);
    query = Xapian::Query(Xapian::Query::OP_AND_MAYBE, query, album_boost);

You can adjust the 4.2 factor to alter how much albums are boosted, and you can also search "fairly", or boost individual tracks instead if you prefer - and none of this requires a reindex.

Cheers,
    Olly
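To make the document-length effect described above concrete, here is a minimal sketch of a simplified, textbook BM25-style term weight. It assumes the stock Enquire object is using a BM25-style scheme; it is not Xapian's actual implementation. The function name bm25_term_weight, the idf argument, and the k1 and b values are illustrative assumptions, not anything taken from the thread.

```cpp
#include <cassert>

// Simplified textbook BM25 weight for one query term in one document.
// Illustrative only: the name, the parameters and their defaults are
// assumptions for this sketch, not Xapian's API or its exact formula.
//   wdf     - how often the term occurs in the document
//   doc_len - sum of the wdf of every term in the document
//   avg_len - average document length across the index
//   idf     - rarity factor for the term (the same for every document)
double bm25_term_weight(double wdf, double doc_len, double avg_len,
                        double idf, double k1 = 1.0, double b = 0.5) {
    assert(avg_len > 0.0);
    const double norm_len = doc_len / avg_len;  // relative document length
    // A longer document makes the denominator larger, so each
    // occurrence of the term contributes less weight.
    const double denom = k1 * ((1.0 - b) + b * norm_len) + wdf;
    return idf * (k1 + 1.0) * wdf / denom;
}
```

With a hypothetical three-word title, a non-album release has length 3 + 1 = 4, while an album whose type term is indexed with wdf 100 has length 3 + 100 = 103. Calling bm25_term_weight(1, 103, 10, 1) gives a smaller value than bm25_term_weight(1, 4, 10, 1), which matches the albums sinking to the bottom of the single-"love" hits.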
On Jul 1, 2008, at 6:14 PM, Olly Betts wrote:
> Are you adding this type term to queries? If not, the effect of
> indexing the type term with those termcounts will be to increase the
> document length of albums. That will tend to decrease the importance
> of each occurrence of "love" in the album title, so albums will
> indeed tend to rank lower.

Ah ha -- that explains it -- thanks.

>     Xapian::Query album_boost("XTYPEalbum");
>     album_boost = Xapian::Query(Xapian::Query::OP_SCALE_WEIGHT, album_boost, 4.2);
>     query = Xapian::Query(Xapian::Query::OP_AND_MAYBE, query, album_boost);

OK, I see how that can be really useful. Since I am providing an end user search service, should I write my own parser and generate my own queries or should I post-process the results from QueryParser to tack on the fields that would give the user better search results?

--
--ruaok      Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye     --     rob at eorbit.net     --     http://mayhem-chaos.net