Ayush Tomar
2016-Jun-27 11:53 UTC
xapian-letor: Prefix strategy discussion while indexing and preparing training file
Hello,

Following the discussion with James on prefix strategy being used while indexing, at present, while preparing training file in xapian-letor (prepare_training_file() function in api/letor_internal.cc), the following hard-coded prefixes are added to every query from the query file:

    Xapian::QueryParser parser;
    parser.add_prefix("title", "S");
    parser.add_prefix("subject", "S");

Hence, each query is parsed as follows: title:<query> ... <query>.

A user might not have this specific metadata storage in the database or could have some other prefixes that were used while indexing. Anyway, the user's query file should take care of any prefixes in the query string by itself.

Hence, is it a good idea to give hard-coded support for these specific prefixes by default?

--
Kind Regards,
Ayush Tomar | My Webpage <http://ayshtmr.xyz> | LinkedIn <https://in.linkedin.com/in/ayushtomar>
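To make the question concrete, here is a minimal sketch of what a field-to-prefix mapping does to a query word. It is a toy model under stated assumptions, not Xapian's QueryParser: `PrefixMap` and `to_term` are hypothetical names, and the sketch does no stemming or case folding. It only illustrates that `title:band` becomes the term `Sband` when `title` is mapped to `S`, and that the mapping could be passed in rather than hard-coded.

```cpp
#include <map>
#include <string>

// Hypothetical type, not part of Xapian: maps user-facing field names
// to the term prefixes that were chosen at index time.
using PrefixMap = std::map<std::string, std::string>;

// Turn one query word such as "title:band" into a term such as "Sband".
// Words without a field, or with a field missing from the map, are
// returned unchanged. Unlike the real QueryParser, nothing is stemmed.
std::string to_term(const PrefixMap& prefixes, const std::string& word)
{
    const std::string::size_type colon = word.find(':');
    if (colon == std::string::npos) return word;

    const auto it = prefixes.find(word.substr(0, colon));
    if (it == prefixes.end()) return word;

    return it->second + word.substr(colon + 1);
}
```

With a map holding `title` and `subject`, both pointing at `S`, the words `title:band` and `subject:band` produce the same term, while a plain `band` is left unprefixed.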
James Aylett
2016-Jun-28 09:46 UTC
xapian-letor: Prefix strategy discussion while indexing and preparing training file
On Mon, Jun 27, 2016 at 05:23:36PM +0530, Ayush Tomar wrote:

> Following the discussion with James on prefix strategy being used while
> indexing, at present, while preparing training file in xapian-letor
> (prepare_training_file() function in api/letor_internal.cc), the following
> hard-coded prefixes are added to every query from the query file:
>
> Xapian::QueryParser parser;
> parser.add_prefix("title", "S");
> parser.add_prefix("subject", "S");
>
> Hence, each query is parsed as follows: title:<query> ... <query>.

into the terms Sstemmed_word... which is the common Xapian approach for titles.

> A user might not have this specific metadata storage in the database or
> could have some other prefixes that were used while indexing.

This is equivalent to having to match prefix configuration between indexing and searching in non-letor use. (Indeed, that configuration happens again in bin/questletor.cc.)

> Anyway, the user's query file should take care of any prefixes in
> the query string by itself.

I disagree, because the query isn't the same as the terms in the database, and that's something set at index time, and is (as I understand it) independent of the letor training data (which is input data, not Xapian terms). I don't think we want people to have to convert training data (which should be human-understandable) into files full of prefixed terms (Zband ZSband &c).

> Hence, is it a good idea to give hard-coded support for these specific
> prefixes by default?

For the time being, I think it's fine. There are more important things to worry about (such as which aspects of `Letor` belong on the `Ranker`, representing the specialisation at work -- SVM, RankList or whatever -- and which should be in an equivalent of `Enquire`, representing the process of re-ranking an MSet).

J

--
James Aylett, occasional trouble-maker
xapian.org
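The point about matching prefix configuration between indexing and searching can be sketched with the standard library alone. This is an illustrative model, not xapian-letor code: `PrefixConfig`, `index_field` and `matches` are hypothetical names, terms are plain whitespace-separated words, and no stemming is applied. It shows only that a field query finds a document when both sides use the same field-to-prefix mapping, and misses it when they differ.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical configuration shared by the indexer and the searcher.
struct PrefixConfig {
    std::map<std::string, std::string> field_to_prefix;

    // Prefix for a field, or an empty string if the field is unknown.
    std::string prefix_for(const std::string& field) const {
        const auto it = field_to_prefix.find(field);
        return it == field_to_prefix.end() ? std::string() : it->second;
    }
};

// Index side: split the field text on whitespace and store prefixed terms.
std::set<std::string> index_field(const PrefixConfig& cfg,
                                  const std::string& field,
                                  const std::string& text)
{
    std::set<std::string> terms;
    std::istringstream words(text);
    std::string word;
    while (words >> word) terms.insert(cfg.prefix_for(field) + word);
    return terms;
}

// Search side: a field query matches only if the prefixed term was indexed.
bool matches(const PrefixConfig& cfg, const std::set<std::string>& terms,
             const std::string& field, const std::string& word)
{
    return terms.count(cfg.prefix_for(field) + word) != 0;
}
```

Indexing a title under a configuration that maps `title` to `S` and then searching with that same configuration finds the word. Searching with a configuration that maps `title` to some other prefix does not, which is the mismatch the training data itself should not have to compensate for.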