I am interested in the project 'Learning to Rank Click Data Mining'. Here is my current understanding of it:

1. Where we can get the click data: we can extend omega to support logging the user's searches and clicked documents.

2. The specific click data and its format: based on some papers and the formats of public query datasets (the AOL search query logs [1] and the Sogou Chinese query logs [2]), I think the click data should contain: a user identifier such as an IP address or cookie, a timestamp, the query contents, the list of document ids shown, and the clicked document id. The specific format would be:

User identifier \t timestamp \t query contents \t shown document id list \t clicked document id \n

3. Relevance judgements: I have read some papers about deriving relevance judgements from click models. Specifically, [3] uses a Dynamic Bayesian Network which considers the result set as a whole and takes the influence of the other URLs into account while estimating the relevance of a given URL from click logs, effectively reducing position bias (URLs appearing in lower positions are less likely to be clicked even if they are relevant). [4] proposes efficient discriminative parameter estimation in a multiple instance learning (MIL) algorithm to automatically produce true relevance labels for <query, URL> pairs. The basic idea of the MIL framework is that during training, instances are presented in bags, and labels are provided for the bags rather than for individual instances. If a bag is labelled positive, it is assumed to contain at least one positive instance; a negative bag means that all instances in the bag are negative. From a collection of labelled bags, the classifier tries to work out which instance in a positive bag is the most "correct".

4. I do not fully understand 'what a sensible workflow is, for people who want this to be run automatically on a regular basis to update their Letor model'. Does it mean that some users want to automatically collect additional training data on a regular basis and then update (retrain) the Letor module, so we should provide them with a sensible workflow and documentation?

Based on the above understanding, here is my plan for the next period:

1. Implement the logging function in omega; the first step is to become familiar with omega and successfully save the search query and its results.
2. Read more papers about click models and propose an effective way to judge relevance based on them.

Looking forward to your opinion, and please correct me if I am wrong. Thanks!

References:
[1] http://www.researchpipeline.com/mediawiki/index.php?title=AOL_Search_Query_Logs
[2] http://www.sogou.com/labs/resource/q.php
[3] Chapelle, O. and Zhang, Y. A dynamic Bayesian network click model for web search ranking. In: International Conference on World Wide Web, ACM, 2009, pp. 1-10.
[4] Song, H., Miao, C. and Shen, Z. Generating true relevance labels in Chinese search engine using clickthrough data. In: AAAI Conference on Artificial Intelligence, AAAI Press, 2011, pp. 1230-1236.
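To make the proposed tab-separated format concrete, here is a minimal sketch of a parser for one log line. All names (the record fields, the comma-separated encoding of the shown-document list) are my own illustrative assumptions, not an agreed format.

```python
from collections import namedtuple

# One record of the proposed tab-separated click log. Field names
# here are illustrative, not part of any agreed format.
ClickRecord = namedtuple(
    "ClickRecord", "session_id timestamp query shown_docids clicked_docid")

def parse_click_line(line):
    """Parse one tab-separated log line into a ClickRecord.

    The shown document id list is assumed to be comma-separated
    within its field.
    """
    session_id, timestamp, query, shown, clicked = (
        line.rstrip("\n").split("\t"))
    return ClickRecord(session_id, float(timestamp), query,
                       shown.split(","), clicked)

rec = parse_click_line("s42\t1489420200\txapian letor\t10,3,7,22\t3\n")
```

Writing a line back out is just the reverse `"\t".join(...)`, so the same record type could serve both the omega logging side and the training-data side.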
On 13 Mar 2017, at 17:10, YuLun Cai <buptcyl at gmail.com> wrote:

> 1. Where we can get the click data: we can extend omega to support logging the user's searches and clicked documents.

I think it's also important to have clear documentation for the logging we use, so that non-omega users can benefit from this work. Note that omega already supports some logging, using $log{}.

> 2. The specific click data and its format: based on some papers and the formats of public query datasets (the AOL search query logs [1] and the Sogou Chinese query logs [2]), I think the click data should contain: a user identifier such as an IP address or cookie, a timestamp, the query contents, the list of document ids shown, and the clicked document id.

You need to make a judgement call as to whether training on historical data from the specific Xapian use, or something like the AOL general query logs, is going to produce better results. (I'm guessing that will depend on the method used to generate relevance judgements.)

If you're just using these as references for coming up with a format, then this isn't an issue (but I'd recommend driving your format based on what information you need, rather than on what others have done).

> The specific format is:
> User identifier \t timestamp \t query contents \t shown document id list \t clicked document id \n

It doesn't have to be a user identifier; it can be a search or session id. (This can have different privacy implications, although we don't want to try to give recommendations on that side of things.)

> 3. Relevance judgements.

[snip]

It sounds like you're looking at the right kind of resources for this. Your proposal should be detailed about which route(s) you think are most likely to yield a helpful approach, since this will affect your timeline. (If it makes sense to do some tests of different approaches, that is something you can propose to do during the community bonding period, if you have time, or at the start of the project itself.)

> 4. I do not fully understand 'what a sensible workflow is, for people who want this to be run automatically on a regular basis to update their Letor model'. Does it mean that some users want to automatically collect additional training data on a regular basis and then update (retrain) the Letor module, so we should provide them with a sensible workflow and documentation?

Yes. Particularly for small sites, or startups who might be evolving rapidly, the detail of what constitutes a good search result may change quite rapidly. In general, this will always change over time, even with large and stable sites. So it's important to be able to update the trained Letor model.

> Based on the above understanding, here is my plan for the next period:
> 1. Implement the logging function in omega; the first step is to become familiar with omega and successfully save the search query and its results.

I'd start this by writing a detailed plan of what you intend to implement.

J

--
James Aylett, occasional troublemaker & project governance
xapian.org
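One possible shape for the "run automatically on a regular basis" workflow discussed here, sketched as a crontab. Every script name and path below is a placeholder of my own invention, not a real Xapian or omega tool; the actual pipeline would depend on how the logging and training components end up being exposed.

```shell
# Hypothetical nightly Letor retraining pipeline (all script names
# and paths are placeholders, not real Xapian tools):
#   1. rotate the click log written by the search front end,
#   2. convert accumulated clicks into relevance judgements / training data,
#   3. retrain the Letor model and swap it in.
#
# m  h dom mon dow  command
0  3 *   *   *     /srv/search/bin/rotate-click-log.sh /var/log/search/clicks.log
15 3 *   *   *     /srv/search/bin/clicks-to-training-data.sh /var/log/search/archive /srv/search/training.txt
30 3 *   *   *     /srv/search/bin/retrain-letor.sh /srv/search/training.txt /srv/search/letor.model
```

The point of splitting it into three steps is that each stage's output (rotated logs, training file, model) is a plain file, so a small site can inspect or rerun any stage by hand, which matters when "what constitutes a good search result" is still changing rapidly.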
Hi James,

Thanks for your reply.

> I think it's also important to have clear documentation for the logging we use, so that non-omega users can benefit from this work. Note that omega already supports some logging, using $log{}.

I quite agree with you that clear documentation is important. And thanks for pointing out that omega already supports some logging; I will look through it.

> but I'd recommend driving your format based on what information you need, rather than on what others have done

Yes, I will first consider what information is important for relevance judgement. I looked at the AOL general query logs because I think they log the most important things, which can guide me towards the information I need.

> It doesn't have to be a user identifier; it can be a search or session id. (This can have different privacy implications, although we don't want to try to give recommendations on that side of things.)

By "user identifier" I mean an identifier for where the search query comes from, which is effectively the same as a search or session id, or the searcher's IP.

> I'd start this by writing a detailed plan of what you intend to implement.

I'm writing a draft proposal and will submit it to the GSoC website soon.

Thanks

2017-03-19 19:55 GMT+08:00 James Aylett <james-xapian at tartarus.org>:

[snip]
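As a footnote to the relevance-judgement discussion in this thread: before committing to the full DBN [3] or MIL [4] approaches, a much simpler baseline can be prototyped from the same click log. The sketch below estimates the relevance of a (query, docid) pair as actual clicks divided by expected clicks, using a fixed 1/rank examination probability. Both the 1/rank assumption and the function name are my own simplifications for illustration, not anything proposed in the thread.

```python
from collections import defaultdict

def coec_relevance(sessions):
    """Clicks-over-expected-clicks relevance estimate.

    sessions: iterable of (query, shown_docids, clicked_docid).
    Expected clicks use a fixed 1/rank examination probability,
    a crude stand-in for a proper click model's position bias term.
    """
    clicks = defaultdict(float)
    expected = defaultdict(float)
    for query, shown, clicked in sessions:
        for rank, docid in enumerate(shown, start=1):
            expected[(query, docid)] += 1.0 / rank
            if docid == clicked:
                clicks[(query, docid)] += 1.0
    return {pair: clicks[pair] / exp for pair, exp in expected.items()}

sessions = [
    ("xapian", ["d1", "d2", "d3"], "d2"),
    ("xapian", ["d1", "d2", "d3"], "d2"),
    ("xapian", ["d2", "d1", "d3"], "d2"),
]
rel = coec_relevance(sessions)
```

Here d2 is always clicked even when shown below d1, so it scores above 1, while the never-clicked d1 scores 0; a baseline like this also gives something concrete to compare the DBN and MIL routes against during the community bonding period.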