I am interested in the project 'Learning to Rank Click Data Mining'. Here is my current understanding of it:

1. Where we can get the click data: we can extend omega to support logging the user's searches and clicked documents.

2. The specific click data and its format: based on some papers and the formats of public query datasets (the AOL search query logs [1] and the Sogou Chinese query logs [2]), I think the click data should contain: a user identifier such as an IP address or cookie, a timestamp, the query contents, the list of document ids shown, and the clicked document id. The specific format would be:

User identifier \t timestamp \t query contents \t shown document id list \t clicked document id \n

3. Relevance judgements: I have read some papers about deriving relevance judgements from click models. Specifically, [3] uses a Dynamic Bayesian Network which considers the result set as a whole and takes the influence of the other URLs into account while estimating the relevance of a given URL from click logs, effectively reducing position bias (URLs appearing in lower positions are less likely to be clicked even if they are relevant). [4] proposes efficient discriminative parameter estimation in a multiple instance learning (MIL) algorithm to automatically produce true relevance labels for <query, URL> pairs. The basic idea of the MIL framework is that during training, instances are presented in bags, and labels are provided for the bags rather than for individual instances. If a bag is labelled positive, it is assumed to contain at least one positive instance; a negative bag means that all instances in the bag are negative. From a collection of labelled bags, the classifier tries to work out which instance in a positive bag is the most "correct".

4. I do not fully understand 'what a sensible workflow is, for people who want this to be run automatically on a regular basis to update their Letor model'. Does it mean that some users want to automatically collect additional training data on a regular basis and then update (retrain) the Letor module, so we should provide them with a sensible workflow and documentation?

Based on the above understanding, here is my plan for the next period:

1. Implement the logging function in omega; the first step is to become familiar with omega and successfully save the search query and its results.
2. Read more papers about click models and propose an effective way to judge relevance based on them.

Looking forward to your opinion, and please correct me if I am wrong. Thanks!

References:
[1] http://www.researchpipeline.com/mediawiki/index.php?title=AOL_Search_Query_Logs
[2] http://www.sogou.com/labs/resource/q.php
[3] Chapelle, O. and Zhang, Y. A dynamic Bayesian network click model for web search ranking. In: International Conference on World Wide Web, ACM, 2009, pp. 1-10.
[4] Song, H., Miao, C. and Shen, Z. Generating true relevance labels in Chinese search engine using clickthrough data. In: AAAI Conference on Artificial Intelligence, AAAI Press, 2011, pp. 1230-1236.
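To make the proposed tab-separated format concrete, here is a minimal sketch of a parser for one log line. All names (the record fields, the comma-separated encoding of the shown-document list) are my own illustrative assumptions, not an agreed format.

```python
from collections import namedtuple

# One record of the proposed tab-separated click log. Field names
# here are illustrative, not part of any agreed format.
ClickRecord = namedtuple(
    "ClickRecord", "session_id timestamp query shown_docids clicked_docid")

def parse_click_line(line):
    """Parse one tab-separated log line into a ClickRecord.

    The shown document id list is assumed to be comma-separated
    within its field.
    """
    session_id, timestamp, query, shown, clicked = (
        line.rstrip("\n").split("\t"))
    return ClickRecord(session_id, float(timestamp), query,
                       shown.split(","), clicked)

rec = parse_click_line("s42\t1489420200\txapian letor\t10,3,7,22\t3\n")
```

Writing a line back out is just the reverse `"\t".join(...)`, so the same record type could serve both the omega logging side and the training-data side.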
On 13 Mar 2017, at 17:10, YuLun Cai <buptcyl at gmail.com> wrote:

> 1. Where we can get the click data: we can extend omega to support logging the user's searches and clicked documents.

I think it's also important to have clear documentation for the logging we use, so that non-omega users can benefit from this work. Note that omega already supports some logging, using $log{}.

> 2. The specific click data and its format: based on some papers and the formats of public query datasets (the AOL search query logs [1] and the Sogou Chinese query logs [2]), I think the click data should contain: a user identifier such as an IP address or cookie, a timestamp, the query contents, the list of document ids shown, and the clicked document id.

You need to make a judgement call as to whether training on historical data from the specific Xapian use, or something like the AOL general query logs, is going to produce better results. (I'm guessing that will depend on the method used to generate relevance judgements.)

If you're just using these as references for coming up with a format, then this isn't an issue (but I'd recommend driving your format based on what information you need, rather than on what others have done).

> The specific format is:
> User identifier \t timestamp \t query contents \t shown document id list \t clicked document id \n

It doesn't have to be a user identifier; it can be a search or session id. (This can have different privacy implications, although we don't want to try to give recommendations on that side of things.)

> 3. Relevance judgements.

[snip]

It sounds like you're looking at the right kind of resources for this. Your proposal should be detailed about which route(s) you think are most likely to yield a helpful approach, since this will affect your timeline. (If it makes sense to do some tests of different approaches, that is something you can propose to do during the community bonding period, if you have time, or at the start of the project itself.)

> 4. I do not fully understand 'what a sensible workflow is, for people who want this to be run automatically on a regular basis to update their Letor model'. Does it mean that some users want to automatically collect additional training data on a regular basis and then update (retrain) the Letor module, so we should provide them with a sensible workflow and documentation?

Yes. Particularly for small sites, or startups who might be evolving rapidly, the detail of what constitutes a good search result may change quite rapidly. In general, this will always change over time, even with large and stable sites. So it's important to be able to update the trained Letor model.

> Based on the above understanding, here is my plan for the next period:
> 1. Implement the logging function in omega; the first step is to become familiar with omega and successfully save the search query and its results.

I'd start this by writing a detailed plan of what you intend to implement.

J

--
James Aylett, occasional troublemaker & project governance
xapian.org
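One possible shape for the "run automatically on a regular basis" workflow discussed here, sketched as a crontab. Every script name and path below is a placeholder of my own invention, not a real Xapian or omega tool; the actual pipeline would depend on how the logging and training components end up being exposed.

```shell
# Hypothetical nightly Letor retraining pipeline (all script names
# and paths are placeholders, not real Xapian tools):
#   1. rotate the click log written by the search front end,
#   2. convert accumulated clicks into relevance judgements / training data,
#   3. retrain the Letor model and swap it in.
#
# m  h dom mon dow  command
0  3 *   *   *     /srv/search/bin/rotate-click-log.sh /var/log/search/clicks.log
15 3 *   *   *     /srv/search/bin/clicks-to-training-data.sh /var/log/search/archive /srv/search/training.txt
30 3 *   *   *     /srv/search/bin/retrain-letor.sh /srv/search/training.txt /srv/search/letor.model
```

The point of splitting it into three steps is that each stage's output (rotated logs, training file, model) is a plain file, so a small site can inspect or rerun any stage by hand, which matters when "what constitutes a good search result" is still changing rapidly.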
Hi James,

Thanks for your reply.

> I think it's also important to have clear documentation for the logging we use, so that non-omega users can benefit from this work. Note that omega already supports some logging, using $log{}.

I quite agree with you that clear documentation is important. And thanks for pointing out that omega already supports some logging; I will look through it.

> but I'd recommend driving your format based on what information you need, rather than on what others have done

Yes, I will first consider what information is important for relevance judgement. I looked at the AOL general query logs because I think they log the most important things, which can guide me towards the information I need.

> It doesn't have to be a user identifier; it can be a search or session id. (This can have different privacy implications, although we don't want to try to give recommendations on that side of things.)

By "user identifier" I mean an identifier for where the search query comes from, which is effectively the same as a search or session id, or the searcher's IP.

> I'd start this by writing a detailed plan of what you intend to implement.

I'm writing a draft proposal and will submit it to the GSoC website soon.

Thanks

2017-03-19 19:55 GMT+08:00 James Aylett <james-xapian at tartarus.org>:

[snip]
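As a footnote to the relevance-judgement discussion in this thread: before committing to the full DBN [3] or MIL [4] approaches, a much simpler baseline can be prototyped from the same click log. The sketch below estimates the relevance of a (query, docid) pair as actual clicks divided by expected clicks, using a fixed 1/rank examination probability. Both the 1/rank assumption and the function name are my own simplifications for illustration, not anything proposed in the thread.

```python
from collections import defaultdict

def coec_relevance(sessions):
    """Clicks-over-expected-clicks relevance estimate.

    sessions: iterable of (query, shown_docids, clicked_docid).
    Expected clicks use a fixed 1/rank examination probability,
    a crude stand-in for a proper click model's position bias term.
    """
    clicks = defaultdict(float)
    expected = defaultdict(float)
    for query, shown, clicked in sessions:
        for rank, docid in enumerate(shown, start=1):
            expected[(query, docid)] += 1.0 / rank
            if docid == clicked:
                clicks[(query, docid)] += 1.0
    return {pair: clicks[pair] / exp for pair, exp in expected.items()}

sessions = [
    ("xapian", ["d1", "d2", "d3"], "d2"),
    ("xapian", ["d1", "d2", "d3"], "d2"),
    ("xapian", ["d2", "d1", "d3"], "d2"),
]
rel = coec_relevance(sessions)
```

Here d2 is always clicked even when shown below d1, so it scores above 1, while the never-clicked d1 scores 0; a baseline like this also gives something concrete to compare the DBN and MIL routes against during the community bonding period.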