thr3ads.net - Xapian devel - GSoC 2017: Letor Click Data Mining [Mar 2017]


Vivek Pal

2017-Mar-22 14:27 UTC

GSoC 2017: Letor Click Data Mining

Hi James,
> Isn't this from the query template, ie from the main web page of search
> results? (It might make sense from opensearch as well, though.)
Yes, you are right; it is the query template. The reason I said opensearch
template is that I haven't read all sections of the Omega docs yet and I'm
still working through them. Thanks for pointing that out.

I'm aiming to cover most of it in a day or two to get a good understanding of
how the project will fit in. I won't be able to cover all the OmegaScript
commands, but I will cover at least the most relevant ones, like $log.
> We need some way of logging when people click on a search result — which
> you can build using a second omegascript template, as Olly suggested.
Okay, so it will sit between the query template and the linked document that a
search result points to. Do you think we need to make this new template
transparent to the user in some way, since we might have to record some
information such as user ids in the form of IP addresses? In any case, we'll
need a way to distinguish between different users by assigning unique ids to
them.
> So the only thing you really need to know is the ENTRY format, so you can
> figure out how to log what you need. (Which you should identify before
> diving into code.)
I see; though wouldn't it be helpful to also have an example of it in the
documentation? There's a DEFAULT_LOG_ENTRY string in query.cc that I came
across while working on the word_in_list PR:

"$or{$env{REMOTE_HOST},$env{REMOTE_ADDR},-}\t"
"[$date{$now,%d/%b/%Y:%H:%M:%S} +0000]\t"
"$if{$cgi{X},add,$if{$cgi{MORELIKE},morelike,query}}\t"
"$dbname\t"
"$query\t"
"$msize$if{$env{HTTP_REFERER},\t$env{HTTP_REFERER}}";

Could you explain the meaning of the third and last strings?
> You need to think more carefully about the layers involved here. We don't
> want to post-process the output of a template...
Yes, so I thought about it in detail and I think the whole process would look
like the following from a broad perspective:

1. Rearrangement: Feed the original results to the FairPairs algorithm, which
will rearrange them; the rearranged results will then be presented by the
query template.

2. Logging: Log the required data using a new template and store it in an
appropriate format for further processing.

3. Click Models: These are successors of the preference pair models which I
mentioned earlier. We have some options here, as described in the book "Click
Models for Web Search", such as DBN, DCM, CCM etc. The chosen model will be
trained on a relevance dataset to give us relevance scores for the result
links in our logs, from which we'll generate the Qrel file used by
xapian-letor.
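
As a rough illustration of step 1, here is a minimal sketch of the rearrangement, assuming the usual description of FairPairs: adjacent results are grouped into pairs starting at a randomly chosen offset, and each pair is swapped with probability 0.5. The function names `fair_pairs` and `preferences` are hypothetical and are not part of Xapian or Omega.

```python
import random


def fair_pairs(results, rng=None):
    """Return (presented, pairs) for a ranked list of document ids.

    `pairs` holds the (upper, lower) positions in the presented list that
    were grouped together, so clicks can be interpreted afterwards.
    """
    rng = rng or random.Random()
    presented = list(results)
    # Offset 0 pairs positions (0,1),(2,3),...; offset 1 pairs (1,2),(3,4),...
    start = rng.choice((0, 1))
    pairs = []
    for i in range(start, len(presented) - 1, 2):
        if rng.random() < 0.5:  # swap each pair independently
            presented[i], presented[i + 1] = presented[i + 1], presented[i]
        pairs.append((i, i + 1))
    return presented, pairs


def preferences(presented, pairs, clicked):
    """Return (preferred, over) tuples for pairs where only the lower
    document was clicked."""
    clicked = set(clicked)
    prefs = []
    for upper, lower in pairs:
        if presented[lower] in clicked and presented[upper] not in clicked:
            prefs.append((presented[lower], presented[upper]))
    return prefs
```

The `pairs` list would have to be logged alongside the clicks (step 2), since the pairing cannot be reconstructed from the presented order alone.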

To train a click model, we'd need a relevance prediction dataset containing
human-generated binary relevance labels for query-document pairs. I'm curious
to know where we can obtain such a dataset. One that I know of is the Yandex
web search challenge dataset on Kaggle.

And thanks for the link to the MSet re-ordering system. I'll check out the
ideas that were discussed there.
> That page is ancient, so I hope you're actually installing the 1.4 series
> Xapian and Omega!
The latest stable release is the 1.4 series, but I actually have the 1.5
series installed, which I think is because I installed the dev version from
the latest git master. I don't think that should be a problem here?
> That looks to me like you haven't installed omega, but are trying to run
> with the development version
I have all the Xapian-related executables in /usr/local/bin, including
omindex. Does that suggest Omega is installed?
> When you ran `make install` for omega, it will have copied the CGI somewhere
In /usr/local/lib/xapian-omega/bin, I can't find a CGI, only these files:
mhtml2html, omega, outlookmsg2html, rfc822tohtml and vcard2text.
> More generally, I'd recommend reading the omega documentation.
Yes, I'll go through it. I'll give it a second try after reading the docs and
maybe ask for help with setting up Omega on IRC if I run into an issue again.

Thanks,
Vivek

James Aylett

2017-Mar-22 20:08 UTC


GSoC 2017: Letor Click Data Mining

On 22 Mar 2017, at 14:27, Vivek Pal <vivekpal.dtu at gmail.com> wrote:
>> We need some way of logging when people click on a search result, which
>> you can build using a second omegascript template, as Olly suggested.
>
> Okay, so it will act between the query template and a linked document
> pointed by a search result. Do you think we need to make this new template
> transparent to the user in some way as we might have to record some
> information such as user ids in the form of IP? In any case, we'll need a
> way to distinguish between different users by assigning unique ids to them.
You could do that by identifying the search session instead of the user, which
makes it closer to what we need than to something that might trip you into
privacy concerns.
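
A minimal sketch of the session-based approach: generate an opaque random token per search session and log that token with each click instead of an IP address. How the token would be carried between the query template and the click-logging template is left open here, and the record layout, `new_session_id` and `click_log_entry` are hypothetical names for illustration only.

```python
import secrets


def new_session_id(nbytes=16):
    """Return an opaque, URL-safe token identifying one search session."""
    return secrets.token_urlsafe(nbytes)


def click_log_entry(session_id, query, docid, rank):
    """Build one tab-separated click record (hypothetical field layout)."""
    return "\t".join([session_id, query, str(docid), str(rank)])
```

Because the token is random and not derived from anything about the user, the log can group clicks by session without storing who made them.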
>> So the only thing you really need to know is the ENTRY format, so you can
>> figure out how to log what you need. (Which you should identify before
>> diving into code.)
>
> I see; though it would be helpful to also have an example in the
> documentation for the same?
We don't really need an example; however I didn't read the documentation
carefully, so it may warrant rewording. Or maybe I should just be more diligent
in future.
> There's a DEFAULT_LOG_ENTRY string in query.cc that I came across
> while on the word_in_list PR:
> 
> "$or{$env{REMOTE_HOST},$env{REMOTE_ADDR},-}\t"
> "[$date{$now,%d/%b/%Y:%H:%M:%S} +0000]\t"
> "$if{$cgi{X},add,$if{$cgi{MORELIKE},morelike,query}}\t"
> "$dbname\t"
> "$query\t"
> "$msize$if{$env{HTTP_REFERER},\t$env{HTTP_REFERER}}";
> 
> Could you explain the meaning of the third and last strings?
Third records some information about what sort of query it is — add, morelike or
a plain query. Last provides the estimated match size and then the HTTP referrer
if one were set. Neither is particularly interesting in this case.
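
To make the field layout concrete, here is a minimal sketch of a parser for one line written with that default entry. It assumes that no field contains a literal tab, that `%b` produced an English month abbreviation, and that the timestamp really is UTC as the `+0000` suffix suggests. `LogEntry` and `parse_log_line` are hypothetical names, not part of Omega.

```python
from collections import namedtuple
from datetime import datetime, timezone

LogEntry = namedtuple("LogEntry", "host when kind dbname query msize referer")


def parse_log_line(line):
    """Split one tab-separated log line into its six or seven fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (6, 7):
        raise ValueError("expected 6 or 7 tab-separated fields, got %d" % len(fields))
    host, stamp, kind, dbname, query, msize = fields[:6]
    # The referrer field is only present when HTTP_REFERER was set.
    referer = fields[6] if len(fields) == 7 else None
    when = datetime.strptime(stamp, "[%d/%b/%Y:%H:%M:%S +0000]")
    return LogEntry(host, when.replace(tzinfo=timezone.utc), kind, dbname,
                    query, int(msize), referer)
```

The third field (`kind`) is one of `add`, `morelike` or `query`, matching the nested `$if` in the entry string.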
> 3. Click Models: These are successors of preference pair models which I
> mentioned earlier. We have some options here as described in book "Click
> Models for Web Search" such as DBN, DCM, CCM etc. which will be trained
> on a relevance dataset to provide us with relevance scores of results links
> in our logs using which we'll generate Qrel file as used by xapian-letor.
… and you'll need a way to use letor from omega, or you'll have trained
a model for no good reason :)
> Latest stable release is 1.4 series but I actually have 1.5 series
> installed which I think is because I installed dev version from latest git
> master. I don't think that should be a problem here?
No, that's even better. I just didn't want you to be using the very old
version mentioned in the walkthrough :)
>> That looks to me like you haven't installed omega, but are trying to run
>> with the development version
>
> I've all xapian related executables in /usr/local/bin including omindex.
> Does that suggest Omega is installed?
Yes. But if you follow the walkthrough, it copies the uninstalled version of the
omega CGI.
>> When you ran `make install` for omega, it will have copied the CGI
>> somewhere
>
> In /usr/local/lib/xapian-omega/bin, I can't find a CGI, only these files:
> mhtml2html, omega, outlookmsg2html, rfc822tohtml and vcard2text.
omega is the CGI (I think).

J

-- 
 James Aylett
 devfort.com — spacelog.org — tartarus.org/james/

Vivek Pal

2017-Mar-23 06:18 UTC


GSoC 2017: Letor Click Data Mining

> You could do that by identifying the search session instead of the user,
> which makes it closer to what we need than to something that might trip you
> into privacy concerns.
Okay, that would be much better. :)
> Third records some information about what sort of query it is — add,
> morelike or a plain query. Last provides the estimated match size and then
> the HTTP referrer if one were set. Neither is particularly interesting in
> this case.
Thanks for the explanation. So, as I understand it, we'll need to log some
more information than this to be able to train click models for relevance
judgements.
> and you'll need a way to use letor from omega, or you'll have trained a
> model for no good reason :)
Sorry, I may have misunderstood you here, but why would we need a way to use
letor from omega? To train the Letor module, wouldn't we just need two files,
i.e. Query and Qrel, as mentioned in the xapian-letor docs? The Letor API can
then generate the final training file from those two files.
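
For illustration, a minimal sketch of producing Qrel lines from mined judgements, assuming the TREC-style layout `query_id Q0 doc_id label`; the exact format expected should be checked against the xapian-letor docs. `qrel_lines` is a hypothetical helper name.

```python
def qrel_lines(judgements):
    """Yield one Qrel line per (query_id, doc_id, label) tuple.

    Assumes the TREC-style layout "query_id Q0 doc_id label".
    """
    for query_id, doc_id, label in judgements:
        yield f"{query_id} Q0 {doc_id} {label}"
```

Writing the file is then a matter of joining these lines with newlines, for example `"\n".join(qrel_lines(judgements))`.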

And to mine the relevance judgements for the Qrel file from the logs, we'll
need to train one of the click models, such as DBN.
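
As a rough idea of what mining judgements from logs could look like, here is a minimal counting sketch in the spirit of the simplified DBN (SDBN) model. It assumes each logged session provides the query, the presented ranking and the clicked documents, treats everything down to the last click as examined, treats the last click as the satisfying one, and skips sessions without clicks (other treatments exist). `sdbn_relevance` and the session tuple layout are hypothetical.

```python
from collections import defaultdict


def sdbn_relevance(sessions):
    """Estimate relevance per (query, docid) from click sessions.

    sessions: iterable of (query, ranked_docids, clicked_docids).
    Returns {(query, docid): attractiveness * satisfaction}.
    """
    examined = defaultdict(int)  # times shown at or above the last click
    clicks = defaultdict(int)    # times clicked
    final = defaultdict(int)     # times it was the last click of a session
    for query, ranking, clicked in sessions:
        clicked = set(clicked)
        click_ranks = [i for i, doc in enumerate(ranking) if doc in clicked]
        if not click_ranks:
            continue  # simplification: ignore sessions with no clicks
        last_rank = click_ranks[-1]
        for doc in ranking[:last_rank + 1]:
            examined[(query, doc)] += 1
            if doc in clicked:
                clicks[(query, doc)] += 1
        final[(query, ranking[last_rank])] += 1
    scores = {}
    for key, shown in examined.items():
        attract = clicks[key] / shown
        satisfy = final[key] / clicks[key] if clicks[key] else 0.0
        scores[key] = attract * satisfy
    return scores
```

The resulting scores could then be thresholded into binary labels for the Qrel file; the threshold itself would need tuning and is not something the thread specifies.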

Is there a better way to mine the relevance judgements than click models?
> Yes. But if you follow the walkthrough, it copies the uninstalled version
> of the omega CGI. omega is the CGI (I think).
Oh, I thought it'd be a .cgi file. Okay, so I just need to copy this omega
from /usr/local/lib/xapian-omega/bin to /usr/lib/cgi-bin and work with it.

Thanks,
Vivek
