Hi,

I spent this week mostly understanding how the existing templates work and setting up the Omega CGI on my system, to get a better grasp of how things actually look when using different templates. This helped me look at the task of logging the click data from a better perspective. As already documented on the project's wiki page, we need the following fields in separate columns:

1. ID: some identifier for each entry
2. QUERY: text of the query
3. URLs: list of the URLs of the documents displayed on the result page
4. CLICKS: list of clicks, where each element is the number of times the corresponding URL was clicked

It seems more natural to me to implement a secondary log command and trigger it every time a new query is entered into the query template. It would create a log file with the above columns/fields, i.e. a unique identifier for each log entry, the entered query text, the list of document URLs displayed, and a list of the number of times the corresponding URL was clicked (all the elements in this list will be initialised to 0, as no clicks have occurred yet; see the P.S. below for a sketch of what an entry might look like).

Once we have the log file, all we need to do is update the fourth column with click information whenever a click happens, by looking for the correct entry in the file (e.g. by matching the query text) and updating the list in the fourth column accordingly.

Does this entire idea sound workable? I'm not entirely sure how to achieve the last part, i.e. updating the log file with click information, but as discussed earlier, a log template could be helpful here. As of now, I think it could be implemented like the topterms template (i.e. without any structure to be displayed), where we'd just invoke some OmegaScript commands to do the update.

Another question -- we'd need to trigger this template whenever a click happens, so is it possible to get such behaviour from within the query template through some existing OmegaScript commands?

Thanks,
Vivek
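P.S. For concreteness, here is roughly what I imagine a single entry in that log file looking like (purely illustrative; the separators and URLs are made up):

    1 book ["url1","url2","url3"] [0,0,0]

and after, say, two clicks on the second URL, the entry would be updated in place to:

    1 book ["url1","url2","url3"] [0,2,0]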
On 3 Jun 2017, at 22:08, Vivek Pal <vivekpal.dtu at gmail.com> wrote:

> This helped me look at the task of logging the click data from a better
> perspective. As already documented on the project's wiki page, we need
> the following fields in separate columns:
>
> 1. ID: some identifier for each entry
> 2. QUERY: text of the query
> 3. URLs: list of the URLs of the documents displayed on the result page
> 4. CLICKS: list of clicks, where each element is the number of times
>    the corresponding URL was clicked

That shouldn't be the logging format, for reasons I'll get into shortly. That's an intermediate view which you'll need to generate from the logging, which will enable you to create the input files for letor training.

> It seems more natural to me to implement a secondary log command and
> trigger it every time a new query is entered into the query template.
> It would create a log file with the above columns/fields, i.e. a unique
> identifier for each log entry, the entered query text, the list of
> document URLs displayed, and a list of the number of times the
> corresponding URL was clicked (all the elements in this list will be
> initialised to 0, as no clicks have occurred yet).
>
> Once we have the log file, all we need to do is update the fourth column
> with click information whenever a click happens, by looking for the
> correct entry in the file (e.g. by matching the query text) and updating
> the list in the fourth column accordingly.
>
> Does this entire idea sound workable?

The problem is "all we need to do is update the fourth column". Updating is hard, in the sense that every thread processing web requests has to be able to update the same view (which means you're basically implementing some sort of database, and have to consider concurrent access and updating).

Easier would likely be to log each individual click, and then provide something that can process those "raw" files into the intermediate format you need for your click model ahead of letor training. I think this makes your job at this point easier, because now you're looking to emit a pair of files which you can later roll up into the above format (which is a fairly simple aggregation step).

One would contain entries as follows, with a new entry for each executed search:

ID: some identifier for each query
QUERY: text of the query (when the query is run)
URLs: every URL displayed (or alternatively the Xapian docid; this might be easier)
OFFSET: otherwise you'll have difficulty coping with result pages other than the first page (when this happens, the query ID should probably remain the same, and when you aggregate you can "glue" the different pages together)

The other would then be the clicks, so for each URL clicked in a result page, emit:

ID: the query identifier that matches the entry in the search log
URL: the URL redirected to (again, or the Xapian docid)

This means you need to be able to generate an ID for each query, and also that each clickable URL in the results page will need to go via the omega CGI using a different template whose job it is to log ID & URL to the click log and then redirect to URL. Once generated, the ID can be passed through from call to call (including on pagination). (We'll need a new ID when the query is changed, in the same way that we reset the page offset, which works by considering xP, xDB, and xFILTERS.)

If you record the Xapian docid rather than the URL, it's both more compact and easier to serialise for the search entries (eg something like: ID,docid1|docid2|docid3|…,OFFSET,QUERY).
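To make the template side concrete, here's a very rough OmegaScript sketch (untested; "qid" and "url" are hypothetical CGI parameters, with qid carrying the generated query ID, and I'm assuming $httpheader can be used to emit the redirect). The query template might log each executed search with something along these lines:

    $log{search.log,$cgi{qid} $hitlist{$id|} $topdoc $query}

and the click template's entire job could then be roughly:

    $log{click.log,$cgi{qid} $cgi{url}}$httpheader{Location,$cgi{url}}

leaving the web server to turn the Location header into a redirect to the clicked URL.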
Recording the docid rather than the URL also copes with multiple documents that lead to the same URL but with different metadata. The downside is that if someone updates the Xapian document to point to something totally different, it's difficult to analyse across that boundary, and so the letor model will be less helpful. (Of course, if it's for an internal search system, the same is true of URLs if someone changes the content significantly. So it's not a huge downside, and can simply be documented.)

J

--
James Aylett
devfort.com — spacelog.org — tartarus.org/james/
Hi James,

> ID: some identifier for each query
> QUERY: text of the query (when the query is run)
> URLs: every URL displayed (or alternatively the Xapian docid; this
> might be easier)
> OFFSET: otherwise you'll have difficulty coping with result pages other
> than the first page (when this happens, the query ID should probably
> remain the same, and when you aggregate you can "glue" the different
> pages together)

I'm not clear on what the OFFSET really represents. Could you please explain a bit?

And I think we certainly need the CLICKS field, as otherwise we can't capture the click information, which is essential for training the click model. This field will need to be of the same size and structure as the URLs field (i.e. a list), e.g. [0,1,2,0,0] for 5 URLs on the result page.

> The other would then be the clicks, so for each URL clicked in a result
> page, emit:
>
> ID: the query identifier that matches the entry in the search log
> URL: the URL redirected to (again, or the Xapian docid)
>
> This means you need to be able to generate an ID for each query, and
> also that each clickable URL in the results page will need to go via the
> omega CGI using a different template whose job it is to log ID & URL
> to the click log and then redirect to URL. Once generated, the ID can
> be passed through from call to call (including on pagination).

So, whenever a click occurs on the result page, we log the query ID and the clicked URL via a different template which is triggered by each click event. But I'm not sure how we will be able to capture the click information if we don't record the number of times each URL was clicked in a separate CLICKS field?

Also, just to be sure: we will log such pairs of query ID and URL in separate files, to be aggregated later into a single file? In the end, it seems we will have two files -- one created from the query template, containing a separate entry for each executed search in the format you described previously, and another containing query IDs and clicked URLs, logged using a different template? (I've put a toy example of my understanding of the aggregation in the P.S. below.)

I also wanted to ask how the log command ($log{query.log}) in the query template works. It doesn't seem to comply with the format mentioned in its documentation: it expects two arguments, but we provide only one here, i.e. query.log. What does this argument mean?

Thanks,
Vivek
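P.S. Here is the toy example of the aggregation step as I understand it (all data made up). If the search log contains the entry

    42 101|102|103 0 book

and the click log contains

    42 102
    42 102

then rolling them up would produce a single row in the intermediate format: ID=42, QUERY="book", URLs=[101,102,103] (as docids), CLICKS=[0,2,0].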