Hi,

I spent this week mostly understanding how the existing templates work and setting up the Omega CGI on my system, to get a better grasp of how things actually look when using different templates. This helped me look at the task of logging the click data from a better perspective. As already documented on the project's wiki page, we need the following fields in separate columns:

1. ID: some identifier for each entry
2. QUERY: text of the query
3. URLs: list of the URLs of the documents displayed on the result page
4. CLICKS: list of clicks, where each element is the number of times the corresponding URL was clicked

It seems more natural to me to implement a secondary log command and trigger it every time a new query is entered into the query template. It would create a log file with the above columns/fields, i.e. a unique identifier for each log entry, the entered query text, the list of document URLs displayed, and a list of the number of times the corresponding URL was clicked (all the elements in this list will be initialised to 0, as no clicks have occurred yet; see the P.S. below for a sketch of what an entry might look like).

Once we have the log file, all we need to do is update the fourth column with click information whenever a click happens, by looking for the correct entry in the file (e.g. by matching the query text) and updating the list in the fourth column accordingly.

Does this entire idea sound workable? I'm not entirely sure how to achieve the last part, i.e. updating the log file with click information, but as discussed earlier, a log template could be helpful here. As of now, I think it could be implemented like the topterms template (i.e. without any structure to be displayed), where we'd just invoke some OmegaScript commands to do the update.

Another question -- we'd need to trigger this template whenever a click happens, so is it possible to get such behaviour from within the query template through some existing OmegaScript commands?

Thanks,
Vivek
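P.S. For concreteness, here is roughly what I imagine a single entry in that log file looking like (purely illustrative; the separators and URLs are made up):

    1 book ["url1","url2","url3"] [0,0,0]

and after, say, two clicks on the second URL, the entry would be updated in place to:

    1 book ["url1","url2","url3"] [0,2,0]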
On 3 Jun 2017, at 22:08, Vivek Pal <vivekpal.dtu at gmail.com> wrote:

> This helped me look at the task of logging the click data from a better
> perspective. As already documented on the project's wiki page, we need
> the following fields in separate columns:
>
> 1. ID: some identifier for each entry
> 2. QUERY: text of the query
> 3. URLs: list of the URLs of the documents displayed on the result page
> 4. CLICKS: list of clicks, where each element is the number of times
>    the corresponding URL was clicked

That shouldn't be the logging format, for reasons I'll get into shortly. That's an intermediate view which you'll need to generate from the logging, which will enable you to create the input files for letor training.

> It seems more natural to me to implement a secondary log command and
> trigger it every time a new query is entered into the query template.
> It would create a log file with the above columns/fields, i.e. a unique
> identifier for each log entry, the entered query text, the list of
> document URLs displayed, and a list of the number of times the
> corresponding URL was clicked (all the elements in this list will be
> initialised to 0, as no clicks have occurred yet).
>
> Once we have the log file, all we need to do is update the fourth column
> with click information whenever a click happens, by looking for the
> correct entry in the file (e.g. by matching the query text) and updating
> the list in the fourth column accordingly.
>
> Does this entire idea sound workable?

The problem is "all we need to do is update the fourth column". Updating is hard, in the sense that every thread processing web requests has to be able to update the same view (which means you're basically implementing some sort of database, and have to consider concurrent access and updating).

Easier would likely be to log each individual click, and then provide something that can process those "raw" files into the intermediate format you need for your click model ahead of letor training. I think this makes your job at this point easier, because now you're looking to emit a pair of files which you can later roll up into the above format (which is a fairly simple aggregation step).

One would contain entries as follows, with a new entry for each executed search:

ID: some identifier for each query
QUERY: text of the query (when the query is run)
URLs: every URL displayed (or alternatively the Xapian docid; this might be easier)
OFFSET: otherwise you'll have difficulty coping with result pages other than the first page (when this happens, the query ID should probably remain the same, and when you aggregate you can "glue" the different pages together)

The other would then be the clicks, so for each URL clicked in a result page, emit:

ID: the query identifier that matches the entry in the search log
URL: the URL redirected to (again, or the Xapian docid)

This means you need to be able to generate an ID for each query, and also that each clickable URL in the results page will need to go via the omega CGI using a different template whose job it is to log ID & URL to the click log and then redirect to URL. Once generated, the ID can be passed through from call to call (including on pagination). (We'll need a new ID when the query is changed, in the same way that we reset the page offset, which works by considering xP, xDB, and xFILTERS.)

If you record the Xapian docid rather than the URL, it's both more compact and easier to serialise for the search entries (eg something like: ID,docid1|docid2|docid3|…,OFFSET,QUERY).
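To make the template side concrete, here's a very rough OmegaScript sketch (untested; "qid" and "url" are hypothetical CGI parameters, with qid carrying the generated query ID, and I'm assuming $httpheader can be used to emit the redirect). The query template might log each executed search with something along these lines:

    $log{search.log,$cgi{qid} $hitlist{$id|} $topdoc $query}

and the click template's entire job could then be roughly:

    $log{click.log,$cgi{qid} $cgi{url}}$httpheader{Location,$cgi{url}}

leaving the web server to turn the Location header into a redirect to the clicked URL.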
Recording the docid rather than the URL also copes with multiple documents that lead to the same URL but with different metadata. The downside is that if someone updates the Xapian document to point to something totally different, it's difficult to analyse across that boundary, and so the letor model will be less helpful. (Of course, if it's for an internal search system, the same is true of URLs if someone changes the content significantly. So it's not a huge downside, and can simply be documented.)

J

--
James Aylett
devfort.com — spacelog.org — tartarus.org/james/
Hi James,

> ID: some identifier for each query
> QUERY: text of the query (when the query is run)
> URLs: every URL displayed (or alternatively the Xapian docid; this
> might be easier)
> OFFSET: otherwise you'll have difficulty coping with result pages other
> than the first page (when this happens, the query ID should probably
> remain the same, and when you aggregate you can "glue" the different
> pages together)

I'm not clear on what the OFFSET really represents. Could you please explain a bit?

And I think we certainly need the CLICKS field, as otherwise we can't capture the click information, which is essential for training the click model. This field will need to be of the same size and structure as the URLs field (i.e. a list), e.g. [0,1,2,0,0] for 5 URLs on the result page.

> The other would then be the clicks, so for each URL clicked in a result
> page, emit:
>
> ID: the query identifier that matches the entry in the search log
> URL: the URL redirected to (again, or the Xapian docid)
>
> This means you need to be able to generate an ID for each query, and
> also that each clickable URL in the results page will need to go via the
> omega CGI using a different template whose job it is to log ID & URL
> to the click log and then redirect to URL. Once generated, the ID can
> be passed through from call to call (including on pagination).

So, whenever a click occurs on the result page, we log the query ID and the clicked URL via a different template which is triggered by each click event. But I'm not sure how we will be able to capture the click information if we don't record the number of times each URL was clicked in a separate CLICKS field?

Also, just to be sure: we will log such pairs of query ID and URL in separate files, to be aggregated later into a single file? In the end, it seems we will have two files -- one created from the query template, containing a separate entry for each executed search in the format you described previously, and another containing query IDs and clicked URLs, logged using a different template? (I've put a toy example of my understanding of the aggregation in the P.S. below.)

I also wanted to ask how the log command ($log{query.log}) in the query template works. It doesn't seem to comply with the format mentioned in its documentation: it expects two arguments, but we provide only one here, i.e. query.log. What does this argument mean?

Thanks,
Vivek
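P.S. Here is the toy example of the aggregation step as I understand it (all data made up). If the search log contains the entry

    42 101|102|103 0 book

and the click log contains

    42 102
    42 102

then rolling them up would produce a single row in the intermediate format: ID=42, QUERY="book", URLs=[101,102,103] (as docids), CLICKS=[0,2,0].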