Hello,
I am Uppinder Chugh (irc nick: icebyte), a senior year undergraduate
student majoring in Computer Science and Engineering at Indian Institute of
Technology, Guwahati. I'm interested to work on the idea of adding the
functionality of search result diversification to Xapian. After having
brief conversations with mentors on IRC, I would like to compile the
discussions and further discuss the approach I have in mind for
implementing the same. First I'd like to give a brief justification of
choosing the proposed method:
In the literature [1], diversification of search results can be broadly
classified into two methods: Implicit (uses document features) and Explicit
(uses query based features such as query logs/click through logs). The
current state-of-the-art in terms of effectiveness and efficiency are
explicit methods, but with major drawbacks:
a) Requires external source of data (other than documents) such as
query logs/click logs.
b) Doesn't work well on rare queries (mainly due to lack of such data
in the external source).
To make diversification in search results more widely usable and have
it work out of the box, it is desirable to implement an implicit method.
After going through some papers I found [2] to be suitable for our need. It
mentions an efficient implicit method that runs under 100ms and has
effectiveness close to that of the best explicit methods. I'm briefly
presenting the details of the methods that I propose to implement:
1) Implementing GLS-MPT:
First I'll be implementing the GLS algorithm (details on page 6 of [2])
with objective function based on MPT (details on page 13 of [2]). According
to the paper, avg. latency for the GLS-MPT is ~800ms (table V of [2]),
which involves preprocessing (storing pairwise distance of top N documents,
N = 100) and the actual diversification which finally returns top-k (k 20)
diversified results. Note: k = 20 and N = 100 are suitable as usually
users don't page too far and even if they do, it's suitable to let the
results tail off in relevance.
The above procedure can be included in the pipeline of producing the final
results by wrapping Xapian::Mset Matcher::get_set(..). The algorithm
requires top-N document vectors and their relevant scores, which are
readily available. This gives a complete diversification implementation,
although it being quite slow.
2) Optimising GLS-MPT:
One of the main contributions of the paper is optimising GLS-MPT. It
involves clustering the top N documents and storing the distance between
each document in N and each cluster (there are k clusters). They term this
method C2-GLS (details on page 7 and 8 of [2]). The clustering method used
is LC [3]. The avg. latency reduces to ~20-40 ms (table V of [2]), while
improving the effectiveness.
This involves modifying the code of 1) and implementing the LC clustering
algorithm. LC can be added as an algorithm in xapian-core/cluster.
3) Evaluation:
The evaluation metric widely used in literature are alpha-NDCG [4] and
ERR-IA [5]. These are also among the metrics used by the main paper [2],
thus allowing for direct comparison with the author's results. The data set
corpus used in the literature the most is ClueWeb09 with queries from TREC
2009/2010 diversification task. The problem is that both these are not
freely available. For implementation, I'd like to implement alpha-NDCG and
if time persists then ERR-IA.
Please let me know what you think, any kind of feedback is appreciated.
[1] Search Result Diversification Santos et al. 2015 (Survey Paper)
[2] Scalable and Efficient Web Search Result Diversification Naini et al.
2016
[3] Modelling efficient novelty-based search result diversification in
metric spaces Gil-Costa et al. 2013
[4] Novelty and Diversity in Information Retrieval Evaluation Clarke et al.
2008
[5] Intent-based diversification of web search results: Metrics and
algorithms. Chapelle et al. 2011b
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20180225/e1d3b296/attachment.html>