thr3ads.net - Xapian devel - GSoC-2017 Introduction and Project Discussion [Mar 2017]

Shivang Bansal

2017-Mar-16 13:31 UTC

GSoC-2017 Introduction and Project Discussion

Hello,

I'm Shivang Bansal, a 3rd year Computer Science Engineering undergraduate
at the Institute of Engineering & Technology in Lucknow, India. This mail
expresses my interest in this year's Google Summer of Code program. I
apologize for getting in touch so late. I would have contacted you
earlier, but the sudden demise of my grandfather prevented me from doing so.

I am interested in working with your organisation for two reasons. *First*,
I believe that Xapian gives any developer an incredibly useful way to bring
the advantages of a search engine to their work. Had I known of its existence
earlier, I would have used it extensively in a web scraping project in
which I scraped product information from various e-commerce websites into
a database, so that users could search for a product and compare its price
across different sites (just like any well-known price comparison website).
*Second*, I am fascinated by Information Retrieval and I am eager to explore
its computational side. Specifically, I am interested in updating and
complementing the current library to further improve its functionality.

The project I would like to work on this summer is *Weighting Schemes*.
So far, I have successfully built Xapian on my PC, read the complete guide at
http://getting-started-with-xapian.readthedocs.io/en/latest/index.html and
worked through *A practical example* from the guide. I faced some problems
regarding the *shared library* *xapian-delve-1.5*, but solved them
eventually after some googling and going through some of the related
messages in the IRC archives.

Moreover, I have gone through the code base, particularly *xapian-core*,
studied the *Xapian::Weight* class thoroughly, and looked at the
implementations of some of the existing weighting schemes.

Currently, I've started to look at the ticket
https://trac.xapian.org/ticket/744 and am trying to devise a way so that the
*get_sumpart()* method in every weight subclass does not need to be updated
after the merging. Also, I am going through *xapian-api* to get a better
feel for the code base.

The project ideas which I would like to propose and get some feedback on
are:

*1)* Currently, Xapian supports weighting schemes which rely on a
bag-of-words representation of documents, which assumes that the terms in a
document are independent of each other. That assumption is debatable, because
in some cases word order and word dependence do matter, e.g. *Mary is
quicker than John* and *John is quicker than Mary* are different.
I want to implement the *Graph-of-word* representation in Xapian, which
addresses such cases by capturing the order relationships between the
terms in a document using an unweighted directed graph of terms. This
representation can then be used to define a new weighting scheme,
*TW-IDF* (TW = Term Weight, IDF = Inverse Document Frequency), which
*significantly outperforms* *TF-IDF* and *BM25*, and in some cases its
extension *BM25+*, on various standard TREC datasets. This effectiveness is
not achieved at the cost of efficiency, as confirmed by the experiments
reported in [2].

The papers I have referred to for the above are:
[1] https://www.researchgate.net/publication/220479875_Graph-based_term_weighting_for_information_retrieval
[2] https://pdfs.semanticscholar.org/8eac/d0f01ab0f53706561dda0ce8d1f96544a348.pdf
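
To make the graph-of-word idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not Xapian code and not necessarily the exact construction used in the papers: the function names are hypothetical, the sliding window size of 4 is an assumed default, and a term's weight is assumed to be its indegree in the graph.

```python
from collections import defaultdict


def graph_of_word(tokens, window=4):
    """Build an unweighted directed graph: term -> set of terms that follow it.

    Assumption: an edge goes from each term to the distinct terms that appear
    within the next (window - 1) positions. The window size is illustrative.
    """
    edges = defaultdict(set)
    for i, term in enumerate(tokens):
        edges[term]  # register the term as a vertex even if it has no out-edges
        for follower in tokens[i + 1 : i + window]:
            if follower != term:
                edges[term].add(follower)
    return dict(edges)


def indegrees(edges):
    """Count incoming edges per term (assumed here to act as the term weight)."""
    counts = {term: 0 for term in edges}
    for successors in edges.values():
        for successor in successors:
            counts[successor] += 1
    return counts


first = indegrees(graph_of_word("mary is quicker than john".split()))
second = indegrees(graph_of_word("john is quicker than mary".split()))
# The two sentences contain the same bag of words but get different weights:
# first["mary"] == 0 and first["john"] == 3, while second swaps them.
```

A bag-of-words model would give both sentences identical term statistics; here the word order changes the weights.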

*2)* There is another weighting scheme I would like to implement in Xapian,
*TF-ATO* (Term Frequency - Average Term Occurrences), described in
http://eprints.nottingham.ac.uk/31329/1/dls_ukci2014.pdf, with a
discriminative approach which uses the *document centroid* as a *threshold*
for normalising the documents in a collection. The document centroid is
used to remove less significant weights from the documents, which helps to
achieve higher retrieval effectiveness. The average improvement in
*precision* (non-interpolated average precision) of TF-ATO over TF-IDF
is *40-50%*. With some extra work, this scheme also performs effectively
on dynamic document collections.
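
A rough Python sketch of how this could work is below. The formulas are assumptions based on my reading and should be checked against the paper: a term's weight is taken as its frequency divided by the document's average term occurrences (total occurrences divided by the number of unique terms), the centroid is the mean weight of each term over all documents, and a weight is kept only if it exceeds the centroid value for that term. All function names are hypothetical.

```python
def tf_ato_weights(docs):
    """docs: list of {term: term_frequency}. Returns a list of {term: weight}."""
    weighted = []
    for tf in docs:
        if not tf:
            weighted.append({})
            continue
        # Average term occurrences: total occurrences / number of unique terms.
        ato = sum(tf.values()) / len(tf)
        weighted.append({term: freq / ato for term, freq in tf.items()})
    return weighted


def document_centroid(weighted):
    """Mean weight of each term over all documents (absent terms count as 0)."""
    totals = {}
    for weights in weighted:
        for term, weight in weights.items():
            totals[term] = totals.get(term, 0.0) + weight
    return {term: total / len(weighted) for term, total in totals.items()}


def discriminate(weighted, centroid):
    """Drop every weight that does not exceed the centroid value of its term."""
    return [
        {term: w for term, w in weights.items() if w > centroid[term]}
        for weights in weighted
    ]


docs = [{"xapian": 4, "search": 1, "the": 1}, {"weight": 2, "the": 2}]
weights = tf_ato_weights(docs)
centroid = document_centroid(weights)
reduced = discriminate(weights, centroid)
# "the" is removed from the first document (0.5 is below its centroid 0.75)
# but kept in the second (1.0 is above it).
```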

It would be great to have your opinions on these project ideas, as that will
help me come up with a proposal on how to implement them. I realise that
it's quite late, but I will start working on my proposal as soon as possible
so that I can improve it based on your feedback.

I'm sorry if this mail is too long. Thank you so much for your time.

Shivang Bansal

James Aylett

2017-Mar-19 12:15 UTC

On 16 Mar 2017, at 13:31, Shivang Bansal <shivangbansal1995 at gmail.com>
wrote:
> I'm Shivang Bansal, a 3rd year Computer Science Engineering
> undergraduate at Institute of Engineering & Technology in Lucknow, India.

Hi Shivang! Welcome to Xapian :)

> The project I would like to work on in the summers is Weighting Schemes.
> Till now, I have successfully built Xapian on my PC along with reading the
> complete guide given on the following link:
> http://getting-started-with-xapian.readthedocs.io/en/latest/index.html and
> worked on A practical example given in the guide, in which, although I faced
> some problems regarding the shared library xapian-delve-1.5 but solved them
> eventually after some googling and going through some of the related
> messages in IRC Archives.

It's worth pointing out that xapian-delve is not a shared library. Perhaps
you mean you had difficulties with shared libraries while building or running
it? The fact that you're using xapian-delve-1.5 tells me that you're
using the development work from git, which is not the recommended approach in
the getting started guide, but is absolutely the right way when working on
Xapian's codebase. (You probably need to look at the Xapian developer guide
to ensure you're set up properly with the development codebase:
https://xapian-developer-guide.readthedocs.io/en/latest/)

> The project ideas which I would like to propose and get some feedback on
> are:
>
> 1) Currently, Xapian supports the weighting schemes which rely on
> bag-of-word representation of documents assuming that each term in the
> document is independent of each other which however is a debatable topic as
> in some places word order and word dependence do matter, e.g. Mary is
> quicker than John and John is quicker than Mary are different.
> I want to implement Graph-of-word representation in Xapian which is a
> solution to such cases as it considers the relationship order between the
> terms in a document using an unweighted directed graph of terms. This
> representation can be further used to define a new weighting scheme, TW-IDF
> (TW = Term Weight, IDF = Inverse Document Frequency) which significantly
> outperforms TF-IDF & BM25 and in some cases its extension BM25+ on various
> standard TREC datasets. This effectiveness is not achieved at the cost of
> its efficiency. It is confirmed by various experiments shown in [2].

You'll need to propose quite a lot of detail on how you're going to
implement this using the Xapian backend database and weighting system. I suspect
you'll have to extend it a fair amount to support TW-IDF, because we have no
graph support at present. I haven't read the paper though, so it's
possible you can do this using some preprocessing in some way.

It's also worth noting that we've sometimes seen quite different
evaluation results to the academic research in the past. There's a module
that implements some evaluation metrics
(https://github.com/samuelharden/xapian-evaluation) which can be used to gauge
how a new weighting scheme compares to the others we have, designed to run with
TREC data. (We also have organisational access to the FIRE datasets for this
purpose.)

(Ideally we'd merge the evaluation work into the main repository rather than
keeping it separate, but we haven't had time to do that yet. That could be a
useful part of a summer of code project looking at weighting schemes.)
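
For reference, non-interpolated average precision, the kind of metric quoted elsewhere in this thread, can be computed for a single query as in the Python sketch below. This is a generic, self-contained illustration with a hypothetical function name; it is not code from the xapian-evaluation module.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    ranked_ids: document ids in the order the weighting scheme ranked them.
    relevant_ids: ids judged relevant for the query (e.g. from TREC qrels).
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Relevant documents 1 and 5 are found at ranks 2 and 4; document 9 is missed.
ap = average_precision([3, 1, 7, 5], {1, 5, 9})  # (1/2 + 2/4) / 3
```

Averaging this value over all queries gives mean average precision, which allows two weighting schemes to be compared on the same judged collection.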

> 2) There is another Weighting Scheme I would like to implement in Xapian,
> TF-ATO (Term Frequency - Average Term Occurrences) mentioned in
> http://eprints.nottingham.ac.uk/31329/1/dls_ukci2014.pdf with a
> discriminative approach which uses document centroid as a threshold for
> normalising documents in document collection. This document centroid is
> used to remove less significant weights in the documents and helps to
> achieve higher retrieval effectiveness. The average improving rate in
> precision (non-interpolated average precision) between TF-IDF and TF-ATO is
> 40-50%. This scheme with some extra work, also performs effectively for
> dynamic data collection.

Again, you'll need to propose how to fit TF-ATO into Xapian's database
and weighting framework. Can the document centroid approach be done as a
processing step during indexing? (Again, I haven't had time to read the
paper yet.)

Best,
James

-- 
 James Aylett
 devfort.com — spacelog.org — tartarus.org/james/

Shivang Bansal

2017-Mar-19 17:43 UTC

> Hi Shivang! Welcome to Xapian :)

Hello James. Thank you :)

> It's worth pointing out that xapian-delve is not a shared library. Perhaps
> you mean you had difficulties with shared libraries while building or
> running it?

Yes, that's exactly what I meant: libxapian-1.5.so, to be specific.

> The fact that you're using xapian-delve-1.5 tells me that you're using the
> development work from git, which is not the recommended approach in the
> getting started guide, but is absolutely the right way when working on
> Xapian's codebase. (You probably need to look at the Xapian developer
> guide to ensure you're set up properly with the development codebase:
> https://xapian-developer-guide.readthedocs.io/en/latest/)

Thanks for mentioning this; I've got your point.

> You'll need to propose quite a lot of detail on how you're going to
> implement this using the Xapian backend database and weighting system. I
> suspect you'll have to extend it a fair amount to support TW-IDF, because
> we have no graph support at present. I haven't read the paper though, so
> it's possible you can do this using some preprocessing in some way.

I will certainly try to propose a detailed plan for how the
graph-of-words model could be implemented in Xapian. The fact that Xapian
does not support graphs at present could make things a bit difficult, but I
strongly believe that it would be worthwhile, as this model judges
terms according to their order relationships in the documents, which would
improve the effectiveness of search results (please let me know if you
think otherwise).

> It's also worth noting that we've sometimes seen quite different
> evaluation results to the academic research in the past. There's a module
> that implements some evaluation metrics
> (https://github.com/samuelharden/xapian-evaluation) which can be used to
> gauge how a new weighting scheme compares to the others we have, designed
> to run with TREC data.

This module will surely play a key role in the evaluation. I will look into
it as well.

> Again, you'll need to propose how to fit TF-ATO into Xapian's database
> and weighting framework. Can the document centroid approach be done as a
> processing step during indexing? (Again, I haven't had time to read the
> paper yet.)

I didn't quite get what you mean by 'during indexing'. What I can think of
is the point after we have indexed all the terms and assigned a weight to
each index term. If that's what you mean, then yes, the document centroid
can be computed at that point, from the documents' vectors of term
weights. The basic aim of the discriminative approach is to reduce the size
of the documents in the dataset by removing the terms whose weight is lower
than the document centroid. According to the paper, this approach gives an
average reduction of 2.3% in the size of the dataset.

I will come up with a detailed proposal for this scheme as well.

Thank you for your time.

Regards,
Shivang Bansal
