Hello devs.

I would like to propose how I plan to improve the clustering branch during this GSoC and get it into a state where it can be integrated into Xapian. I have identified three areas of work which were not touched last time.

1) Automated Performance Analysis

I had roughly implemented two evaluation techniques previously (distance between documents and centroids within clusters, and the Silhouette coefficient), but I hadn't implemented them within Xapian, so it wasn't possible to automate the process of evaluating the clustering results in the ClusterSet. It is therefore important to implement cluster evaluation techniques within a module (as a ClusterEvaluation class) so that users can get output on how they can improve their clustering by passing in the ClusterSet (and the labels if necessary). The cluster evaluation techniques that I would like to consider are:

a) Silhouette coefficient
b) Adjusted Rand Index
c) Fowlkes-Mallows index
d) F-measure
e) Homogeneity, Completeness and V-measure

2) Dimensionality Reduction

Due to the high dimensionality of text documents, it is necessary to have at least one semantic dimensionality reduction technique. For this, I would like to implement Latent Semantic Analysis (LSA) for dimensionality reduction of the input document vectors. LSA transforms the original data into a different space so that two documents/words about the same concept are mapped close together (and so have higher cosine similarity). LSA achieves this by Singular Value Decomposition (SVD) of the term-document matrix. What I have found so far is that when we eliminate words that occur rarely in documents, the algorithm runs very fast; the main runtime performance problem stems from the document vectors being very high-dimensional.

3) Hierarchical Clustering

Since the clustering API is already in place, I would like to implement a hierarchical clusterer to cluster the search results.
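To make the evaluation idea above concrete, here is a minimal, illustrative sketch of the Silhouette coefficient in pure Python. This is not the proposed Xapian C++ API (there `silhouette` would live in the hypothetical ClusterEvaluation class and operate on a ClusterSet); it just shows the computation: for each point, a is the mean distance to its own cluster, b is the lowest mean distance to any other cluster, and the score is (b - a) / max(a, b).

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(clusters):
    """Mean silhouette coefficient over all points.

    `clusters` is a list of clusters, each a list of points.
    """
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            others = [q for q in cluster if q is not p]
            if not others:
                scores.append(0.0)  # singleton clusters score 0 by convention
                continue
            # a: mean distance to the point's own cluster.
            a = sum(dist(p, q) for q in others) / len(others)
            # b: lowest mean distance to any other cluster.
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two compact, well-separated clusters should score close to +1.
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
print(silhouette(good))
```

A well-separated clustering scores near +1, while a clustering that splits natural groups across clusters scores near or below 0, which is exactly the signal the evaluation module would report back to users.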
Currently, I have created a new PR so that the work done previously can be reviewed and merged as soon as possible, and I am trying to optimize the code and find the different bottlenecks so that speed can be improved.

It would be great to have feedback on what everyone thinks about this, so that I can re-implement or improve things and make them better.

Thanks. :)
On 9 Mar 2017, at 05:18, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> Hello devs.

Hi Richhiey :-)

> 1) Automated Performance Analysis

This is definitely interesting. However it's not terribly worth having this unless/until we have more than one clustering system to evaluate, is it? (Beyond the uniform/random one, although I guess if any clusterer performs _worse_ than that, it's a bad sign!)

> 2) Dimensionality Reduction

Do you think we'll need to implement several of these, for different uses? If not, is there a reason you think LSA will work best? You talk about eliminating words that occur rarely in documents — could we have a quick-and-dirty approach that looks at the within-corpus frequency of terms?

(If multiple different approaches make sense, we'll end up having each encapsulated as a class or similar, in the same way we're doing with Clusterer. It's worth having an idea of this up front, in case it changes the approach.)

> 3) Hierarchical Clustering
> Since the clustering API is already in place, I would like to implement a hierarchical clusterer to cluster the search results.

Do you have a particular approach you think is a good one? Are you thinking agglomerative or divisive?

J

-- 
James Aylett
devfort.com — spacelog.org — tartarus.org/james/
> This is definitely interesting. However it's not terribly worth having
> this unless/until we have more than one clustering system to evaluate, is
> it? (Beyond the uniform/random one, although I guess if any clusterer
> performs _worse_ than that, it's a bad sign!)

This module can still help us with the KMeans clusterer already implemented, and since I would like to implement a hierarchical clusterer, it could help in relative comparison too.

Previously, I was looking at both internal and external evaluation techniques. But I guess a regular use case of this API will not provide a way to have ground-truth labels for documents, so internal evaluation techniques would be the better option. I would therefore like to change my approach and introduce a few internal clustering evaluation techniques:

1) Silhouette coefficient
2) Dunn index
3) Root Mean Square Standard Deviation
4) Calinski-Harabasz index
5) Davies-Bouldin index

I aim to have this performance analysis module implemented by the end of the community bonding period. Once this is set up, it will also be easier to code and evaluate newer clusterers, with minimal changes to the API.

> Do you think we'll need to implement several of these, for different uses?
> If not, is there a reason you think LSA will work best? You talk about
> eliminating words that occur rarely in documents — could we have a
> quick-and-dirty approach that looks at the within-corpus frequency of terms?

I did try the quick-and-dirty approach that you mentioned, but looking at within-corpus frequency removes a lot of words that could otherwise add meaning. It is also corpus-dependent, and hence a bad idea. For now, removing stop words and the stemmed duplicates within each Document has helped, but it would be better to add functionality for semantic dimensionality reduction like LSA. LSA would work best because, compared with various other methods, it is a text-mining tool rather than a purely statistical one.
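To illustrate the SVD idea behind LSA mentioned above, here is a minimal pure-Python sketch, not Xapian code: the function name `lsa_top_concept` and the toy term-document matrix are mine. It projects documents onto the single dominant latent concept by finding the top right singular vector of the term-document matrix, which is the dominant eigenvector of the document-document Gram matrix, found here by power iteration. Real LSA keeps the top k singular vectors rather than just one, and would use a proper sparse SVD.

```python
def lsa_top_concept(td, iters=100):
    """Project documents onto the top LSA concept.

    `td` is a term-document matrix (rows = terms, columns = documents).
    The dominant right singular vector of td is the dominant eigenvector
    of the Gram matrix td^T * td, computed by power iteration.
    """
    n_docs = len(td[0])
    # Gram matrix: gram[i][j] = dot(document column i, document column j).
    gram = [[sum(row[i] * row[j] for row in td) for j in range(n_docs)]
            for i in range(n_docs)]
    v = [1.0] * n_docs
    for _ in range(iters):
        w = [sum(gram[i][j] * v[j] for j in range(n_docs))
             for i in range(n_docs)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v  # each entry is a document's coordinate on the top concept

# Toy corpus: "cat"/"feline" occur only in docs 0-1, "dog"/"canine" in docs 2-3.
td = [
    [3, 2, 0, 0],  # "cat"
    [2, 3, 0, 0],  # "feline"
    [0, 0, 2, 1],  # "dog"
    [0, 0, 1, 2],  # "canine"
]
coords = lsa_top_concept(td)
print(coords)
```

The two cat-themed documents end up with nearly identical coordinates on the dominant concept even though they do not share identical term counts, which is the "documents about the same concept are mapped close" property described above.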
I'm not sure whether implementing more than one technique will be within the scope of GSoC, but I see no harm in planning for more. So we could create a class like DimReduction and sub-class it to implement the various techniques.

> Do you have a particular approach you think is a good one? Are you
> thinking agglomerative or divisive?

I was thinking of agglomerative clustering, where we start from individual clusters and work up to a single cluster containing all documents. This would be fairly simple to implement since we have the API fairly in place; we would only need a way to merge two clusters while going up the hierarchy tree. At the start, we initialize multiple clusters, each holding its own document, with the Cluster class. Then at each step we merge upwards by combining the contents of two Cluster objects into one and calculating the new cluster centroid. We repeat this iteratively until only one cluster remains.

I will document all my findings from this conversation and my previous ideas into a proposal and send it soon.

Thanks,
Richhiey
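The merging loop described above can be sketched in a few lines of pure Python. This is an illustrative toy, not the Xapian Cluster/ClusterSet API: documents are plain tuples, `agglomerate` is a hypothetical name, and the closest-pair criterion shown here (distance between cluster centroids, i.e. centroid linkage) is one assumption about how "merge two clusters" would be decided.

```python
def centroid(docs):
    """Mean vector of a list of document vectors."""
    n = len(docs)
    return tuple(sum(d[i] for d in docs) / n for i in range(len(docs[0])))

def sq_dist(a, b):
    """Squared Euclidean distance (enough for comparing closeness)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(docs, k=1):
    """Merge clusters bottom-up until only k remain.

    Starts with one cluster per document; at each step merges the two
    clusters whose centroids are closest and recomputes the merged
    cluster's centroid.
    """
    clusters = [[d] for d in docs]
    cents = [centroid(c) for c in clusters]
    while len(clusters) > k:
        # Find the closest pair of cluster centroids.
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: sq_dist(cents[ab[0]], cents[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
        cents = [centroid(c) for c in clusters]  # recompute centroids
    return clusters

docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(agglomerate(docs, k=2))
```

Stopping at k > 1 clusters (rather than merging all the way to one) is what a search-result clusterer would do in practice; recording the merge order instead would yield the full hierarchy tree.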