thr3ads.net - Xapian devel - GSOC-2016 Project : Clustering of search results [Mar 2016]

If this information is useful, please help other people find it:
Share via:

Richhiey Thomas

2016-Mar-05 17:28 UTC

GSOC-2016 Project : Clustering of search results

Hello devs,

I am Richhiey Thomas, pursuing my third year of undergraduate studies in
Computer Science from Mumbai University. I had gone through the project
list for this year and the project idea based on clustering caught my
attention. I spoke to Assem Chelli on IRC who guided me to the code and got
me started.

I started going through the code and have successfully built Xapian on my
machine. If I am not mistaken, the currently implemented clustering branch
has used heirarchial clustering which has quadratic complexity or higher,
which naturally makes it very inefficient for large data sets.

A better approach to clustering would be to use K-means clustering or a
variant (like bisecting K-means) which provides equal or better performance
than the implemented heirarchial technique having lesser (linear)
complexity, thus being more favourable to large datasets.

It would be great to have your opinions on how to go about this or any
other approaches that may be more favorable for document clustering in the
context of Xapian.

Thanks! :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20160305/cd3c8695/attachment.html>

James Aylett

2016-Mar-06 01:47 UTC

head link

GSOC-2016 Project : Clustering of search results

On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:
> I am Richhiey Thomas, pursuing my third year of undergraduate studies in
> Computer Science from Mumbai University. I had gone through the project
> list for this year and the project idea based on clustering caught my
> attention. I spoke to Assem Chelli on IRC who guided me to the code and got
> me started.
Hi Richhiey! That's great, and it's good that IRC is working for
you. (People at some universities occasionally have trouble getting
onto IRC, unfortunately.)
> I started going through the code and have successfully built Xapian on my
> machine. If I am not mistaken, the currently implemented clustering branch
> has used heirarchial clustering which has quadratic complexity or higher,
> which naturally makes it very inefficient for large data sets.
> 
> A better approach to clustering would be to use K-means clustering or a
> variant (like bisecting K-means) which provides equal or better performance
> than the implemented heirarchial technique having lesser (linear)
> complexity, thus being more favourable to large datasets.
> 
> It would be great to have your opinions on how to go about this or any
> other approaches that may be more favorable for document clustering in the
> context of Xapian.
K-Means or something related certainly seems like a viable approach,
so what you'll need to do is to come up with a proposal of how you'd
implement this in Xapian (either with reference to the previous work,
or separately), and also how you'd go about evaluating the performance
of your implementation (both in terms of usefulness of the clustering,
and in terms of speed!).

Hopefully that's enough to get you started, but please do ask any
questions you have while working through this.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Richhiey Thomas

2016-Mar-06 19:48 UTC

head link

GSOC-2016 Project : Clustering of search results

On Sun, Mar 6, 2016 at 7:17 AM, James Aylett <james-xapian at
tartarus.org>
wrote:
> On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:
>
> K-Means or something related certainly seems like a viable approach,
> so what you'll need to do is to come up with a proposal of how
you'd
> implement this in Xapian (either with reference to the previous work,
> or separately), and also how you'd go about evaluating the performance
> of your implementation (both in terms of usefulness of the clustering,
> and in terms of speed!).
>
> Thanks for the reply James!I went through the code in a little more detail and there are a few things
I noticed and a few questions I have.

First off, the distance metric used in the current implementation is the
cosine measure. Though useful, K-means implicitly uses Euclidian distance
as a measure of document similarity between two document term vectors.
Hence, simply creating one more class for a distance metric by just
inheriting the DocSim base class will be good. Using the tf-idf weights, we
can find term weights and instead of using these vectors for cosine
similarity, euclid distance can be found out.

With a similarity measure in place, we can initialize the k centroids using
k-means++, an algorithm used for choosing the initial centroids in k-means,
to avoid poor clustering results. The distance between document vectors and
centroids can be found out and documents are added to clusters accordingly,
identified by their doc-id's. The new centroid is again found and this
process will continue till convergence.

https://en.wikipedia.org/wiki/K-means%2B%2B

I am slightly unaware about performance evaluation but the cluster quality
can be evaluated through F-measure and I guess we can check the running
times of both the implementations to check for usefulness in terms of
speed.

My questions are:
1) Can you direct me on how to convert this raw idea into a proposal in
context to Xapian with more detail? What areas do I focus on?
2) It would be great if you could elaborate a little on the performance
evaluation part that I haven't been able to follow too well.

Thanks! :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20160307/320f659f/attachment.html>

Xapian devel - Mar 2016 - GSOC-2016 Project : Clustering of search results

GSOC-2016 Project : Clustering of search results

GSOC-2016 Project : Clustering of search results

GSOC-2016 Project : Clustering of search results