On Sun, Mar 6, 2016 at 7:17 AM, James Aylett <james-xapian at tartarus.org> wrote:

> On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:
>
> K-Means or something related certainly seems like a viable approach,
> so what you'll need to do is to come up with a proposal of how you'd
> implement this in Xapian (either with reference to the previous work,
> or separately), and also how you'd go about evaluating the performance
> of your implementation (both in terms of usefulness of the clustering,
> and in terms of speed!).

Thanks for the reply, James! I went through the code in a little more detail, and there are a few things I noticed and a few questions I have.

First off, the distance metric used in the current implementation is the cosine measure. Though useful, it isn't what K-means assumes: K-means implicitly uses Euclidean distance as the measure of similarity between two document term vectors. Hence, simply creating one more distance-metric class that inherits from the DocSim base class should be enough. Using the tf-idf weights we can build term-weight vectors, and instead of feeding those vectors into the cosine similarity, compute the Euclidean distance between them.

With a similarity measure in place, we can initialize the k centroids using k-means++, an algorithm for choosing the initial centroids in k-means that avoids poor clustering results. The distance between each document vector and the centroids can then be computed, and documents (identified by their doc-ids) are added to the nearest cluster. The centroids are then recomputed, and this process repeats until convergence.

https://en.wikipedia.org/wiki/K-means%2B%2B

I am not very familiar with performance evaluation, but cluster quality can be evaluated through the F-measure, and I guess we can compare the running times of the two implementations to check for usefulness in terms of speed.

My questions are:
1) Can you direct me on how to convert this raw idea into a proposal in the context of Xapian, with more detail? What areas do I focus on?
2) It would be great if you could elaborate a little on the performance evaluation part, which I haven't been able to follow too well.

Thanks! :)
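The seeding and iteration steps described above can be sketched as follows. This is a toy illustration in Python operating on plain lists of term weights, not Xapian's actual DocSim classes or C++ API; all function names here are my own invention:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeanspp_init(vectors, k, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest centroid so far."""
    centroids = [list(rng.choice(vectors))]
    while len(centroids) < k:
        d2 = [min(euclidean(v, c) ** 2 for c in centroids) for v in vectors]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for v, w in zip(vectors, d2):
            acc += w
            if acc >= r:
                centroids.append(list(v))
                break
    return centroids

def kmeans(vectors, k, max_iters=100, seed=0):
    """Lloyd's algorithm with k-means++ seeding; returns one cluster id
    per input vector (mirroring documents identified by doc-id)."""
    rng = random.Random(seed)
    centroids = kmeanspp_init(vectors, k, rng)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each document joins its nearest centroid.
        new_assignment = [min(range(k), key=lambda j: euclidean(v, centroids[j]))
                          for v in vectors]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

An in-tree implementation would instead iterate over Xapian term vectors and plug the distance into the existing DocSim hierarchy, but the control flow is the same.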
On Mon, Mar 07, 2016 at 01:36:43AM +0530, Richhiey Thomas wrote:

> My questions are:
> 1) Can you direct me on how to convert this raw idea into a proposal in
> context to Xapian with more detail? What areas do I focus on?

Our GSoC guide has an application template <https://trac.xapian.org/wiki/GSoCApplicationTemplate> which you should use to structure your proposal. It has some recommendations on how you should lay out and think about a proposal, particularly in terms of planning your timeline, which in the past has proven one of the keys to a successful project.

> 2) It would be great if you could elaborate a little on the performance
> evaluation part that I haven't been able to follow too well.

'Performance' is a tricky word in information retrieval, so I'll break this into two pieces: speed and quality.

As the project notes say, the previous attempt at implementing clustering was far too slow to be practically useful. So that's the speed side: we want to be able to cluster a fairly large number of documents quickly (which would need some thought -- do we want to be able to cluster 1000 documents in under a second? 10,000 in a handful of seconds? or might 1000 documents in a handful of seconds be sufficient?).

Quality can be judged in a number of ways, but we're generally trying to produce 'good' clusters, of the kind a human with knowledge of the subject area would create. There's some discussion of how you might evaluate this in Introduction to Information Retrieval, section 16.3, p356 (or online at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html).

It's perhaps worth pointing out that Hearst (2009, p200) suggests that monothetic clusters 'may be easier for users to understand', although the book doesn't cite any specific work to back up this claim. That may argue that a K-means based approach isn't necessarily going to be the most helpful in all cases; there may be other approaches worth considering instead.

(That entire section on clustering is worth reading if you have access to the book.)

Hearst, MA (2009) 'Search User Interfaces', CUP, Cambridge.

J

--
James Aylett, occasional trouble-maker
xapian.org
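On the quality side, one of the measures discussed in that IIR section, purity, is simple enough to sketch. This is a toy Python illustration (my own naming, not Xapian code), assuming you have a gold-standard class label for each document alongside its assigned cluster id:

```python
from collections import Counter

def purity(assignments, labels):
    """Purity (IIR section 16.3): give each cluster its majority gold-class
    label, then return the fraction of documents matching that label."""
    clusters = {}
    for cluster_id, label in zip(assignments, labels):
        clusters.setdefault(cluster_id, []).append(label)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(labels)
```

A purity of 1.0 means every cluster is homogeneous; note that purity is trivially maximised by putting each document in its own cluster, which is why the same IIR section also covers NMI, the Rand index, and the F-measure.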