thr3ads.net - Xapian devel - [Xapian-devel] GSoC 2014: Clustering of Search Results [Mar 2014]

If this information is useful, please help other people find it:
Share via:

Abhishek Gupta

2014-Mar-20 20:07 UTC

[Xapian-devel] GSoC 2014: Clustering of Search Results

Sir,

I am Abhishek Gupta. I know I am quite late for the project discussion
because I came to know about GSoC a bit lately but still I would like to
discuss this project which interests me a lot. I know I have to submit some
code so as to show my skill set but as the deadline is quite near I will
submit the patches or exercises after the deadline to strengthen my
application and show my coding skill.
I read your existing source code for the clustering which is quite slow
because of the hierarchical based clustering which is not required at all.*You
have already provided with the number of clusters you should have in the
end*. So for this we can employ K-means algorithm which can perform far
better than the current algorithm.

1) Hierarchical clustering have high memory requirements *O(n*n)* in
comparison to *O(n+K)* space complexity of K-means algorithm, where*n* is
the number of elements and* K *is the number of clusters.
2) Hierarchical clustering running time is *O(n*n*n)* on the other hand
K-means algorithm has time complexity of *O(n)*.
3) K-means improves the clustering iteratively, more you run the code more
better you will get the results.

One thing that K-means lacks is its non-deterministic outcome. Every time
it will produce different clusters. But we can always run the algorithm
10-12 times and then take the average even then it will perform far better
than the hierarchical one.

So I would like to propose this algorithm which can perform better than the
hierarchical one. After that to improve the clustering more we can also
implement K-medoids/K-means++ clustering methods.

I would you give some reviews regarding the proposal, so that I can submit
the proposal at time.

Thanks and Regards
Abhishek Gupta
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140321/7bda8301/attachment-0002.html>

Olly Betts

2014-Mar-22 01:59 UTC

head link

[Xapian-devel] GSoC 2014: Clustering of Search Results

On Fri, Mar 21, 2014 at 01:37:31AM +0530, Abhishek Gupta
wrote:> 1) Hierarchical clustering have high memory requirements *O(n*n)* in
> comparison to *O(n+K)* space complexity of K-means algorithm, where*n* is
> the number of elements and* K *is the number of clusters.
> 2) Hierarchical clustering running time is *O(n*n*n)* on the other hand
> K-means algorithm has time complexity of *O(n)*.
Are you sure about those complexities?

(Hint: at least one is wrong...)

Also, O() is the asymptotic behaviour, but what we really care about is
it taking well under a second to cluster a limited number of results.
Probably 10000 at most, and I suspect more like 1000.  Bear in mind
that:

    K2 * n * n < K1 * n  for small n

What is "small" depends on K1 and K2 of course.
> I read your existing source code for the clustering which is quite slow
> because of the hierarchical based clustering which is not required at
all.*You
> have already provided with the number of clusters you should have in the
> end*. So for this we can employ K-means algorithm which can perform far
> better than the current algorithm.
I'd say it is a flaw in the current API that you are expected to provide
the number of clusters you want in advance.  How can you know how many
natural clusters there will be?

The current algorithm is at least easily amenable to producing a
variable number of clusters (just make the condition to stop linking
related to the length the next link would need to be).
> One thing that K-means lacks is its non-deterministic outcome. Every time
> it will produce different clusters. But we can always run the algorithm
> 10-12 times and then take the average even then it will perform far better
> than the hierarchical one.
Having to run the algorithm repeatedly and then compare the outcomes is
going to rather work against fast clustering.  And what do you mean by
the "average" of 12 different partitions?

Non-deterministic isn't ideal, but is probably tolerable.  But somehow
picking the best of 12 non-deterministic results isn't magically going
to be deterministic anyway.

I'm not saying K-means isn't an option, but the case you've made for
it so far isn't particularly convincing.

Cheers,
    Olly

Xapian devel - Mar 2014 - GSoC 2014: Clustering of Search Results

[Xapian-devel] GSoC 2014: Clustering of Search Results

[Xapian-devel] GSoC 2014: Clustering of Search Results