Abhishek Gupta
2014-Mar-20 20:07 UTC
[Xapian-devel] GSoC 2014: Clustering of Search Results
Sir, I am Abhishek Gupta. I know I am quite late for the project discussion because I came to know about GSoC a bit lately but still I would like to discuss this project which interests me a lot. I know I have to submit some code so as to show my skill set but as the deadline is quite near I will submit the patches or exercises after the deadline to strengthen my application and show my coding skill. I read your existing source code for the clustering which is quite slow because of the hierarchical based clustering which is not required at all.*You have already provided with the number of clusters you should have in the end*. So for this we can employ K-means algorithm which can perform far better than the current algorithm. 1) Hierarchical clustering have high memory requirements *O(n*n)* in comparison to *O(n+K)* space complexity of K-means algorithm, where*n* is the number of elements and* K *is the number of clusters. 2) Hierarchical clustering running time is *O(n*n*n)* on the other hand K-means algorithm has time complexity of *O(n)*. 3) K-means improves the clustering iteratively, more you run the code more better you will get the results. One thing that K-means lacks is its non-deterministic outcome. Every time it will produce different clusters. But we can always run the algorithm 10-12 times and then take the average even then it will perform far better than the hierarchical one. So I would like to propose this algorithm which can perform better than the hierarchical one. After that to improve the clustering more we can also implement K-medoids/K-means++ clustering methods. I would you give some reviews regarding the proposal, so that I can submit the proposal at time. Thanks and Regards Abhishek Gupta -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140321/7bda8301/attachment-0002.html>
On Fri, Mar 21, 2014 at 01:37:31AM +0530, Abhishek Gupta wrote:> 1) Hierarchical clustering have high memory requirements *O(n*n)* in > comparison to *O(n+K)* space complexity of K-means algorithm, where*n* is > the number of elements and* K *is the number of clusters. > 2) Hierarchical clustering running time is *O(n*n*n)* on the other hand > K-means algorithm has time complexity of *O(n)*.Are you sure about those complexities? (Hint: at least one is wrong...) Also, O() is the asymptotic behaviour, but what we really care about is it taking well under a second to cluster a limited number of results. Probably 10000 at most, and I suspect more like 1000. Bear in mind that: K2 * n * n < K1 * n for small n What is "small" depends on K1 and K2 of course.> I read your existing source code for the clustering which is quite slow > because of the hierarchical based clustering which is not required at all.*You > have already provided with the number of clusters you should have in the > end*. So for this we can employ K-means algorithm which can perform far > better than the current algorithm.I'd say it is a flaw in the current API that you are expected to provide the number of clusters you want in advance. How can you know how many natural clusters there will be? The current algorithm is at least easily amenable to producing a variable number of clusters (just make the condition to stop linking related to the length the next link would need to be).> One thing that K-means lacks is its non-deterministic outcome. Every time > it will produce different clusters. But we can always run the algorithm > 10-12 times and then take the average even then it will perform far better > than the hierarchical one.Having to run the algorithm repeatedly and then compare the outcomes is going to rather work against fast clustering. And what do you mean by the "average" of 12 different partitions? Non-deterministic isn't ideal, but is probably tolerable. But somehow picking the best of 12 non-deterministic results isn't magically going to be deterministic anyway. I'm not saying K-means isn't an option, but the case you've made for it so far isn't particularly convincing. Cheers, Olly