I was not sharing it on maling list because i thought that someone can use all ideas i proposed in their GSOC proposal. Surely i will contribute to xapian project. sorry if that was against the rules The algorithm is not developed by me but after having much research on various clustering techniques. I found that there is a new algorithm called CLUBS(Clustering Using Binary Splitting) which gives better results than kmeans++ and hierarchical agglomerative clustering. It is faster and produces good results based on various metrics of cluster quality. the algorithm works in following way The first phase of the algorithm is divisive, as the original data set(in this case, set of search documents to cluster) is split recursively into miniclusters through successive binary splits: the algorithm?s second phase is agglomerative since these miniclusters are recombined into the final result. Further, the approach induces during execution a dynamic hierarchical grid that will better fit the dataset w.r.t. classical grid approaches that exploit a fixed grid instead. Finally, the algorithm exploits the analytical properties of the Quadratic Sums of Squares (SSQ in the following) function to minimize the cost of merge and split operations. you can refer the paper for more details about time complexity and performance http://web.cs.ucla.edu/~zaniolo/papers/chp%253A10.1007%252F978-3-642-37456-2_10.pdf for implementing it,we can use Documentsource class in our previous clustering approach and create a binary tree and perform and topdownsplitting and then bottomup merging. First we have to implement feature extraction from text document(TFIDF would be a good choice) which is implemented in xapian weighting schemes. Then we will implement function to compute distances between documents based on normalized TF-IDF Matrix. Based on distances we will initially assign cluster and improve on it using topdownsplitting and then bottomup merging. We will compute SSQ of a cluster from all points in a cluster. and then Improve our clustering based on SSQ measures. The functions for topdownsplitting will include computeAverageDeltaSSQ to average the SSQ of all points in cluster. The functions for bottomupmerging will include computeweightedSSQ. . Please Give your valuable suggestions. Have a Nice Day. Regards, Nirmal Singhania B.tech III Yr On Wed, Mar 9, 2016 at 10:33 PM, James Aylett <james-xapian at tartarus.org> wrote:> On Wed, Mar 09, 2016 at 10:27:38AM +0530, nirmal singhania wrote: > > > based on my some time of Research,I have in mind a clustering algorithm > > that can overcome Quality issues of K-means(and its variants) and Speed > > Issues of Hierarchical Agglomerative Clustering. > > Theoretically it can work O(n) and Can produce results better than HAC > > based on various metrics. > > I can't discuss it on mailing-list but you say we can discuss more about > it > > and its implementation in xapian in PM. > > Hi, Nirmal -- welcome to Xapian! What do you mean when you say that > you can't discuss it on the mailing list? If you can't discuss it in > our public forums, then you aren't going to be able to contribute that > work to Xapian, are you? (Since we'll need it documented clearly, > which will likely include useful details of that discussion.) > > J > > -- > James Aylett, occasional trouble-maker > xapian.org >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160310/5ac9af7f/attachment-0001.html>
On Thu, Mar 10, 2016 at 05:47:29AM +0530, nirmal singhania wrote:> I was not sharing it on maling list because i thought that someone > can use all ideas i proposed in their GSOC proposal.It's usually pretty obvious to us if someone has copied parts of someone else's proposal.> The algorithm is not developed by me but after having much research > on various clustering techniques. I found that there is a new > algorithm called CLUBS(Clustering Using Binary Splitting) which > gives better results than kmeans++ and hierarchical agglomerative > clustering. It is faster and produces good results based on various > metrics of cluster quality.I've only skimmed the paper for now, but it certainly looks interesting. Do you have a reason for picking TFIDF for feature extraction? Are there other approaches that might make sense? You may want to include in your project proposal how you intend to evaluate the speed and accuracy of the final clustering system. It sounds like you have a good handle on how you're going to go about implementing CLUBS in Xapian. Having a detailed plan in your proposal is a good way of demonstrating that you've thought through the practical aspect of adding a feature. When writing up your proposed timeline, remember to break things into small pieces -- no more than a week at most, and you'll probably find some that come in shorter than that. J -- James Aylett, occasional trouble-maker xapian.org
Tf-idf is most used used weighting scheme is easy to understand and has been used in other frameworks like lucene and many other places. okapi bm25(implemented in xapian) is theoretically better/improved measure than tf-idf and i am looking into various other weighting scheme which are there in xapian or can be implemented like TF-ICF(term frequecy inverse corpus frequency),TF-RF(term frequency-relevance frequency) for evaluating the speed and accuracy of final clustering system we can benchmark it against various other algos like k-means,HAC based on the measures mentioned in previous mail.(purity,F-measure,Entropy,F-Measure,Overall Similarity,Relative Margin,Variance Ratio) Please give your suggestions Have a Nice day Regards, Nirmal Singhania B.tech III Yr On Thu, Mar 10, 2016 at 5:46 PM, James Aylett <james-xapian at tartarus.org> wrote:> On Thu, Mar 10, 2016 at 05:47:29AM +0530, nirmal singhania wrote: > > > I was not sharing it on maling list because i thought that someone > > can use all ideas i proposed in their GSOC proposal. > > It's usually pretty obvious to us if someone has copied parts of > someone else's proposal. > > > The algorithm is not developed by me but after having much research > > on various clustering techniques. I found that there is a new > > algorithm called CLUBS(Clustering Using Binary Splitting) which > > gives better results than kmeans++ and hierarchical agglomerative > > clustering. It is faster and produces good results based on various > > metrics of cluster quality. > > I've only skimmed the paper for now, but it certainly looks > interesting. Do you have a reason for picking TFIDF for feature > extraction? Are there other approaches that might make sense? You may > want to include in your project proposal how you intend to evaluate > the speed and accuracy of the final clustering system. > > It sounds like you have a good handle on how you're going to go about > implementing CLUBS in Xapian. Having a detailed plan in your proposal > is a good way of demonstrating that you've thought through the > practical aspect of adding a feature. When writing up your proposed > timeline, remember to break things into small pieces -- no more than a > week at most, and you'll probably find some that come in shorter than > that. > > J > > -- > James Aylett, occasional trouble-maker > xapian.org >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160311/eb39a175/attachment.html>