Hello all, I am Nirmal Singhania from NIIT University, India. I am interested in the "Clustering of search results" topic.

I have been in the field of practical machine learning and information retrieval for quite some time. I took various courses/MOOCs on information retrieval and text mining and have been working on real-life datasets (KDD99, AWID, MovieLens), because the problems you face in real-life ML/IR scenarios are different from what is taught in theory. I am also doing R&D on "Hybrid Techniques for Intrusion Detection using Data Mining and Clustering on Newer Datasets".

Taking an initial look at the docsim folder in xapian-core, these are my insights.

The clustering used is single-link agglomerative hierarchical clustering. Its time complexity is O(n^2), where n is the number of documents.

At first, choosing K-means seems a viable solution, as K-means has O(n) time complexity. But it has various shortcomings:
1) The learning algorithm requires a priori specification of the number of cluster centres.
2) Different initial partitions can result in different final clusters.
3) It does not work well with clusters of different sizes and different densities.

After that we can think of K-means++. The k-means++ algorithm addresses the first of these obstacles by specifying a procedure to initialize the cluster centres before proceeding with the standard k-means optimization iterations, but it is a little slow due to the cluster initialization.

Then we can think of bisecting K-means, which is better than K-means, but the bisecting K-means algorithm is a divisive hierarchical clustering algorithm. It is a little faster than the original K-means, but the results of clustering are poorer than hierarchical agglomerative clustering based on various metrics of cluster quality such as entropy, F-measure, overall similarity, relative margin and variance ratio.
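To make the k-means++ seeding step concrete, here is a rough Python sketch (illustrative only, not Xapian code; the function names are mine): each new centre is drawn with probability proportional to its squared distance from the nearest centre already chosen, which is the extra work that makes initialization slower than plain random seeding.

```python
import random

def kmeans_pp_init(points, k, dist):
    """k-means++ seeding: pick the first centre uniformly at random,
    then pick each later centre with probability proportional to its
    squared distance from the nearest centre chosen so far."""
    centres = [random.choice(points)]
    while len(centres) < k:
        # Squared distance from each point to its nearest current centre.
        d2 = [min(dist(p, c) ** 2 for c in centres) for p in points]
        total = sum(d2)
        r = random.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(p)
                break
    return centres

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

The standard k-means iterations then start from these centres instead of arbitrary ones.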
Based on my research so far, I have in mind a clustering algorithm that can overcome the quality issues of K-means (and its variants) and the speed issues of hierarchical agglomerative clustering. Theoretically it can run in O(n) and can produce results better than HAC based on various metrics. I can't discuss it on the mailing list, but if you agree we can discuss more about it and its implementation in Xapian in PM.

Thank you for your time.

Regards,
Nirmal Singhania
B.Tech III Yr
And yes, the similarity measure for document similarity is cosine similarity. For the algorithm I proposed in the trailing mail, I would have to implement a Euclidean distance similarity measure and tweak it to make it work well with the algorithm. Waiting for your suggestions.

Regards,
Nirmal Singhania
B.Tech III Yr

On Wed, Mar 9, 2016 at 10:27 AM, nirmal singhania <nirmal.singhania at st.niituniversity.in> wrote:
> Hello All, I am Nirmal Singhania from NIIT University, India.
> I am interested in the Clustering of search results topic.
> [...]
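The switch between the two measures is smaller than it may look. For L2-normalised vectors (which TF-IDF document vectors usually are), squared Euclidean distance and cosine similarity are related by the identity ||a - b||^2 = 2 * (1 - cos(a, b)), so they induce the same nearest-neighbour ordering. A small illustrative Python sketch (my own helper names, not Xapian API):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# For unit-length vectors, ||a - b||^2 == 2 * (1 - cos(a, b)),
# so ranking by Euclidean distance matches ranking by cosine similarity.
a = normalize([1.0, 2.0, 0.0])
b = normalize([2.0, 1.0, 1.0])
lhs = euclidean_distance(a, b) ** 2
rhs = 2 * (1 - cosine_similarity(a, b))
```

So the "tweak" is mostly about normalising the vectors before clustering.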
On Wed, Mar 09, 2016 at 10:27:38AM +0530, nirmal singhania wrote:
> based on my some time of Research, I have in mind a clustering algorithm
> that can overcome Quality issues of K-means (and its variants) and Speed
> Issues of Hierarchical Agglomerative Clustering.
> Theoretically it can work O(n) and Can produce results better than HAC
> based on various metrics.
> I can't discuss it on mailing-list but you say we can discuss more about it
> and its implementation in xapian in PM.

Hi, Nirmal -- welcome to Xapian! What do you mean when you say that you can't discuss it on the mailing list? If you can't discuss it in our public forums, then you aren't going to be able to contribute that work to Xapian, are you? (Since we'll need it documented clearly, which will likely include useful details of that discussion.)

J

--
James Aylett, occasional trouble-maker
xapian.org
I was not sharing it on the mailing list because I thought that someone could use all the ideas I proposed in their GSoC proposal. I will certainly contribute to the Xapian project; sorry if that was against the rules.

The algorithm was not developed by me. After much research on various clustering techniques, I found that there is a new algorithm called CLUBS (Clustering Using Binary Splitting) which gives better results than K-means++ and hierarchical agglomerative clustering. It is faster and produces good results based on various metrics of cluster quality.

The algorithm works in the following way. The first phase is divisive, as the original data set (in this case, the set of search documents to cluster) is split recursively into miniclusters through successive binary splits; the second phase is agglomerative, since these miniclusters are recombined into the final result. Further, the approach induces during execution a dynamic hierarchical grid that fits the dataset better than classical grid approaches, which use a fixed grid instead. Finally, the algorithm exploits the analytical properties of the quadratic sum of squares (SSQ in the following) function to minimize the cost of merge and split operations.

You can refer to the paper for more details about time complexity and performance:
http://web.cs.ucla.edu/~zaniolo/papers/chp%253A10.1007%252F978-3-642-37456-2_10.pdf

For implementing it, we can use the DocumentSource class from our previous clustering approach, create a binary tree, and perform top-down splitting and then bottom-up merging. First we have to implement feature extraction from the text documents (TF-IDF would be a good choice), which is implemented in Xapian's weighting schemes. Then we will implement a function to compute distances between documents based on the normalized TF-IDF matrix. Based on those distances we will make an initial cluster assignment and improve on it using top-down splitting and then bottom-up merging.
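As a toy illustration of the feature-extraction step, here is a plain-Python sketch of building L2-normalised TF-IDF vectors from tokenised documents (in the real implementation these statistics would come from Xapian's weighting schemes; the function name is mine):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build L2-normalised TF-IDF row vectors for a list of token lists.
    Toy sketch of the feature extraction step only."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        rows.append([x / norm for x in vec])  # unit-length row
    return vocab, rows
```

The distance function for the splitting/merging phases would then operate on these rows.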
We will compute the SSQ of a cluster from all the points in the cluster, and then improve our clustering based on SSQ measures. The functions for top-down splitting will include computeAverageDeltaSSQ to average the SSQ of all points in a cluster. The functions for bottom-up merging will include computeweightedSSQ.

Please give your valuable suggestions. Have a nice day.

Regards,
Nirmal Singhania
B.Tech III Yr

On Wed, Mar 9, 2016 at 10:33 PM, James Aylett <james-xapian at tartarus.org> wrote:
> Hi, Nirmal -- welcome to Xapian! What do you mean when you say that
> you can't discuss it on the mailing list? [...]
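For anyone unfamiliar with the SSQ quantities mentioned above: the SSQ of a cluster is the sum of squared distances of its points to their centroid, and the cost of merging two clusters decomposes analytically, which is what makes the merge/split bookkeeping cheap. A minimal Python sketch of the underlying quantities (the computeAverageDeltaSSQ/computeweightedSSQ names above are from my proposal; these helpers are just illustrative):

```python
def centroid(points):
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def ssq(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - y) ** 2 for x, y in zip(p, c)) for p in points)

def merge_delta_ssq(a, b):
    """Increase in SSQ caused by merging clusters a and b:
    SSQ(a+b) = SSQ(a) + SSQ(b) + |a||b|/(|a|+|b|) * ||c_a - c_b||^2."""
    ca, cb = centroid(a), centroid(b)
    na, nb = len(a), len(b)
    return (na * nb / (na + nb)) * sum((x - y) ** 2 for x, y in zip(ca, cb))
```

The decomposition means the merge cost needs only cluster sizes and centroids, not a pass over all the points.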