Shreedhar Pawar
2014-Mar-26 17:32 UTC
[Xapian-devel] Contribute to the clustering project.
Hi everyone,

I was not able to submit my proposal due to a last-minute error, but I would still love to contribute to the clustering project. I had proposed a scalable and efficient hierarchical clustering method which is parallelizable on multi-core CPUs and clusters up to 150,000 documents in about 2 seconds on a dual-core CPU. The details of the algorithm are below; please let me know if any of this would help.

The present clustering implementation in the Xapian library, which is single-linkage hierarchical clustering, does not address the problem of larger datasets with limited memory resources. As the dataset size increases, it does not scale well in either memory requirements or running time. The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm [1] is a scalable and efficient clustering algorithm that uses an in-memory data structure called a CF-tree (Clustering Feature tree), inspired by the B+ tree and the red-black tree. The algorithm is easily parallelizable, as described in [2], and gives further speed-ups on multi-core CPUs: [2] reports clustering 150,000 documents in just 1.9 seconds on a dual-core CPU.

BIRCH algorithm: Based on the semantic similarity of the documents, a distance metric is computed between each pair of documents. BIRCH works by building a CF-tree, a tree containing the entries of all the sub-clusters. Only the initial scan, which forms the sub-clusters, involves I/O; dense data points are clustered together while sparse ones are kept out. All later operations are in-memory operations, which makes BIRCH faster than other hierarchical clustering algorithms in terms of both memory operations and execution time. As data keeps arriving, the number of sub-clusters grows. The complexity of this algorithm is O(n^2).

CF-tree: Each entry of the CF-tree is a triple CF = (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points (the sum of the Xi) and SS is their square sum (the sum of the Xi^2). The CF-tree is a height-balanced tree with a branching factor B for non-leaf nodes and L for leaf nodes. Depending on the page size (decided by the available memory), a threshold T is chosen, and no sub-cluster is allowed to grow beyond it. To keep the tree balanced and allow efficient scans, each non-leaf node contains at most B entries of the form {CFi, childi}, i = 1, 2, ..., B, where childi is a pointer to its i-th child node and CFi is the CF of the sub-cluster represented by that child (comparable to how nodes are inserted in a B+ tree and kept balanced as in a red-black tree). A leaf node contains at most L entries, each of which is a CF. In addition, each leaf node has prev and next pointers chaining it to its siblings for efficient scans. The distance measures defined on CF entries (average inter-cluster distance, average intra-cluster distance and variance increase distance) are given in [1].
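To make the CF bookkeeping concrete, here is a minimal C++ sketch of what a CF entry could look like. The struct name ClusterFeature and the add/merge/centroid helpers are just illustrative assumptions of mine, not existing Xapian API:

    // Rough sketch of a Clustering Feature entry: CF = (N, LS, SS).
    // Names and layout are illustrative only, not existing Xapian code.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct ClusterFeature {
        std::size_t n = 0;        // N: number of data points summarised
        std::vector<double> ls;   // LS: per-dimension linear sum of the points
        double ss = 0.0;          // SS: square sum (sum of squared components)

        explicit ClusterFeature(std::size_t dims) : ls(dims, 0.0) {}

        // Absorb one data point into this sub-cluster.
        void add(const std::vector<double>& x) {
            ++n;
            for (std::size_t i = 0; i < ls.size(); ++i) {
                ls[i] += x[i];
                ss += x[i] * x[i];
            }
        }

        // CF additivity: merging two sub-clusters just adds their entries.
        void merge(const ClusterFeature& other) {
            n += other.n;
            for (std::size_t i = 0; i < ls.size(); ++i) ls[i] += other.ls[i];
            ss += other.ss;
        }

        // Centroid = LS / N, usable for the distance metrics defined in [1].
        std::vector<double> centroid() const {
            std::vector<double> c(ls.size());
            for (std::size_t i = 0; i < ls.size(); ++i)
                c[i] = ls[i] / static_cast<double>(n);
            return c;
        }
    };

    int main() {
        // Toy usage: summarise two 2-D points in one CF entry.
        ClusterFeature cf(2);
        cf.add({1.0, 2.0});
        cf.add({3.0, 4.0});
        std::vector<double> c = cf.centroid();   // (2, 3)
        std::printf("N=%zu centroid=(%g, %g) SS=%g\n", cf.n, c[0], c[1], cf.ss);
        return 0;
    }

The useful property is that CF entries are additive: merging two sub-clusters only sums their N, LS and SS, which is what keeps the threshold-driven rebuilds and the parallel merging in PBIRCH cheap.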
Parallel BIRCH: The parallel BIRCH algorithm is based on the Single Program Multiple Data (SPMD) model using message passing. The leaf nodes formed in the CF-tree can be distributed among the processors, and the processors then communicate with each other to share the inter-cluster distances. This model works well because the tree has far fewer leaf nodes than the sub-clusters they contain, so very little information has to be shared and the performance of the algorithm is not hampered. The algorithm terminates when the change in the clusters between consecutive iterations becomes smaller than some specified threshold.

The main steps of the PBIRCH algorithm are:

1. Data distribution: Initially the data is divided equally among all processors. When new data items arrive, they are distributed cyclically to all the processors.
2. Computing the CF-trees concurrently: Each processor generates an initial in-memory CF-tree from its local data. If a tree runs out of memory at any point, it is rebuilt with an increased threshold.
3. Clustering the leaf nodes: A parallel clustering algorithm is applied to the CFs stored at the leaf nodes.

Parallel programming frameworks like OpenMP or OpenCL can be used to program the CPU hardware.

References:
[1] BIRCH: A New Data Clustering Algorithm and Its Applications.
[2] PBIRCH: A Scalable Parallel Clustering Algorithm for Incremental Data.

I know this is an entire algorithm to implement, but maybe some parts of it will be useful in the project you are working on.

Cheers,
Shreedhar.