Hello devs, I am Richhiey Thomas, pursuing my third year of undergraduate studies in Computer Science from Mumbai University. I had gone through the project list for this year and the project idea based on clustering caught my attention. I spoke to Assem Chelli on IRC who guided me to the code and got me started. I started going through the code and have successfully built Xapian on my machine. If I am not mistaken, the currently implemented clustering branch has used heirarchial clustering which has quadratic complexity or higher, which naturally makes it very inefficient for large data sets. A better approach to clustering would be to use K-means clustering or a variant (like bisecting K-means) which provides equal or better performance than the implemented heirarchial technique having lesser (linear) complexity, thus being more favourable to large datasets. It would be great to have your opinions on how to go about this or any other approaches that may be more favorable for document clustering in the context of Xapian. Thanks! :) -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160305/cd3c8695/attachment.html>
On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:> I am Richhiey Thomas, pursuing my third year of undergraduate studies in > Computer Science from Mumbai University. I had gone through the project > list for this year and the project idea based on clustering caught my > attention. I spoke to Assem Chelli on IRC who guided me to the code and got > me started.Hi Richhiey! That's great, and it's good that IRC is working for you. (People at some universities occasionally have trouble getting onto IRC, unfortunately.)> I started going through the code and have successfully built Xapian on my > machine. If I am not mistaken, the currently implemented clustering branch > has used heirarchial clustering which has quadratic complexity or higher, > which naturally makes it very inefficient for large data sets. > > A better approach to clustering would be to use K-means clustering or a > variant (like bisecting K-means) which provides equal or better performance > than the implemented heirarchial technique having lesser (linear) > complexity, thus being more favourable to large datasets. > > It would be great to have your opinions on how to go about this or any > other approaches that may be more favorable for document clustering in the > context of Xapian.K-Means or something related certainly seems like a viable approach, so what you'll need to do is to come up with a proposal of how you'd implement this in Xapian (either with reference to the previous work, or separately), and also how you'd go about evaluating the performance of your implementation (both in terms of usefulness of the clustering, and in terms of speed!). Hopefully that's enough to get you started, but please do ask any questions you have while working through this. J -- James Aylett, occasional trouble-maker xapian.org
On Sun, Mar 6, 2016 at 7:17 AM, James Aylett <james-xapian at tartarus.org> wrote:> On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote: > > K-Means or something related certainly seems like a viable approach, > so what you'll need to do is to come up with a proposal of how you'd > implement this in Xapian (either with reference to the previous work, > or separately), and also how you'd go about evaluating the performance > of your implementation (both in terms of usefulness of the clustering, > and in terms of speed!). > > Thanks for the reply James!I went through the code in a little more detail and there are a few things I noticed and a few questions I have. First off, the distance metric used in the current implementation is the cosine measure. Though useful, K-means implicitly uses Euclidian distance as a measure of document similarity between two document term vectors. Hence, simply creating one more class for a distance metric by just inheriting the DocSim base class will be good. Using the tf-idf weights, we can find term weights and instead of using these vectors for cosine similarity, euclid distance can be found out. With a similarity measure in place, we can initialize the k centroids using k-means++, an algorithm used for choosing the initial centroids in k-means, to avoid poor clustering results. The distance between document vectors and centroids can be found out and documents are added to clusters accordingly, identified by their doc-id's. The new centroid is again found and this process will continue till convergence. https://en.wikipedia.org/wiki/K-means%2B%2B I am slightly unaware about performance evaluation but the cluster quality can be evaluated through F-measure and I guess we can check the running times of both the implementations to check for usefulness in terms of speed. My questions are: 1) Can you direct me on how to convert this raw idea into a proposal in context to Xapian with more detail? What areas do I focus on? 2) It would be great if you could elaborate a little on the performance evaluation part that I haven't been able to follow too well. Thanks! :) -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160307/320f659f/attachment.html>