On Sun, Mar 6, 2016 at 7:17 AM, James Aylett <james-xapian at tartarus.org> wrote:

> On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:
>
> K-Means or something related certainly seems like a viable approach,
> so what you'll need to do is to come up with a proposal of how you'd
> implement this in Xapian (either with reference to the previous work,
> or separately), and also how you'd go about evaluating the performance
> of your implementation (both in terms of usefulness of the clustering,
> and in terms of speed!).

Thanks for the reply, James! I went through the code in a little more detail, and there are a few things I noticed and a few questions I have.

First off, the distance metric used in the current implementation is the cosine measure. Though useful, it isn't what K-means assumes: K-means implicitly uses Euclidean distance as the measure of similarity between two document term vectors. Hence, simply creating one more distance-metric class that inherits from the DocSim base class should be enough. Using the tf-idf weights we can build term-weight vectors, and instead of feeding those vectors into the cosine similarity, compute the Euclidean distance between them.

With a similarity measure in place, we can initialize the k centroids using k-means++, an algorithm for choosing the initial centroids in k-means that avoids poor clustering results. The distance between each document vector and the centroids can then be computed, and documents (identified by their doc-ids) are added to the nearest cluster. The centroids are then recomputed, and this process repeats until convergence.

https://en.wikipedia.org/wiki/K-means%2B%2B

I am not very familiar with performance evaluation, but cluster quality can be evaluated through the F-measure, and I guess we can compare the running times of the two implementations to check for usefulness in terms of speed.

My questions are:
1) Can you direct me on how to convert this raw idea into a proposal in the context of Xapian, with more detail? What areas do I focus on?
2) It would be great if you could elaborate a little on the performance evaluation part, which I haven't been able to follow too well.

Thanks! :)
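The seeding and iteration steps described above can be sketched as follows. This is a toy illustration in Python operating on plain lists of term weights, not Xapian's actual DocSim classes or C++ API; all function names here are my own invention:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeanspp_init(vectors, k, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest centroid so far."""
    centroids = [list(rng.choice(vectors))]
    while len(centroids) < k:
        d2 = [min(euclidean(v, c) ** 2 for c in centroids) for v in vectors]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for v, w in zip(vectors, d2):
            acc += w
            if acc >= r:
                centroids.append(list(v))
                break
    return centroids

def kmeans(vectors, k, max_iters=100, seed=0):
    """Lloyd's algorithm with k-means++ seeding; returns one cluster id
    per input vector (mirroring documents identified by doc-id)."""
    rng = random.Random(seed)
    centroids = kmeanspp_init(vectors, k, rng)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each document joins its nearest centroid.
        new_assignment = [min(range(k), key=lambda j: euclidean(v, centroids[j]))
                          for v in vectors]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

An in-tree implementation would instead iterate over Xapian term vectors and plug the distance into the existing DocSim hierarchy, but the control flow is the same.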
On Mon, Mar 07, 2016 at 01:36:43AM +0530, Richhiey Thomas wrote:

> My questions are:
> 1) Can you direct me on how to convert this raw idea into a proposal in
> context to Xapian with more detail? What areas do I focus on?

Our GSoC guide has an application template <https://trac.xapian.org/wiki/GSoCApplicationTemplate> which you should use to structure your proposal. It has some recommendations on how you should lay out and think about a proposal, particularly in terms of planning your timeline, which in the past has proven one of the keys to a successful project.

> 2) It would be great if you could elaborate a little on the performance
> evaluation part that I haven't been able to follow too well.

'Performance' is a tricky word in information retrieval, so I'll break this into two pieces: speed and quality.

As the project notes say, the previous attempt at implementing clustering was far too slow to be practically useful. So that's the speed side: we want to be able to cluster a fairly large number of documents quickly (which would need some thought -- do we want to be able to cluster 1000 documents in under a second? 10,000 in a handful of seconds? or might 1000 documents in a handful of seconds be sufficient?).

Quality can be judged in a number of ways, but we're generally trying to produce 'good' clusters, of the kind a human with knowledge of the subject area would create. There's some discussion of how you might evaluate this in Introduction to Information Retrieval, section 16.3, p356 (or online at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html).

It's perhaps worth pointing out that Hearst (2009, p200) suggests that monothetic clusters 'may be easier for users to understand', although the book doesn't cite any specific work to back up this claim. That may argue that a K-means based approach isn't necessarily going to be the most helpful in all cases; there may be other approaches worth considering instead.

(That entire section on clustering is worth reading if you have access to the book.)

Hearst, MA (2009) 'Search User Interfaces', CUP, Cambridge.

J

--
James Aylett, occasional trouble-maker
xapian.org
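On the quality side, one of the measures discussed in that IIR section, purity, is simple enough to sketch. This is a toy Python illustration (my own naming, not Xapian code), assuming you have a gold-standard class label for each document alongside its assigned cluster id:

```python
from collections import Counter

def purity(assignments, labels):
    """Purity (IIR section 16.3): give each cluster its majority gold-class
    label, then return the fraction of documents matching that label."""
    clusters = {}
    for cluster_id, label in zip(assignments, labels):
        clusters.setdefault(cluster_id, []).append(label)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(labels)
```

A purity of 1.0 means every cluster is homogeneous; note that purity is trivially maximised by putting each document in its own cluster, which is why the same IIR section also covers NMI, the Rand index, and the F-measure.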