I am still looking for a faster clustering technique for search results. One idea that occurred to me is to use Latent Semantic Indexing (LSI) to cluster the search results, which may give better results. With it we would not need to iterate over different values of 'k' (as in the K-means algorithm); instead we could cluster the whole result set in one go.

How would Latent Semantic Indexing help?

In LSI we project the query (treated as a pseudo-document) onto the term-document vector space and, based on some threshold, find the relevant documents. Similarly, to use LSI for clustering we would take one of the search results in place of the query, set different thresholds, and cluster the search results around it in a single shot based on each threshold.

I am not sure this technique will actually help, so I first need to test the algorithm. Please help me figure this out.

Murtuza Bohra
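For reference, the query projection described above is usually written as follows in the textbook LSI formulation. This is only a sketch of the standard setup, not anything specific to Xapian; the symbols A, k, q, d_j and the threshold tau are notation assumed here for illustration.

```latex
% Rank-k truncated SVD of the term-document matrix A (terms x documents)
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}

% Fold a query q (a term vector) into the k-dimensional latent space
\hat{q} = \Sigma_k^{-1} U_k^{\mathsf{T}} q

% Document j in the same space; for indexed documents this is row j of V_k
\hat{d}_j = \Sigma_k^{-1} U_k^{\mathsf{T}} d_j

% Document j counts as relevant when its cosine similarity reaches a threshold \tau
\cos(\hat{q}, \hat{d}_j)
  = \frac{\hat{q}^{\mathsf{T}} \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
  \ge \tau
```

Using a search result in place of the query, as proposed, would amount to substituting one of the d_j for q in the second line.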
On Tue, Mar 22, 2016 at 02:08:23PM +0530, MURTUZA BOHRA wrote:
> I am still trying to find some faster clustering technique for search
> result. One technique which strike to me is, using the Latent Semantic
> Indexing for Clustering the search result can give better results. In which
> we don't even need to iterate over different values of 'k'(in K-means
> algorithm) to cluster documents rather we can cluster whole search result
> in one go.
>
> I am not sure this technique would be 100% helpful, that's why I
> need to first test this algorithm, please help me to figure this
> out.

Are you suggesting writing an implementation quickly and seeing what happens with some real queries? Because that does sound like a good way of deciding whether the algorithm is useful in our case, but you don't have long -- about three days -- until your proposal needs to be in, so I don't know if you have enough time to do that and write everything up.

Alternatively, if you have an algorithm you are confident will be fast enough and provide some useful clustering, then you could implement that, and as a stretch goal in the project extend the system to allow other algorithms to be used, and look then at implementing LSI. (If you've delivered something earlier in the project, it'd be fine to do something a bit more speculative later.)

J

-- 
James Aylett, occasional trouble-maker
xapian.org
On Tue, Mar 22, 2016 at 02:08:23PM +0530, MURTUZA BOHRA wrote:
> How Latent semantic indexing would help?
>
> In LSI we project query (considering as a pseudo document) on to the
> term-document vector space and based on some threshold we find the relevant
> documents. Very similarly if we use LSI for clustering, and instead of
> query if we take one of our search result and set different thresholds and
> based on each threshold we can cluster the search result at single shot.

So if I follow, you take one document (how do you decide which?) and then generate a set of clusters as (multi-dimensional) rings around it of increasing radius?

That doesn't sound like it's going to do a good job of producing useful clusters. The group around the "seed" document is probably related, but once you get beyond that the documents in the cluster are defined only by distance from the seed.

In geographical terms, locations which are < 10km from a given point might be a useful cluster, but locations between 10 and 20km from that point are much less likely to be.

Cheers,
    Olly
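The geographical point above can be checked with a small worked example. This is a minimal sketch that assumes flat 2-D coordinates in km and a ring width of 10km; the names `distance`, `ring_index` and the sample points are made up for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two (x, y) points, in km."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def ring_index(point, seed, width=10.0):
    """0 for points < width from the seed, 1 for the next ring, and so on."""
    return int(distance(point, seed) // width)

seed = (0.0, 0.0)
near = (3.0, 4.0)     # 5km from the seed
east = (15.0, 0.0)    # 15km from the seed
west = (-15.0, 0.0)   # 15km from the seed, on the opposite side

# east and west fall in the same 10-20km ring ...
same_ring = ring_index(east, seed) == ring_index(west, seed)
# ... yet they are further from each other than either is from the seed.
gap = distance(east, west)
print(same_ring, gap)
```

Two members of the inner disc can be at most 20km apart, while the two points in the second ring here are 30km apart despite sharing a "cluster".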
Hello sir,

You have interpreted correctly that the clustering is done by forming a ring around a document (the basic idea of LSI). But it does not work by increasing the radius so that each successive shell becomes another cluster. Rather, it picks one document (based on relevance score) and forms a ring around it to create a cluster. Then, from the remaining documents (those in the search results but not yet in a cluster), another document is picked and the next cluster is formed. This continues until all the search results are exhausted.

I have attached a file that illustrates the algorithm geometrically; please have a look at it.

On Wed, Mar 23, 2016 at 12:21 AM, Olly Betts <olly at survex.com> wrote:
> On Tue, Mar 22, 2016 at 02:08:23PM +0530, MURTUZA BOHRA wrote:
> > How Latent semantic indexing would help?
> >
> > In LSI we project query (considering as a pseudo document) on to the
> > term-document vector space and based on some threshold we find the relevant
> > documents. Very similarly if we use LSI for clustering, and instead of
> > query if we take one of our search result and set different thresholds and
> > based on each threshold we can cluster the search result at single shot.
>
> So if I follow, you take one document (how do you decide which) and then
> generate a set of clusters as (multi-dimensional) rings around it of
> increasing radius?
>
> That doesn't sound like it's going to do a good job of producing useful
> clusters. The group around the "seed" document is probably related,
> but once you get beyond that the documents in the cluster are defined
> only by distance from the seed.
>
> In geographical terms, locations which are < 10km from a given point
> might be a useful cluster, but locations between 10 and 20km from that
> point is much less likely to be.
>
> Cheers,
>     Olly
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: LSI_Clustering.jpg
Type: image/jpeg
Size: 1476831 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160323/2a9be78a/attachment-0001.jpg>
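The seed-and-remove procedure described in this message could be sketched roughly as below. This is an illustrative reading of the description, not the attached diagram or any Xapian API: it assumes each result already has a vector in the LSI space, uses cosine similarity with a single threshold as the "ring", and the names `cosine`, `seed_clusters` and the sample data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def seed_clusters(results, threshold):
    """Greedy clustering of (doc_id, relevance, vector) tuples.

    The most relevant unclustered document becomes the seed; every remaining
    document whose similarity to the seed reaches `threshold` joins its
    cluster. This repeats until no documents are left.
    """
    remaining = sorted(results, key=lambda r: r[1], reverse=True)
    clusters = []
    while remaining:
        seed, rest = remaining[0], remaining[1:]
        cluster = [seed[0]]
        leftover = []
        for doc in rest:
            if cosine(seed[2], doc[2]) >= threshold:
                cluster.append(doc[0])
            else:
                leftover.append(doc)
        clusters.append(cluster)
        remaining = leftover
    return clusters

# Hypothetical search results: (id, relevance score, 2-D latent vector).
results = [
    ("d1", 0.9, (1.0, 0.0)),
    ("d2", 0.8, (0.9, 0.1)),
    ("d3", 0.7, (0.0, 1.0)),
    ("d4", 0.6, (0.1, 0.9)),
]
clusters = seed_clusters(results, threshold=0.8)
print(clusters)  # [['d1', 'd2'], ['d3', 'd4']]
```

In this sketch the number of clusters is not fixed in advance, but the outcome depends on the chosen threshold and on the order in which seeds are picked, so the threshold would still need tuning against real queries.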