Hello sir,

You have interpreted correctly that the clustering is done by forming a ring around a document (i.e. the basic idea of LSI). However, the radius is not increased so that each successive shell becomes another cluster. Instead, the algorithm picks one document (based on relevance score) and forms a ring around it; the documents inside that ring make up one cluster. Then, from the remaining documents (those in the search results but not yet in any cluster), another document is picked and the next cluster is formed around it. This continues until all the search results are exhausted.

I have attached a file that illustrates the algorithm geometrically; please have a look at it.

On Wed, Mar 23, 2016 at 12:21 AM, Olly Betts <olly at survex.com> wrote:
> On Tue, Mar 22, 2016 at 02:08:23PM +0530, MURTUZA BOHRA wrote:
> > How Latent semantic indexing would help?
> >
> > In LSI we project query (considering as a pseudo document) on to the
> > term-document vector space and based on some threshold we find the relevant
> > documents. Very similarly if we use LSI for clustering, and instead of
> > query if we take one of our search result and set different thresholds and
> > based on each threshold we can cluster the search result at single shot.
>
> So if I follow, you take one document (how do you decide which) and then
> generate a set of clusters as (multi-dimensional) rings around it of
> increasing radius?
>
> That doesn't sound like it's going to do a good job of producing useful
> clusters. The group around the "seed" document is probably related,
> but once you get beyond that the documents in the cluster are defined
> only by distance from the seed.
>
> In geographical terms, locations which are < 10km from a given point
> might be a useful cluster, but locations between 10 and 20km from that
> point is much less likely to be.
>
> Cheers,
> Olly

-------------- next part --------------
A non-text attachment was scrubbed...
Name: LSI_Clustering.jpg
Type: image/jpeg
Size: 1476831 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160323/2a9be78a/attachment-0001.jpg>
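The pick-a-seed, form-a-ring, repeat procedure described in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposal's actual implementation and not Xapian API: the function names, the tuple layout, the use of cosine similarity as the "ring", and the rule that the highest-scoring remaining document becomes the next seed are all assumptions made here. In practice the vectors would be the LSI-reduced document vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def seeded_clusters(results, threshold):
    """Greedy seeded clustering of search results.

    results: list of (doc_id, relevance_score, vector) tuples (hypothetical layout).
    threshold: minimum cosine similarity to the seed for a document to join
               the seed's cluster (the "ring" around the seed).
    Returns a list of clusters; each cluster is a list of doc_ids, seed first.
    """
    # Assumption: the most relevant remaining document is always the next seed.
    remaining = sorted(results, key=lambda r: r[1], reverse=True)
    clusters = []
    while remaining:
        seed = remaining[0]
        cluster = [seed[0]]
        leftover = []
        for doc in remaining[1:]:
            if cosine(seed[2], doc[2]) >= threshold:
                cluster.append(doc[0])   # inside the ring around the seed
            else:
                leftover.append(doc)     # stays available for a later seed
        clusters.append(cluster)
        remaining = leftover
    return clusters


# Toy data: two-dimensional vectors standing in for LSI-reduced documents.
results = [
    ("d1", 0.9, [1.0, 0.0]),
    ("d2", 0.8, [0.9, 0.1]),
    ("d3", 0.7, [0.0, 1.0]),
    ("d4", 0.6, [0.1, 0.9]),
    ("d5", 0.5, [1.0, 1.0]),
]
clusters = seeded_clusters(results, threshold=0.9)
print(clusters)
```

Each pass removes the seed from the pool, so the loop always terminates, and every search result ends up in exactly one cluster. A document that is close to no seed (d5 in the toy data) becomes a cluster of its own.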
I think I should explain the proposed algorithm more clearly in the proposal. I did not do so because I thought it would make the proposal too lengthy. Is there a word limit for the proposal?
The second reference in my proposal is a research paper that uses the Vector Space Model to cluster search results. My proposed algorithm is based on it, with a slight modification. The paper first uses the vector space model to find popular phrases in the documents, which serve as labels for the different clusters; it then uses LSI to find the relevant documents for each cluster label. My algorithm does the same thing, but instead of finding popular phrases it uses the search result documents themselves as the points around which the results are clustered, which I expect to give better results.
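For reference, the LSI step mentioned in this thread (projecting a query, or here a search result, into the reduced space and keeping everything above a threshold) is usually written as below. This is the textbook fold-in formulation, not something taken from the thread or from the referenced paper; the symbols A, k, d_s and τ are notation introduced here for illustration.

```latex
% Rank-k truncated SVD of the term-document matrix A (terms x documents)
A \approx A_k = U_k \Sigma_k V_k^{\top}

% Fold a term vector x (a query, or a search-result document) into the k-dimensional space
\hat{x} = \Sigma_k^{-1} U_k^{\top} x

% Cluster around a seed document d_s: every result within similarity threshold \tau
C_s = \{\, d_j : \cos(\hat{d}_s, \hat{d}_j) \ge \tau \,\},
\qquad
\cos(\hat{a}, \hat{b}) = \frac{\hat{a}^{\top} \hat{b}}{\lVert \hat{a} \rVert \, \lVert \hat{b} \rVert}
```

With a query in place of d_s this is ordinary LSI retrieval; using a search result as d_s gives the "ring" around a seed document that the messages describe.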
On Wed, Mar 23, 2016 at 04:45:55PM +0530, MURTUZA BOHRA wrote:
> I think should explain the proposed algorithm in the proposal more
> clearly. I did not do that because I thought it would make the
> proposal lengthy. Is there a word limit for the proposal??

No word limit, no. You can reference existing papers or books if the algorithm is already well-described, but you should mention any modifications in the way you're proposing to do things for Xapian, including where the data comes from (a lot of papers assume a theoretical idealised data model, so you'd need to show that you know where that data exists in a Xapian database, or include time in your proposal for adding it if there's data required which doesn't currently exist in Xapian).

J

-- 
James Aylett, occasional trouble-maker
xapian.org