thr3ads.net - Xapian devel - [Xapian-devel] [GSoC 2014] clustering of search results [Mar 2014]

If this information is useful, please help other people find it:
Share via:

Saksham Maheshwari

2014-Mar-10 14:07 UTC

[Xapian-devel] [GSoC 2014] clustering of search results

On Mon, Mar 10, 2014 at 3:59 PM, Olly Betts <olly at survex.com> wrote:
> Exactly what approach the project takes isn't nailed down - it just
> seemed something which would be interesting for a student to work on,
> and would be useful to Xapian users.
>
> My understanding of the current clustering branch (which may not be
> completely accurate) is that it clusters based on a pair-wise measure
> of document similarity, and that the user can specify which terms from
> the documents are used.  I think you'd consider more than just the
words
> in the query - in a typical case, the query is short and the top N
> documents will match all the words in it.
>
So, what you are saying is that we need to
 1. Assign similar objects to the same subset
 2. Assign dissimilar objects to different subsets
that is, we are trying to make disjoint subsets.
>
> It's an open question whether the project should be based on the
> existing code or not, but I think it should at least attempt to learn
> from the existing code - it would be a real shame to spend 3 months
> working on this only to end up with two different clustering
> implementations, neither of which is usable on larger sets of documents.
>
> I think the clustering would probably be based on the terms in the
> documents (I can't really think what else it would be based on).
> Possibly using Xapian's query expansion feature (Enquire::get_eset())
to
> generate a more restricted list of "interesting" terms to
consider would
> help.
>
Yes, what I know is that clustering will be based on number of clusters
i.e. no. of disjoints sets for that particular document.

Enquire:::get_eset() will return the expand set of related documents. The
corresponding "Xapian::ExpandDecider * edecider " will decide which
document has to be inserted in the expand set.

>
>That's related to clustering, but it isn't completely
equivalent.>
> As an example, one way you could generate clusters is to think of each
> document as a point in a multi-dimensional space, where each dimension
> represents a different term with the distance in that direction being
> something like (within_document_frequency / document_length).  In this
> space, the distance between two identical documents is 0, and documents
> which are more different will tend to be further apart (one word
> changed is a small distance; no words in common is a long way apart).
>
> Clustering is then splitting the documents into groups which are near
> each other in that space.
>
So, here indirectly you are talking about is the vector space model where
we will measure the relatedness between the sets on the basis of their
Euclidean distances. Thus, using k-means algorithm ?

> Set expansion would mean picking some seed documents to start the sets,
> and then going through the remaining documents adding them to the
> "nearest" set (by some measure).  These sets are really just the
same
> as clusters (at least if each document belongs to exactly one cluster)
> so this is a way to get you a fixed number of clusters, but this is not
> the only way to generate a fixed number of clusters, and not all
> clustering starts out looking for a fixed number of clusters.
>
Can you please tell me how should I proceed?
What should I do to start with the project ?

Thanks,
Saksham

PS: Is the project-related to speed up the existing code, that is, to make
it work faster or something else, along with some good merging algorithm to
merge the results.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140310/ff35e024/attachment-0002.html>

Olly Betts

2014-Mar-11 13:01 UTC

head link

[Xapian-devel] [GSoC 2014] clustering of search results

On Mon, Mar 10, 2014 at 07:37:30PM +0530, Saksham Maheshwari
wrote:> On Mon, Mar 10, 2014 at 3:59 PM, Olly Betts <olly at survex.com>
wrote:
> 
> > Exactly what approach the project takes isn't nailed down - it
just
> > seemed something which would be interesting for a student to work on,
> > and would be useful to Xapian users.
> >
> > My understanding of the current clustering branch (which may not be
> > completely accurate) is that it clusters based on a pair-wise measure
> > of document similarity, and that the user can specify which terms from
> > the documents are used.  I think you'd consider more than just the
words
> > in the query - in a typical case, the query is short and the top N
> > documents will match all the words in it.
> 
> So, what you are saying is that we need to
>  1. Assign similar objects to the same subset
>  2. Assign dissimilar objects to different subsets
> that is, we are trying to make disjoint subsets.
I think so.  While there are clustering techniques where an item can be
in more than one cluster, or in which items have a degree of belonging
to each cluster, I'd say a strict partition into disjoint subsets is
likely to be most useful for presenting search results in easier to
comprehend ways.
> > It's an open question whether the project should be based on the
> > existing code or not, but I think it should at least attempt to learn
> > from the existing code - it would be a real shame to spend 3 months
> > working on this only to end up with two different clustering
> > implementations, neither of which is usable on larger sets of
documents.
> >
> > I think the clustering would probably be based on the terms in the
> > documents (I can't really think what else it would be based on).
> > Possibly using Xapian's query expansion feature
(Enquire::get_eset()) to
> > generate a more restricted list of "interesting" terms to
consider would
> > help.
> 
> Yes, what I know is that clustering will be based on number of clusters
> i.e. no. of disjoints sets for that particular document.
I'm not sure I follow what you're saying here.  Isn't each
particular
document going to be in exactly one set?
> Enquire:::get_eset() will return the expand set of related documents. The
> corresponding "Xapian::ExpandDecider * edecider " will decide
which
> document has to be inserted in the expand set.
No, get_eset() returns a set of *TERMS*, not documents.  It works
based on a set of documents marked as relevant (an RSet).

And an ExpandDecider decides which *TERMS* to allow into the expand set.

What I was suggesting is that you could potentially generate an ESet and
use that as a restricted list of terms to consider.  But generating an
ESet does takes a little time in itself.  Perhaps it would be better to
just pick the terms with the highest "discriminating power" (really
common terms aren't interesting, but neither are the really rare ones -
at the extremes, a term is no help to clustering if it is in all
documents or if it is in only one).
> So, here indirectly you are talking about is the vector space model where
> we will measure the relatedness between the sets on the basis of their
> Euclidean distances. Thus, using k-means algorithm ?
It was just an example which I hoped would help you visualise what I was
talking about.  I'm not trying to say the clustering should be done
based on a vector space model.

Using k-means clustering is an option, though it has the drawback that
you need to decide how many clusters you want first, but there might be
a different number of natural clusters in the data.  It would be more
satisfactory to be able to say you want (say) "about 7 clusters", or
"between 6 and 10 clusters".  Running the k-means clustering multiple
times to try different numbers of clusters probably isn't feasible for
our applications.
> Can you please tell me how should I proceed?  What should I do to
> start with the project ?
If you've not already done so, the first step is really to check out the
Xapian code and get it to build.

If/once you've done that, I'd suggest looking through the existing
clustering code and working out what it is doing.  If you're familiar
with profiling tools, then you could even try profiling it to see
where the time is spent and what might be causing it to scale poorly.
> PS: Is the project-related to speed up the existing code, that is, to make
> it work faster or something else,
The important goal is to have a clustering feature which works at a
usable speed.

That could mean taking the existing clustering code and making it work
faster, or it could mean replacing it with new code.  If we start
again, I think it would be prudent to at least try to understand why
the current code doesn't scale as we'd like.  If we don't have some
idea why the previous attempt didn't work out, we risk repeating the
same mistakes.
 > along with some good merging algorithm to merge the results.
I'm confused - merging what results?

Cheers,
    Olly

Saksham Maheshwari

2014-Mar-21 16:09 UTC

head link

[Xapian-devel] [GSoC 2014] clustering of search results

Hello all,

I have submitted the final proposal for this project.
Here's the link:

https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/sam_1993/5707702298738688?validated=True#

I would like to hear what's missing in this, so that I can do the necessary
changes. It would really be a great help.

Thanks,
Saksham


On Tue, Mar 11, 2014 at 6:31 PM, Olly Betts <olly at survex.com> wrote:
> On Mon, Mar 10, 2014 at 07:37:30PM +0530, Saksham Maheshwari wrote:
> > On Mon, Mar 10, 2014 at 3:59 PM, Olly Betts <olly at survex.com>
wrote:
> >
> > > Exactly what approach the project takes isn't nailed down -
it just
> > > seemed something which would be interesting for a student to work
on,
> > > and would be useful to Xapian users.
> > >
> > > My understanding of the current clustering branch (which may not
be
> > > completely accurate) is that it clusters based on a pair-wise
measure
> > > of document similarity, and that the user can specify which terms
from
> > > the documents are used.  I think you'd consider more than
just the
> words
> > > in the query - in a typical case, the query is short and the top
N
> > > documents will match all the words in it.
> >
> > So, what you are saying is that we need to
> >  1. Assign similar objects to the same subset
> >  2. Assign dissimilar objects to different subsets
> > that is, we are trying to make disjoint subsets.
>
> I think so.  While there are clustering techniques where an item can be
> in more than one cluster, or in which items have a degree of belonging
> to each cluster, I'd say a strict partition into disjoint subsets is
> likely to be most useful for presenting search results in easier to
> comprehend ways.
>
> > > It's an open question whether the project should be based on
the
> > > existing code or not, but I think it should at least attempt to
learn
> > > from the existing code - it would be a real shame to spend 3
months
> > > working on this only to end up with two different clustering
> > > implementations, neither of which is usable on larger sets of
> documents.
> > >
> > > I think the clustering would probably be based on the terms in
the
> > > documents (I can't really think what else it would be based
on).
> > > Possibly using Xapian's query expansion feature
(Enquire::get_eset())
> to
> > > generate a more restricted list of "interesting" terms
to consider
> would
> > > help.
> >
> > Yes, what I know is that clustering will be based on number of
clusters
> > i.e. no. of disjoints sets for that particular document.
>
> I'm not sure I follow what you're saying here.  Isn't each
particular
> document going to be in exactly one set?
>
> > Enquire:::get_eset() will return the expand set of related documents.
The
> > corresponding "Xapian::ExpandDecider * edecider " will
decide which
> > document has to be inserted in the expand set.
>
> No, get_eset() returns a set of *TERMS*, not documents.  It works
> based on a set of documents marked as relevant (an RSet).
>
> And an ExpandDecider decides which *TERMS* to allow into the expand set.
>
> What I was suggesting is that you could potentially generate an ESet and
> use that as a restricted list of terms to consider.  But generating an
> ESet does takes a little time in itself.  Perhaps it would be better to
> just pick the terms with the highest "discriminating power"
(really
> common terms aren't interesting, but neither are the really rare ones -
> at the extremes, a term is no help to clustering if it is in all
> documents or if it is in only one).
>
> > So, here indirectly you are talking about is the vector space model
where
> > we will measure the relatedness between the sets on the basis of their
> > Euclidean distances. Thus, using k-means algorithm ?
>
> It was just an example which I hoped would help you visualise what I was
> talking about.  I'm not trying to say the clustering should be done
> based on a vector space model.
>
> Using k-means clustering is an option, though it has the drawback that
> you need to decide how many clusters you want first, but there might be
> a different number of natural clusters in the data.  It would be more
> satisfactory to be able to say you want (say) "about 7 clusters",
or
> "between 6 and 10 clusters".  Running the k-means clustering
multiple
> times to try different numbers of clusters probably isn't feasible for
> our applications.
>
> > Can you please tell me how should I proceed?  What should I do to
> > start with the project ?
>
> If you've not already done so, the first step is really to check out
the
> Xapian code and get it to build.
>
> If/once you've done that, I'd suggest looking through the existing
> clustering code and working out what it is doing.  If you're familiar
> with profiling tools, then you could even try profiling it to see
> where the time is spent and what might be causing it to scale poorly.
>
> > PS: Is the project-related to speed up the existing code, that is, to
> make
> > it work faster or something else,
>
> The important goal is to have a clustering feature which works at a
> usable speed.
>
> That could mean taking the existing clustering code and making it work
> faster, or it could mean replacing it with new code.  If we start
> again, I think it would be prudent to at least try to understand why
> the current code doesn't scale as we'd like.  If we don't have
some
> idea why the previous attempt didn't work out, we risk repeating the
> same mistakes.
>
> > along with some good merging algorithm to merge the results.
>
> I'm confused - merging what results?
>
> Cheers,
>     Olly
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140321/f078f56a/attachment-0002.html>

Xapian devel - Mar 2014 - [GSoC 2014] clustering of search results

[Xapian-devel] [GSoC 2014] clustering of search results

[Xapian-devel] [GSoC 2014] clustering of search results

[Xapian-devel] [GSoC 2014] clustering of search results