thr3ads.net - Xapian devel - [Xapian-devel] [GSoC 2014] clustering of search results [Mar 2014]

If this information is useful, please help other people find it:
Share via:

Saksham Maheshwari

2014-Mar-09 20:42 UTC

[Xapian-devel] [GSoC 2014] clustering of search results

Hello guys,

I was looking forward to participate in GSoc14. I have a decent knowledge
about c++ and parsers.
I was looking at the idea pages where I found many interesting projects, in
which "clustering of search results" interests me the most. I want
someone
to take me to the right track in understanding the project so that I can
think about its implementation.
According to me, in this project we have to set some relations between the
query words from dataset or corpus and thus to retrieve the best
document(s).
Is this project somehow equivalent to set expansion?

Hoping for a good discussion :-)

Thanks
Saksham
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140310/75450f4b/attachment-0002.html>

Olly Betts

2014-Mar-10 10:29 UTC

head link

[Xapian-devel] [GSoC 2014] clustering of search results

On Mon, Mar 10, 2014 at 02:12:59AM +0530, Saksham Maheshwari
wrote:> I was looking at the idea pages where I found many interesting projects, in
> which "clustering of search results" interests me the most. I
want someone
> to take me to the right track in understanding the project so that I can
> think about its implementation.
> According to me, in this project we have to set some relations between the
> query words from dataset or corpus and thus to retrieve the best
> document(s).
Exactly what approach the project takes isn't nailed down - it just
seemed something which would be interesting for a student to work on,
and would be useful to Xapian users.

My understanding of the current clustering branch (which may not be
completely accurate) is that it clusters based on a pair-wise measure
of document similarity, and that the user can specify which terms from
the documents are used.  I think you'd consider more than just the words
in the query - in a typical case, the query is short and the top N
documents will match all the words in it.

It's an open question whether the project should be based on the
existing code or not, but I think it should at least attempt to learn
from the existing code - it would be a real shame to spend 3 months
working on this only to end up with two different clustering
implementations, neither of which is usable on larger sets of documents.

I think the clustering would probably be based on the terms in the
documents (I can't really think what else it would be based on).
Possibly using Xapian's query expansion feature (Enquire::get_eset()) to
generate a more restricted list of "interesting" terms to consider
would
help.
> Is this project somehow equivalent to set expansion?
That's related to clustering, but it isn't completely equivalent.

As an example, one way you could generate clusters is to think of each
document as a point in a multi-dimensional space, where each dimension
represents a different term with the distance in that direction being
something like (within_document_frequency / document_length).  In this
space, the distance between two identical documents is 0, and documents
which are more different will tend to be further apart (one word
changed is a small distance; no words in common is a long way apart).

Clustering is then splitting the documents into groups which are near
each other in that space.

Set expansion would mean picking some seed documents to start the sets,
and then going through the remaining documents adding them to the
"nearest" set (by some measure).  These sets are really just the same
as clusters (at least if each document belongs to exactly one cluster)
so this is a way to get you a fixed number of clusters, but this is not
the only way to generate a fixed number of clusters, and not all
clustering starts out looking for a fixed number of clusters.

Cheers,
    Olly

Saksham Maheshwari

2014-Mar-10 14:07 UTC

head link

[Xapian-devel] [GSoC 2014] clustering of search results

On Mon, Mar 10, 2014 at 3:59 PM, Olly Betts <olly at survex.com> wrote:
> Exactly what approach the project takes isn't nailed down - it just
> seemed something which would be interesting for a student to work on,
> and would be useful to Xapian users.
>
> My understanding of the current clustering branch (which may not be
> completely accurate) is that it clusters based on a pair-wise measure
> of document similarity, and that the user can specify which terms from
> the documents are used.  I think you'd consider more than just the
words
> in the query - in a typical case, the query is short and the top N
> documents will match all the words in it.
>
So, what you are saying is that we need to
 1. Assign similar objects to the same subset
 2. Assign dissimilar objects to different subsets
that is, we are trying to make disjoint subsets.
>
> It's an open question whether the project should be based on the
> existing code or not, but I think it should at least attempt to learn
> from the existing code - it would be a real shame to spend 3 months
> working on this only to end up with two different clustering
> implementations, neither of which is usable on larger sets of documents.
>
> I think the clustering would probably be based on the terms in the
> documents (I can't really think what else it would be based on).
> Possibly using Xapian's query expansion feature (Enquire::get_eset())
to
> generate a more restricted list of "interesting" terms to
consider would
> help.
>
Yes, what I know is that clustering will be based on number of clusters
i.e. no. of disjoints sets for that particular document.

Enquire:::get_eset() will return the expand set of related documents. The
corresponding "Xapian::ExpandDecider * edecider " will decide which
document has to be inserted in the expand set.

>
>That's related to clustering, but it isn't completely
equivalent.>
> As an example, one way you could generate clusters is to think of each
> document as a point in a multi-dimensional space, where each dimension
> represents a different term with the distance in that direction being
> something like (within_document_frequency / document_length).  In this
> space, the distance between two identical documents is 0, and documents
> which are more different will tend to be further apart (one word
> changed is a small distance; no words in common is a long way apart).
>
> Clustering is then splitting the documents into groups which are near
> each other in that space.
>
So, here indirectly you are talking about is the vector space model where
we will measure the relatedness between the sets on the basis of their
Euclidean distances. Thus, using k-means algorithm ?

> Set expansion would mean picking some seed documents to start the sets,
> and then going through the remaining documents adding them to the
> "nearest" set (by some measure).  These sets are really just the
same
> as clusters (at least if each document belongs to exactly one cluster)
> so this is a way to get you a fixed number of clusters, but this is not
> the only way to generate a fixed number of clusters, and not all
> clustering starts out looking for a fixed number of clusters.
>
Can you please tell me how should I proceed?
What should I do to start with the project ?

Thanks,
Saksham

PS: Is the project-related to speed up the existing code, that is, to make
it work faster or something else, along with some good merging algorithm to
merge the results.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140310/ff35e024/attachment-0002.html>

Reasonably Related Threads

Search for more apparently analagous threads

Xapian devel - Mar 2014 - [GSoC 2014] clustering of search results

[Xapian-devel] [GSoC 2014] clustering of search results

[Xapian-devel] [GSoC 2014] clustering of search results

[Xapian-devel] [GSoC 2014] clustering of search results

Reasonably Related Threads