On Sat, Mar 12, 2016 at 04:27:55PM +0530, Richhiey Thomas wrote:

> Below I write a raw version of my proposal for Clustering of Search
> Results based on our previous mails.

Hi, Richhiey. Thanks for putting this together ahead of the formal start of applications, and sharing it with us -- and it's really not too long! Project proposals for something that will last the summer are naturally going to be long and get into a lot of detail, so don't shy away from that.

Comments inline (and I've chopped out bits where I don't have anything useful to add, but it should be obvious from context what I'm discussing).

Do please note that before submitting this you'll need to put some work in to make it match our template (https://trac.xapian.org/wiki/GSoCApplicationTemplate). For instance, you haven't explicitly answered the 'who will benefit' question; you hint at it a little in your motivation section, but being as clear as possible about that question often helps when considering other details (such as 'how fast is fast enough', which is a key issue with clustering).

> The algorithm for the given approach will be as follows:

[snip]

> The link to the paper I referred to for proposing this approach is:
> http://ictactjournals.in/paper/IJSCPaper_206-208.pdf

I'd like to ask a few questions about the paper, which hopefully you can answer, because I'm a little confused by it in places.

For instance, in equation (3) it looks to me like a measure with two free variables (i and j) is being defined in terms of two free variables (j and m) and one bound one (i). Is the summation actually supposed to be over m? (Then within the summation x_i is a dw dimensional vector, as is c_j, ie they're both feature vectors, one representing a single item and one representing a cluster centroid.)

I may be misreading something fundamentally here, because aspects of the combined algorithm just don't make sense to me.
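[Editor's note: for concreteness, here is a guess at what equation (3) presumably intends if the summation is indeed over the dw feature dimensions m; this is a reconstruction, not the paper's own notation.]

```latex
% Presumed intent of equation (3): Euclidean distance between document
% feature vector x_i and cluster centroid c_j, summing over the d_w
% feature dimensions m (not over i):
d(x_i, c_j) = \sqrt{\sum_{m=1}^{d_w} (x_{im} - c_{jm})^2}
```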
It doesn't help my understanding that the paper seems to use overlapping references for the particle positions and clusters (eg in (3), (4) and I think (5)), and the cluster centres that are being used as particle data (at the bottom of p207). I think rewriting it with different variables might make this a lot clearer. (That may be the main source of my confusion, in fact!)

As I understand it, though, there are still some parameters to this algorithm: the maximum iterations, the inertia weight and two 'gravity' constants for the velocity calculation in (1), and K. Is that correct? Is there anything we can do to predict a useful value for any of them?

> For implementing this, we will find the TermLists for every document
> and then pass this in a TermListGroup along with the docids to a
> function for the initialization step of the algorithm. In this step,
> we can create a vector of K-means centroids and every particle can
> contain a vector of cluster centers used by PSO.

You mean that you're using the term lists of each document to generate the particle's position vector in a term-based feature space? (The initial centroids for K-means and the particle cluster centers are all randomly initialised, correct?) (That's what I'm expecting, I just want to be sure I'm understanding both the paper and what you've written well enough!)

> Project Timeline:

Please don't be put off by the comments I'm making about the timeline -- writing a plan like this that stretches over several months is incredibly hard, even if you've done lots of them before.
You've done a good job of starting to break the problem down and think about how you're going to manage that over the summer, but I think if you break it down even further you'll end up with a more useful timeline that lets you worry only about the work you're doing right now rather than having to think about where you are overall as well.

> April 22 - May 22 (Before coding period starts): Finish any more
> research required by talking to the mentors and any other resources.
> Have a clear idea of how to go about the project before the coding
> period

It would be good to identify if there is specific work required during this period. What further research do you think is required? Is there anything else you need beyond the cited paper in order to implement their algorithm?

> May 23 - June 6 (Official coding period starts now):
> Define all the classes and functions that will be necessary for the
> implementation of the project

Why is this going to take two weeks? It sounds from your discussion as if you have a good grasp of how to map this into classes already. Remember that we really recommend having timeline sections of no more than a week. The more you force yourself to break it down, the more planning you have to do up front, which generally results in a smoother project!

> June 7 - June 20:
> Implement and test the first phase which is a part dealing with the
> PSO algorithm

Is there a way you can break this into smaller pieces? For instance, are there datasets for PSO out there which you could use to build test cases, ahead of implementing the algorithm itself? If not, you'll have to generate your own, calculating the outcome for some data sets by some other method. (You should be able to use a different implementation to calculate the test data, because most implementations won't have license conditions which restrict the subsequent distribution of output of the implementation.)
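[Editor's note: one way to do what the paragraph above suggests is to run a small reference implementation over tiny fixed datasets and record its output as test fixtures. A hypothetical sketch in plain Python, not the eventual Xapian C++ code -- the function name and data are illustrative only:]

```python
# Sketch: generating K-means test fixtures from a tiny reference
# implementation (plain Lloyd's algorithm), so the real clustering code
# can later be checked against known-good assignments. All names and
# data here are illustrative, not part of any existing API.

def kmeans(points, centroids, iters=20):
    """Plain Lloyd's K-means; returns (centroids, assignments)."""
    assign = []
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        assign = [min(range(len(centroids)),
                      key=lambda j: sum((p - c) ** 2
                                        for p, c in zip(pt, centroids[j])))
                  for pt in points]
        # Recompute each centroid as the mean of its assigned points.
        for j in range(len(centroids)):
            members = [pt for pt, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids, assign

# Two well-separated 2-D blobs with fixed starting centroids, so the
# expected output is deterministic and can be pasted into a test case.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, assign = kmeans(points, [[0.0, 0.1], [5.0, 5.0]])
print(assign)  # [0, 0, 1, 1]
```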
With step 6 of the algorithm, it looks to me like the K-means part that is interwoven with PSO can affect the PSO part, by modifying pgd. Does it make sense to tackle things in this way (PSO then K-means), or would it make more sense to break it down as steps of the algorithm, with each calculation tested separately? (It's probably easier to write the tests that way.)

> June 21 - June 27:
> Making further changes to the code to improve functionality, exception
> handling and bug removal

Exception handling is really something you should do as you go along. I'm also not clear what 'improve functionality' means in this context. Are you concerned that the previous two weeks won't be enough time to complete the PSO algorithm? If so, you definitely want to break it down into smaller pieces; three weeks is generally too long for a single stretch of work (because it's difficult to tell two weeks in if it's going well or not).

> June 28 - July 10:
> Implement and test the second phase which deals with the K-means being
> actually used for clustering.

Again, hopefully you can break this down further.

> July 11 - July 16:
> Making further changes to the code to improve functionality, exception
> handling and bug removal

And again, breaking down the previous block should mean you don't need a 'catch up' period. Don't worry if your estimates go up when you break something down further (so something that you thought would take about two weeks, when broken into smaller pieces, might look like three or even four weeks).

> July 17 - July 28:
> Implement or use a concrete measure to test the performance of the
> newly implemented clustering algorithm against the previously
> implemented one in terms of speed and accuracy

Do you have an idea of what measure to use for accuracy? The paper you cited uses between and within cluster measures, but there may be other techniques.
(Also, I don't know if you have access to any pre-clustered test data, which would allow you to calculate a precision rate.)

It's also going to be important to carefully choose the test data you use for this. Is there anything in the TREC collections that would be suitable?

> July 28 - August 1:
> Find out drawbacks that may come out while testing the proposed
> solution and tackle it during this time by changing some parts of the
> code that may increase speed or efficiency.

It's good to have a period to incorporate learning from the evaluation phase. Is a week enough? (This would be one area where I might consider having a longer period, and to be prepared to plan that while or just after conducting the evaluation, based on what you learn.)

> August 2 - August 16:
> Cleaning the written code, improving the documentation or writing
> documentation afresh, identifying any bugs that may come out and
> writing tests before the final submissions

GSoC has a slightly unfortunate habit of suggesting in their own timeline that you work on documentation and tests at the end. We strongly recommend that you work on them as you are writing the code (with tests in particular, it's much easier to be confident with where you are if you have tests as you go, but it's also easier to write documentation for something you're working on right now, rather than something you coded a month or two previously).

We generally recommend that you include some 'stretch' goals in your project, for if everything goes particularly well and you have more time at the end. So it would be good to have a think about what else could fit within this project. But don't worry if you can't think of any; I think with the evaluation phase and then working based on what you learn from there that you should have enough to do!

J

-- 
James Aylett, occasional trouble-maker
xapian.org
On Sat, Mar 12, 2016 at 11:38 PM, James Aylett <james-xapian at tartarus.org> wrote:

> On Sat, Mar 12, 2016 at 04:27:55PM +0530, Richhiey Thomas wrote:
>
> Hi, Richhiey. Thanks for putting this together ahead of the formal
> start of applications, and sharing it with us -- and it's really not
> too long! Project proposals for something that will last the summer
> are naturally going to be long and get into a lot of detail, so don't
> shy away from that.

Thanks for the quick reply James!

I'll try to answer all your questions in the best way possible by me and send across a better version of this proposal as soon as possible, fitting the Xapian GSoC application template properly (as you mentioned about the project motivation part) :)

> I'd like to ask a few questions about the paper, which hopefully you
> can answer, because I'm a little confused by it in places.
>
> For instance, in equation (3) it looks to me like a measure with two
> free variables (i and j) is being defined in terms of two free
> variables (j and m) and one bound one (i). Is the summation actually
> supposed to be over m? (Then within the summation x_i is a dw
> dimensional vector, as is c_j, ie they're both feature vectors, one
> representing a single item and one representing a cluster centroid.)
>
> I may be misreading something fundamentally here, because aspects of
> the combined algorithm just don't make sense to me. It doesn't help my
> understanding that the paper seems to use overlapping references for
> the particle positions and clusters (eg in (3), (4) and I think (5)),
> and the cluster centres that are being used as particle data (at the
> bottom of p207). I think rewriting it with different variables might
> make this a lot clearer. (That may be the main source of my confusion,
> in fact!)

The way the paper has been written I guess is the main source of your confusion. Let me provide a paper that explains this same concept in a way that is easier to understand.
I was confused by eq (3) that you mentioned too. Here it is:
http://www.sau.ac.in/~vivek/softcomp/clustering%20PSO+K-means.pdf

The whole point of this approach is to use the global search of PSO and the fast convergence of K-means together. So we take every particle and assign it K randomly initialized cluster centroids. Hence, every particle can be seen as X = {C1, C2, ..., CK}. Now for each particle, in every iteration, we have to calculate a fitness value, for which we use the equation that you mentioned above that tricked you :)

For every particle, we consider the cluster centroids that we have assigned to that particle and assign all other documents to the closest of these cluster centroids. Through this, we can obtain a fitness value, which is calculated by finding the summation of distances between all the documents and their cluster centroids and taking its reciprocal, or making some other fitness function that we can define to suit our needs better. Using this fitness function, the next location can be found for the particle.

The combined algorithm uses overlapping references such as cluster centroids as particle data because we are using PSO to optimize this. Every particle can be thought of as a candidate solution for the initial centroids, and PSO, being a method used to optimize candidate solutions iteratively by moving particles around the search space, helps in finding a good initial centroid configuration. By the end of the maximum number of iterations, we arrive at either a sub-optimal or optimal solution for the initial centroids, which can then be fed as seed to the K-means algorithm to provide better clustering results.

I don't know whether I have adequately answered your questions, so please fire away if I've gone wrong or not been clear anywhere!

> As I understand it, though, there are still some parameters to this
> algorithm: the maximum iterations, the inertia weight and two
> 'gravity' constants for the velocity calculation in (1), and K. Is
> that correct?
> Is there anything we can do to predict a useful value for any of
> them?

I don't know of any ways to predict an optimum value for any of these parameters, though I'm sure there will be some mathematical way to do so. But I think since we know how each of these parameters would affect the clustering, we can find the correct values that work well when we test this algorithm against the kinds of datasets this algorithm will deal with later on.

> You mean that you're using the term lists of each document to generate
> the particle's position vector in a term-based feature space? (The
> initial centroids for K-means and the particle cluster centers are all
> randomly initialised, correct?)
> (That's what I'm expecting, I just want to be sure I'm understanding
> both the paper and what you've written well enough!)

Yes, that's exactly what I meant!

> > Project Timeline:
> > April 22 - May 22 (Before coding period starts): Finish any more
> > research required by talking to the mentors and any other resources.
> > Have a clear idea of how to go about the project before the coding
> > period
>
> It would be good to identify if there is specific work required during
> this period. What further research do you think is required? Is there
> anything else you need beyond the cited paper in order to implement
> their algorithm?

During this period, the essential goal would be to finalize the particulars that we will need in the project. For example: finalizing the fitness measure that will be used for PSO, deciding values for the parameters that we need for the mixed algorithm (or finding a way to predict the optimum value of each of those parameters, if possible), and any other detail that may be of concern before starting to implement the project.

> > May 23 - June 6 (Official coding period starts now):
> > Define all the classes and functions that will be necessary for the
> > implementation of the project
>
> Why is this going to take two weeks?
> It sounds from your discussion as if you have a good grasp of how to
> map this into classes already. Remember that we really recommend
> having timeline sections of no more than a week. The more you force
> yourself to break it down, the more planning you have to do up front,
> which generally results in a smoother project!

Actually James, the problem is I will be having my university exams in this period and hence the amount of time I will be able to devote to the project will be kind of restricted during these two weeks. That's the only reason I let this part go slow. It still shouldn't take that much time, but just in case it does, the project won't lag behind.

> > June 7 - June 20:
> > Implement and test the first phase which is a part dealing with the
> > PSO algorithm
>
> Is there a way you can break this into smaller pieces? For instance,
> are there datasets for PSO out there which you could use to build test
> cases, ahead of implementing the algorithm itself? If not, you'll have
> to generate your own, calculating the outcome for some data sets by
> some other method. (You should be able to use a different
> implementation to calculate the test data, because most
> implementations won't have license conditions which restrict the
> subsequent distribution of output of the implementation.)
>
> With step 6 of the algorithm, it looks to me like the K-means part
> that is interwoven with PSO can affect the PSO part, by modifying
> pgd. Does it make sense to tackle things in this way (PSO then
> K-means), or would it make more sense to break it down as steps of the
> algorithm, with each calculation tested separately? (It's probably
> easier to write the tests that way.)

I think the way you suggested is a much better way to look at things, because the two algorithms are tightly coupled and cannot be implemented independently and tested the way I thought about it.
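[Editor's note: to illustrate how the individual calculations discussed above could be tested separately, here is a rough sketch of the two core steps -- the fitness evaluation for one particle's centroid set, and the PSO velocity/position update of equation (1). This is hypothetical plain Python, not the eventual C++/Xapian code; all names and parameter values are illustrative.]

```python
import math
import random

def fitness(centroids, docs):
    """Fitness of one particle: reciprocal of the summed distance from
    each document to the nearest of this particle's centroids."""
    total = sum(min(math.dist(doc, c) for c in centroids) for doc in docs)
    return 1.0 / total if total else float('inf')

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5,
             rng=random.random):
    """One PSO update as in equation (1): inertia weight w plus the two
    'gravity' terms pulling towards the personal and global bests.
    Default w, c1, c2 are common illustrative choices, not tuned."""
    r1, r2 = rng(), rng()
    new_vel = [w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
               for x, v, pb, gb in zip(pos, vel, pbest, gbest)]
    new_pos = [x + v for x, v in zip(pos, new_vel)]
    return new_pos, new_vel

# Fine-grained checks of each calculation on tiny fixed data:
docs = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0)]
# Centroids near the data should score higher than centroids far away.
assert fitness([(0.5, 0.0), (4.0, 4.0)], docs) > \
       fitness([(9.0, 9.0), (8.0, 8.0)], docs)

# With zero velocity and pbest == gbest, the particle moves towards it;
# a fixed rng makes the step deterministic for testing.
pos, vel = pso_step([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0],
                    rng=lambda: 0.5)
print(pos)  # [1.5, 1.5]
```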
To break things down further, we can look at things like this:

June 7 - June 10:
Create a class to represent particles and write code to initialize each particle in the dataset with cluster centroids.

June 11 - June 16:
Implement code to find the fitness value for each particle and to find the new location of the particle.

June 17 - June 20:
Test the code against any dataset which contains documents, so that we can see how the code finds the initial centroids and so that we can tweak the initial parameters to get good performance.

> > June 21 - June 27:
> > Making further changes to the code to improve functionality,
> > exception handling and bug removal
>
> Exception handling is really something you should do as you go
> along. I'm also not clear what 'improve functionality' means in this
> context. Are you concerned that the previous two weeks won't be enough
> time to complete the PSO algorithm? If so, you definitely want to
> break it down into smaller pieces; three weeks is generally too long
> for a single stretch of work (because it's difficult to tell two weeks
> in if it's going well or not).

'Improve functionality' was just meant to say that in case there were portions that weren't implemented completely during the above phase, they can be completed now. If not, this period can be completely devoted to writing test cases, or taking care of exception handling and bug removal, to make sure the code written is in good shape.

> > June 28 - July 10:
> > Implement and test the second phase which deals with the K-means
> > being actually used for clustering.
>
> Again, hopefully you can break this down further.

Similar to what I did above, we will just go ahead with implementing the second phase of the algorithm, which now includes incorporating the K-means algorithm with the seed cluster centroids.

June 28 - July 4:
Use the cluster centroid vectors from PSO and implement assigning documents to the closest centroids.
July 4 - July 12:
Reassign cluster centroids and finish up whatever is left in the complete implementation of the algorithm.

> > July 11 - July 18:
> > Making further changes to the code to improve functionality,
> > exception handling and bug removal
>
> And again, breaking down the previous block should mean you don't need
> a 'catch up' period. Don't worry if your estimates go up when you
> break something down further (so something that you thought would take
> about two weeks, when broken into smaller pieces, might look like
> three or even four weeks).

I think this period will be essential for writing some test cases, concentrating on making sure no bugs creep in, and on keeping the code indented, clean and exception free. Also, in case of a delay, this would be the time to catch up. By this time, the implementation of the project should be complete.

> > July 17 - July 28:
> > Implement or use a concrete measure to test the performance of the
> > newly implemented clustering algorithm against the previously
> > implemented one in terms of speed and accuracy
>
> Do you have an idea of what measure to use for accuracy? The paper you
> cited uses between and within cluster measures, but there may be other
> techniques. (Also, I don't know if you have access to any
> pre-clustered test data, which would allow you to calculate a
> precision rate.)
>
> It's also going to be important to carefully choose the test data you
> use for this. Is there anything in the TREC collections that would be
> suitable?

The new paper that I have referred to uses TREC-5, TREC-6 and TREC-7, so these datasets can be used. Using entropy and within-cluster frequency, I feel, would be good indicators of the quality of the clustering that we have got.

> > July 28 - August 1:
> > Find out drawbacks that may come out while testing the proposed
> > solution and tackle it during this time by changing some parts of
> > the code that may increase speed or efficiency.
> It's good to have a period to incorporate learning from the evaluation
> phase. Is a week enough? (This would be one area where I might
> consider having a longer period, and to be prepared to plan that while
> or just after conducting the evaluation, based on what you learn.)
>
> > August 2 - August 16:
> > Cleaning the written code, improving the documentation or writing
> > documentation afresh, identifying any bugs that may come out and
> > writing tests before the final submissions
>
> GSoC has a slightly unfortunate habit of suggesting in their own
> timeline that you work on documentation and tests at the end. We
> strongly recommend that you work on them as you are writing the code
> (with tests in particular, it's much easier to be confident with where
> you are if you have tests as you go, but it's also easier to write
> documentation for something you're working on right now, rather than
> something you coded a month or two previously).
>
> We generally recommend that you include some 'stretch' goals in your
> project, for if everything goes particularly well and you have more
> time at the end. So it would be good to have a think about what else
> could fit within this project. But don't worry if you can't think of
> any; I think with the evaluation phase and then working based on what
> you learn from there that you should have enough to do!

I've tried to break down the timeline out here, but I'm sure that this is still not perfect enough. So I'd love it if you could tell me where I am going wrong and what more detail I could provide, so I can think about it and get back to you. Also, as of now I'm not able to think of any stretch goals for the project, but I'll let you know if I come up with something worthwhile!

> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
On Mon, Mar 14, 2016 at 02:09:13AM +0530, Richhiey Thomas wrote:

> The way the paper has been written I guess is the main source of your
> confusion. Let me provide a paper that explains this same concept in a
> way that is easier to understand. I was confused by eq (3) that you
> mentioned too. Here it is:
> http://www.sau.ac.in/~vivek/softcomp/clustering%20PSO+K-means.pdf

Ah, that's helpful -- thanks!

> The whole point of this approach is to use the global search of PSO
> and the fast convergence of K-means together. So we take every
> particle and assign it k randomly initialized cluster centroids.

From the above paper, it sounds like they are randomly chosen documents as the initial centroids.

> For every particle, we consider the cluster centroids that we have
> assigned to that particle and assign all other documents to the
> closest of these cluster centroids.

This is one of the details that I think escaped me when reading the first paper. It's much clearer in the above paper.

> Through this, we can obtain a fitness value which is calculated by
> finding the summation of distances between all the documents and
> their cluster centroids and taking its reciprocal, or making some
> other fitness function that we can define to suit our needs better.
> Using this fitness function, the next location can be found for the
> particle.

By moving each centroid for each particle, using the PSO iteration equation?

> > As I understand it, though, there are still some parameters to this
> > algorithm: the maximum iterations, the inertia weight and two
> > 'gravity' constants for the velocity calculation in (1), and K. Is
> > that correct? Is there anything we can do to predict a useful value
> > for any of them?
>
> I don't know of any ways to predict an optimum value for any of these
> parameters, though I'm sure there will be some mathematical way to do
> so.
> But I think since we know how each of these parameters would affect
> the clustering, we can find the correct values that work well when we
> test this algorithm against the kinds of datasets this algorithm will
> deal with later on.

If there are good default values that will work for most people, that would be fine (that's what we do for the parameters to BM25, for instance; in more extreme situations you have to change from the defaults, but most of the time you can just accept them and get reasonable behaviour). If the documentation can give an idea of when you'd want to change from the defaults, that would be even better.

> > > May 23 - June 6 (Official coding period starts now):
> > > Define all the classes and functions that will be necessary for
> > > the implementation of the project
>
> I will be having my university exams in this period and hence, the
> amount of time I will be able to devote to the project will be kind
> of restricted during these two weeks. That's the only reason I let
> this part go slow. It still shouldn't take that much time, but just
> in case it does, the project won't lag behind.

Okay, that's obviously fine -- just note that in your timeline alongside what you want to get done during that period. (Which might be nothing. There's some benefit to just concentrating on your exams and coming back fresh afterwards!)

> To break things down further, we can look at things like this:
>
> June 7 - June 10:
> Create a class to represent particles and write code to initialize
> each particle in the dataset with cluster centroids.
>
> June 11 - June 16:
> Implement code to find the fitness value for each particle and to
> find the new location of the particle.
> June 17 - June 20:
> Test the code against any dataset which contains documents, so that
> we can see how the code will be finding the initial centroids and so
> that we can tweak the initial parameters to get good performance.

It'd be good to break the tests into two groups: one very fine grained, testing individual pieces of functionality (such as the fitness calculation, and calculating the new particle location); one higher level, testing the overall algorithm. Depending on how your classes work, the fine grained tests may not need to know about Xapian::Document, which might make them easier to write. Either way, I'd strongly recommend writing the fine grained tests either before or at the same time as the code they test.

> Improve functionality was just meant to say that in case there were
> portions that weren't implemented completely during the above phase,
> they can be completed now. If not, this period can be completely
> devoted to writing test cases, or taking care of exception handling
> and bug removal, to make sure the code written is in good shape.

And again: if you write the tests as you go along, you don't need to think like this. Writing tests afterwards tends to result in less good tests, because you'll believe that the code already works. Writing the tests earlier means you're using the tests to tell you when the code works, ie when the work is complete.

It's fine to have some time in the timeline for catching up and making sure everything's fine, though. I'm just trying to nudge you towards writing tests early rather than late :-)

> June 28 - July 4:
> Use the cluster centroid vectors from PSO and implement assigning
> documents to the closest centroids.

Surely you'll have a way of assigning documents to the closest centroids in a particle from the PSO work earlier?
So carrying the best particle across as the initial centroid set for K-means should be fairly straightforward.

> July 4 - July 12:
> Reassign cluster centroids and finish up whatever is left in the
> complete implementation of the algorithm.
>
> I think this period will be essential for writing some test cases,
> concentrating on making sure no bugs creep in, and on keeping the
> code indented, clean and exception free.

On things like code indentation, you may want to look at clang-format. It may take a little while to get a format definition file that matches Xapian's code layout, but you can then run it regularly so you aren't spending ages fixing things up later. (The sooner you fix any indentation issues, the faster you'll get familiar with indenting things in the style of Xapian.)

(We don't have a clang-format file at the moment, but if you were to make one as part of this project, we could include it for people in future.)

> I've tried to break down the timeline out here, but I'm sure that
> this is still not perfect enough. So I'd love it if you could tell me
> where I am going wrong and what more detail I could provide, so I can
> think about it and get back to you.

What you have now is a lot better than your previous timeline -- hopefully you feel that it's more structured as well. I think at this point the best move is to update your draft, and (in a few hours) open an application on the GSoC website, at which point we can discuss things further.

> Also, as of now I'm not able to think of any stretch goals for the
> project, but I'll let you know if I come up with something worthwhile!

If you don't, I wouldn't worry about it too much. You could spend more time evaluating options (including maybe looking at cosine instead of Euclidean distance), which would result in a more comprehensive explanation of the options for future users. And you may discover other ideas while actually doing the project.

J

-- 
James Aylett, occasional trouble-maker
xapian.org
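[Editor's note: the cosine-vs-Euclidean suggestion above is worth a concrete illustration. For term-frequency vectors, cosine distance ignores overall document length while Euclidean distance does not; a hypothetical sketch in plain Python, with illustrative data:]

```python
import math

# Two documents with the same mix of terms but very different lengths:
# cosine distance treats them as identical, Euclidean does not.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a)) *
            math.sqrt(sum(y * y for y in b)))
    return 1.0 - dot / norm

a = [1.0, 2.0, 0.0]    # short document's term-frequency vector
b = [10.0, 20.0, 0.0]  # same term proportions, ten times longer
print(round(cosine_distance(a, b), 6))  # 0.0
print(euclidean(a, b) > 0)              # True
```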