thr3ads.net - Xapian devel - GSoC 2016 - Introduction [May 2016]

Richhiey Thomas

2016-May-01 16:23 UTC

GSoC 2016 - Introduction

Before going ahead with the tests as you mentioned above, I would just like
to clarify a few higher level things that I am still in doubt about.

1) As discussed during the IRC interview, it was suggested that I first
implement plain K-means clustering and then add the PSO module as
functionality that can improve the quality of clustering, with speed as the
trade-off. This is the way I should see the project, right?

2) Isn't it easier to first think about the API for the clustering
functionality rather than deriving it through test cases? (I'm not used to
thinking this way, so it is kind of hard to think in reverse.) Do correct me
if writing tests first is the better way.

3) The fitness measure I plan to use for the PSO part and also for
evaluating the clustering results is ADDC (average distance of documents to
the cluster centroid). Is this the best fit?

4) For parameters in K-means and PSO, can default values be set which can
then be overridden for special use cases?

5) There is already a clustering branch that was created before. Do I have
to continue work with the existing implementation or do I start afresh?

Currently I'm looking at the previous clustering branch and the test API
and getting used to the things I am not familiar with in the codebase. Once
I am confident, I'll go ahead with a simple test for the clustering as you
suggested.

Thanks

James Aylett

2016-May-02 16:48 UTC

GSoC 2016 - Introduction

On Sun, May 01, 2016 at 09:53:58PM +0530, Richhiey Thomas wrote:
> Before going ahead with the tests as you mentioned above, I would
> just like to clarify a few higher level things that I am still in
> doubt about.
Hi Richhiey.
> 1) As discussed during the IRC interview, it was suggested that I first
> implement plain K-means clustering and then add the PSO module as
> functionality that can improve the quality of clustering, with speed as the
> trade-off. This is the way I should see the project, right?
Absolutely. You should be aiming to get a small, useful, piece of work
merged. Obviously clustering isn't a tiny piece of work, but K-means
clustering on its own is smaller than PSO+K-means.

Building them as separate parts might also make it easier in future to
experiment with other algorithms and other combinations, perhaps
helping to verify the paper's performance claims for our problem
domains. But if you build K-means first you don't even have to worry
about that at the beginning, and can aim to get something merged into
master before you have to think about it.
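
To illustrate the "separate parts" idea, here is a minimal, self-contained sketch of plain K-means that takes its initial centroids as an argument, so that a later component (PSO or anything else) could supply better starting points without touching the K-means code. It uses dense vectors for brevity. Every name in it (Point, KMeansResult, kmeans) is hypothetical and is not part of Xapian's API.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical types for illustration only; not Xapian API.
using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double squared_distance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

struct KMeansResult {
    std::vector<Point> centroids;
    std::vector<std::size_t> assignment;  // cluster index for each point
};

// Plain Lloyd-style K-means. The initial centroids are passed in, so a
// separate component (random selection, PSO, ...) can choose them.
KMeansResult kmeans(const std::vector<Point>& points,
                    std::vector<Point> centroids,
                    unsigned max_iterations) {
    std::vector<std::size_t> assignment(points.size(), 0);
    for (unsigned iter = 0; iter < max_iterations; ++iter) {
        // Force at least one centroid update on the first pass.
        bool changed = (iter == 0);
        // Assignment step: attach each point to its nearest centroid.
        for (std::size_t p = 0; p < points.size(); ++p) {
            std::size_t best = 0;
            double best_dist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < centroids.size(); ++c) {
                double d = squared_distance(points[p], centroids[c]);
                if (d < best_dist) {
                    best_dist = d;
                    best = c;
                }
            }
            if (assignment[p] != best) {
                assignment[p] = best;
                changed = true;
            }
        }
        if (!changed) break;  // converged: no point switched cluster
        // Update step: move each centroid to the mean of its members.
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            Point sum(centroids[c].size(), 0.0);
            std::size_t count = 0;
            for (std::size_t p = 0; p < points.size(); ++p) {
                if (assignment[p] != c) continue;
                for (std::size_t i = 0; i < sum.size(); ++i) sum[i] += points[p][i];
                ++count;
            }
            if (count == 0) continue;  // empty cluster: leave centroid as is
            for (double& v : sum) v /= static_cast<double>(count);
            centroids[c] = sum;
        }
    }
    return {centroids, assignment};
}
```

Keeping initialisation outside the function is only one possible design; it is shown here because it lets the K-means part be merged and tested on its own.
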
> 2) Isn't it easier to first think about the API for the clustering
> functionality rather than deriving it through test cases? (I haven't been
> used to thinking like this so it gets kind of hard to think in reverse). Do
> correct me if writing tests before is the better way.
To design a good API you need to think like a user of the API. To
write tests of an API you need to write code like a user of the API,
but with a bit more knowledge of where the boundary conditions
are. You can do a draft of the API without writing any code, but you'd
then expect it to change as you write code that uses it.

So writing tests early is a good idea. Another useful thing to do
would be to write sample code (perhaps in the form of a quick 'how to'
page for the getting started guide).

Probably the most important thing is not to worry too much about the
first draft of the API, and expect that it will change based both on
your experience writing code that uses it and feedback from the
community. So in the first instance I'd take whatever route you think
is going to enable you best to communicate how you're thinking about
the API.
> 3) The fitness measure I plan to use for the PSO part and also for
> evaluating the clustering results is ADDC (average distance of documents to
> the cluster centroid). Is this the best fit?
For PSO I don't know, and to an extent it's going to be an experiment
to find out how well anything performs.

For evaluating you can probably use ADDC if you also use some
inter-cluster metric (distance to the nearest other centroid, or
distance to the nearest member of another cluster, perhaps).
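
As a rough sketch of how the two kinds of measure could sit side by side, the code below computes ADDC together with a simple inter-cluster figure, the smallest distance between any two centroids. It assumes one common formulation of ADDC (the mean over non-empty clusters of each cluster's mean document-to-centroid distance); the exact definition used in the project may differ. All names are hypothetical, not Xapian API.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical types for illustration only; not Xapian API.
using Point = std::vector<double>;

// Euclidean distance between two points of equal dimension.
double euclidean_distance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// ADDC under the assumed definition: average each cluster's
// document-to-centroid distance, then average over non-empty clusters.
// Lower values mean tighter clusters.
double addc(const std::vector<Point>& points,
            const std::vector<Point>& centroids,
            const std::vector<std::size_t>& assignment) {
    std::vector<double> total(centroids.size(), 0.0);
    std::vector<std::size_t> count(centroids.size(), 0);
    for (std::size_t p = 0; p < points.size(); ++p) {
        std::size_t c = assignment[p];
        total[c] += euclidean_distance(points[p], centroids[c]);
        ++count[c];
    }
    double sum = 0.0;
    std::size_t non_empty = 0;
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        if (count[c] == 0) continue;
        sum += total[c] / static_cast<double>(count[c]);
        ++non_empty;
    }
    return non_empty == 0 ? 0.0 : sum / static_cast<double>(non_empty);
}

// Inter-cluster measure: smallest distance between any pair of centroids.
// Higher values mean better separated clusters. Returns infinity when
// there are fewer than two centroids.
double min_centroid_separation(const std::vector<Point>& centroids) {
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < centroids.size(); ++i) {
        for (std::size_t j = i + 1; j < centroids.size(); ++j) {
            double d = euclidean_distance(centroids[i], centroids[j]);
            if (d < best) best = d;
        }
    }
    return best;
}
```

Reporting both numbers, rather than folding them into one score, leaves it to the caller to decide whether tightness or separation matters more.
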

Deciding which clusters are 'good' is going to be somewhat of an art,
partly because different uses will demand different outcomes. Some
people may want very tight clusters, some may be more concerned about
having distinct gaps in the vector space between clusters. (Others may
want clusters of roughly equal magnitude.)

Again, this is something that will come after landing a working
K-means system, so we can keep discussing it and throwing out ideas
during the early parts of the project.
> 4) For parameters in K-means and PSO, can default values be set which can
> then be overridden for special use cases?
Yes, that's a good plan. If you don't know what good values are to
start off with (and you probably won't for some of them), just pick
something that makes some kind of sense.
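
One simple way to get overridable defaults in C++ is a plain struct with default member initialisers, sketched below. The struct names, field names and every numeric value are placeholders chosen for illustration, not tuned recommendations and not Xapian API.

```cpp
#include <cstddef>

// Hypothetical parameter bundles; all default values are placeholders.
struct KMeansParams {
    std::size_t k = 8;                    // number of clusters
    unsigned max_iterations = 100;        // hard cap on iterations
    double convergence_tolerance = 1e-4;  // stop when centroids move less
};

struct PSOParams {
    unsigned particles = 20;  // swarm size
    double inertia = 0.72;    // weight on a particle's previous velocity
    double cognitive = 1.49;  // pull towards the particle's own best
    double social = 1.49;     // pull towards the swarm's best
};

// A caller overrides only what matters for their use case.
KMeansParams three_cluster_params() {
    KMeansParams params;  // starts from the defaults
    params.k = 3;         // override a single field
    return params;
}
```

A user who is happy with the defaults just constructs `KMeansParams` and passes it on unchanged.
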
> 5) There is already a clustering branch that was created before. Do I have
> to continue work with the existing implementation or do I start afresh?
No, I'd start afresh. The existing clustering code is from a while
ago, and is quite different to what you're doing.
> Currently I'm looking at the previous clustering branch and the test API
> and getting used to the things I am not familiar with in the codebase. Once
> I am confident, I'll go ahead with a simple test for the clustering as you
> suggested.
Beyond looking at how the previous clustering API worked, I wouldn't
spend much time on it. Spending time with the existing API tests so you
understand how they work is definitely helpful, though.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Richhiey Thomas

2016-May-05 20:59 UTC

GSoC 2016 - Introduction

Hello,

Thanks James for the reply. That cleared a few things up. Apologies for
the late reply; I have exams going on.

I was going through the previous clustering API to understand how it worked,
and it seems that the termlists used for the distance metrics are built
with TF-IDF weighting and compared using cosine similarity, which is very
similar to the approach I would need for this project. The only difference
is that in this case Euclidean distance would be the metric.
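
To make the difference between the two metrics concrete, here is a small sketch that computes both over sparse term vectors (term to weight, for example TF-IDF weights). The TermVector type and function names are hypothetical and do not mirror the previous clustering branch or any Xapian class. For vectors normalised to unit length the two are directly related: the squared Euclidean distance equals 2 * (1 - cosine similarity).

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse document vector: term -> weight (e.g. TF-IDF).
using TermVector = std::map<std::string, double>;

// Dot product over the terms the two vectors share.
double dot(const TermVector& a, const TermVector& b) {
    double sum = 0.0;
    for (const auto& entry : a) {
        auto it = b.find(entry.first);
        if (it != b.end()) sum += entry.second * it->second;
    }
    return sum;
}

// Cosine similarity; defined here as 0 if either vector is all zeros.
double cosine_similarity(const TermVector& a, const TermVector& b) {
    double norms = std::sqrt(dot(a, a)) * std::sqrt(dot(b, b));
    return norms == 0.0 ? 0.0 : dot(a, b) / norms;
}

// Euclidean distance; a term missing from a vector has weight 0.
double euclidean_distance(const TermVector& a, const TermVector& b) {
    double sum = 0.0;
    for (const auto& entry : a) {
        auto it = b.find(entry.first);
        double other = (it != b.end()) ? it->second : 0.0;
        double d = entry.second - other;
        sum += d * d;
    }
    for (const auto& entry : b) {
        // Terms present only in b were not visited by the loop above.
        if (a.find(entry.first) == a.end()) sum += entry.second * entry.second;
    }
    return std::sqrt(sum);
}
```
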

Would it be good to structure it in a way similar to the previous API with
a few changes?

For example, the Xapian::DocSimCosine::similarity() function itself
calculates the TF-IDF vectors and then the similarity. Would it instead be
possible to have a custom weighting scheme subclassing Xapian::Weight? This
would give the user a choice of which weighting scheme to use to create
document vectors in K-means.

More ways of creating document sources should be allowed, for example from
a vector of docids that the user has.

I have also been looking at the existing test API and I'll create a new PR
for a simple test in the next 1-2 days, maybe for checking whether the
value of k is valid or checking the Euclidean distance calculations for
document vectors.
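
As a rough idea of what the k check could look like, here is a sketch using only the standard library. It assumes that "valid" means 1 <= k <= number of documents, which is my assumption rather than something already agreed. The function names are hypothetical, and a real implementation inside Xapian would more likely throw one of Xapian's own exception types and be tested through the existing test harness.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical check: assumes a valid k satisfies 1 <= k <= num_documents.
void validate_cluster_count(std::size_t k, std::size_t num_documents) {
    if (k == 0) {
        throw std::invalid_argument("k must be at least 1");
    }
    if (k > num_documents) {
        throw std::invalid_argument("k must not exceed the number of documents");
    }
}

// Test helper: true if the given k is rejected for num_documents documents.
bool rejects(std::size_t k, std::size_t num_documents) {
    try {
        validate_cluster_count(k, num_documents);
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}
```

A first test would then assert that `rejects(0, 10)` and `rejects(11, 10)` are true while `rejects(3, 10)` is false.
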

Thanks.