On 15 Aug 2016, at 22:05, Richhiey Thomas <richhiey.thomas at gmail.com>
wrote:
> I have tested this implementation of KMeans with a BBC news article
dataset.
You mentioned FIRE as another dataset to test with, but I don't think it has
any pre-clustered datasets. There may be something helpful in another dataset.
However if the BBC dataset can be fairly trivially clustered manually (which you
suggested it could) then that could be used to compute some external evaluation
measures. I don't know much about evaluating clustering, but maybe
Fowlkes-Mallows would be a good starting point?
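To make that concrete, here's a minimal sketch of computing the Fowlkes-Mallows index by pair counting; it doesn't use any Xapian types, it just assumes you have flat cluster-id assignments per document for both the manual ("gold") clustering and the one produced by KMeans:

    // Minimal sketch: Fowlkes-Mallows index from two flat label assignments
    // (gold[i] and predicted[i] are cluster ids for document i).
    #include <cmath>
    #include <cstddef>
    #include <vector>

    double fowlkes_mallows(const std::vector<int>& gold,
                           const std::vector<int>& predicted)
    {
        std::size_t n = gold.size();
        unsigned long long tp = 0, fp = 0, fn = 0;
        // Count pairs of documents that are together in the gold clustering,
        // the predicted clustering, or both.
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = i + 1; j < n; ++j) {
                bool same_gold = (gold[i] == gold[j]);
                bool same_pred = (predicted[i] == predicted[j]);
                if (same_gold && same_pred) ++tp;
                else if (!same_gold && same_pred) ++fp;
                else if (same_gold && !same_pred) ++fn;
            }
        }
        if (tp == 0) return 0.0;
        // FM = sqrt(pair precision * pair recall).
        return std::sqrt((double(tp) / (tp + fp)) * (double(tp) / (tp + fn)));
    }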
There are also some internal evaluations that might be worth looking at, like
the silhouette coefficient.
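Along the same lines, a rough sketch of the mean silhouette coefficient, assuming you've precomputed a pairwise distance matrix (e.g. cosine distance) and again have flat labels:

    // Rough sketch: mean silhouette coefficient over all documents, given a
    // precomputed pairwise distance matrix dist[i][j] and flat labels[i].
    // Assumes at least two clusters; singleton clusters contribute 0.
    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <map>
    #include <vector>

    double mean_silhouette(const std::vector<std::vector<double>>& dist,
                           const std::vector<int>& labels)
    {
        std::size_t n = labels.size();
        std::map<int, std::size_t> sizes;
        for (int l : labels) ++sizes[l];

        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            // Sum of distances from document i to each cluster.
            std::map<int, double> sum;
            for (std::size_t j = 0; j < n; ++j)
                if (j != i) sum[labels[j]] += dist[i][j];

            std::size_t own_size = sizes[labels[i]];
            if (own_size <= 1) continue;  // silhouette of a singleton taken as 0
            // a = mean distance to own cluster, b = mean distance to the
            // nearest other cluster.
            double a = sum[labels[i]] / (own_size - 1);
            double b = std::numeric_limits<double>::max();
            for (const auto& p : sum)
                if (p.first != labels[i])
                    b = std::min(b, p.second / sizes[p.first]);
            total += (b - a) / std::max(a, b);
        }
        return total / n;
    }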
In all cases, you need to compare against something for the numbers to mean
much. It's probably worth setting up a Google spreadsheet or similar to
track these: then you can build up different coefficients calculated for KMeans,
KM++ seeded KMeans, and round robin clustering to see how they behave.
(This becomes important in future when looking at clustering or pre-clustering
approaches that have threshold parameters, so we can evaluate performance as we
tweak the defaults.)
> When dealing with an MSet that contains around 200 - 300 documents,
clustering is efficient and usually gives sub-optimal clustering solutions
(average distance between document and centroid = 0.65 - 0.75 in cosine
similarity). For MSets containing more documents, the time seems to
increase steadily.
How long do 200-300 documents take to cluster? How does it grow as more
documents are included in the MSet? We'd expect an MSet of 1000 documents to
take longer to cluster than one with 100, but the important thing is _how_ the
time increases as the number of documents grows.
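A crude way to get those numbers is to time the clustering step over a range of MSet sizes and plot the results. Something like the sketch below, where cluster_mset() is just a stand-in for whatever the real entry point of your clustering code ends up being:

    // Rough timing harness: time the clustering step for increasing MSet
    // sizes and print the results, so the growth curve can be plotted.
    #include <chrono>
    #include <cstdio>
    #include <xapian.h>

    void cluster_mset(const Xapian::MSet& mset);  // stand-in for the real call

    void time_clustering(Xapian::Enquire& enquire)
    {
        for (Xapian::doccount size : {100, 200, 400, 800, 1600}) {
            Xapian::MSet mset = enquire.get_mset(0, size);
            auto start = std::chrono::steady_clock::now();
            cluster_mset(mset);
            auto stop = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                stop - start).count();
            std::printf("%u documents: %lld ms\n",
                        static_cast<unsigned>(mset.size()),
                        static_cast<long long>(ms));
        }
    }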
> KMeans++ results:
> KMeans++ provides slightly superior results due to spreading the cluster
centroids out. But this becomes a problem when the document vectors are not
spread uniformly and concentrate in a few places, because the cluster sizes end
up being out of proportion.
Surely that's the right behaviour for that kind of data? (Although AIUI
KMeans isn't supposed to be that good in that situation: is that what you mean?)
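For reference, the spreading behaviour comes straight from the D²-weighted seeding; a minimal sketch of it (dense vectors and a placeholder squared_distance(), purely for illustration, and assuming k <= docs.size()) looks like this:

    // Minimal sketch of KMeans++ seeding: each new centroid is chosen with
    // probability proportional to the squared distance from the nearest
    // centroid already picked, which is what spreads the seeds out.
    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    double squared_distance(const std::vector<double>& a,
                            const std::vector<double>& b);  // placeholder

    std::vector<std::size_t>
    kmeanspp_seeds(const std::vector<std::vector<double>>& docs,
                   std::size_t k, std::mt19937& rng)
    {
        std::vector<std::size_t> seeds;
        std::uniform_int_distribution<std::size_t> uniform(0, docs.size() - 1);
        seeds.push_back(uniform(rng));  // first seed chosen uniformly

        std::vector<double> d2(docs.size());
        while (seeds.size() < k) {
            // d2[i] = squared distance from doc i to its nearest seed so far.
            for (std::size_t i = 0; i < docs.size(); ++i) {
                double best = squared_distance(docs[i], docs[seeds[0]]);
                for (std::size_t s = 1; s < seeds.size(); ++s)
                    best = std::min(best,
                                    squared_distance(docs[i], docs[seeds[s]]));
                d2[i] = best;
            }
            // Sample the next seed with probability proportional to d2
            // (documents already chosen have d2 == 0, so can't be re-picked).
            std::discrete_distribution<std::size_t> pick(d2.begin(), d2.end());
            seeds.push_back(pick(rng));
        }
        return seeds;
    }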
> With respect to optimization, I'm currently using unordered_map wherever I
need to map values. It certainly works faster than std::map, but there are
various hashmap implementations that can work faster. Google dense hash map is
one of those, and it works way faster than unordered_map (to my knowledge). I
thought of using it to speed up value lookups, but would it affect the
portability of the code? Or will it be possible for me to use it?
There are a number of considerations in using other code or libraries:
1. What's the license? If it's incompatible with the GPL, then you can't
use it, period. Also, if it's code that we want to copy into the source tree
(rather than use as a library), under a license that is compatible with the GPL
but can't be relicensed under the MIT/X license, that's also a problem because
of the licensing direction we want to go in
(https://github.com/xapian/xapian/blob/master/xapian-core/HACKING#L1427).
2. If the idea is to copy code in, then we need to ensure that we respect
copyright notices as well as the license. We also generally don't want to
copy code in that is under active development, or we'll constantly be
updating our copy of it, which is both inefficient and inelegant.
3. If it's a library, then it needs to be easy to install on a wide range of
systems. If not, it may be worth considering whether we can use it optionally,
falling back to a slower STL-backed version when it isn't installed (see the
sketch after this list).
4. In all cases, there's a possible portability concern. Some projects will
give a good idea of their portability, either by stating their aims in that
respect, or with some sort of test matrix.
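As a sketch of the optional-dependency pattern from point 3 (HAVE_GOOGLE_DENSE_HASH_MAP is a hypothetical configure-generated macro here, and the header path may differ between sparsehash versions):

    // Use google::dense_hash_map when configure found it, otherwise fall
    // back to the STL. Note that dense_hash_map also requires
    // set_empty_key() before use, so in practice a thin wrapper may be
    // needed to make the two genuinely interchangeable.
    #include <string>
    #include <unordered_map>

    #ifdef HAVE_GOOGLE_DENSE_HASH_MAP
    # include <google/dense_hash_map>
    template<class K, class V>
    using TermWeightMap = google::dense_hash_map<K, V>;
    #else
    template<class K, class V>
    using TermWeightMap = std::unordered_map<K, V>;
    #endif

    // Callers just use TermWeightMap<std::string, double> and never need to
    // know which implementation is underneath.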
Before you dive in, I'd also suggest asking yourself explicitly what problem you are
trying to solve. There may be other ways of thinking about it that lead you to
different data structures or algorithms for working with them. It's not
always best to continually look for more efficient implementations of the same
data structure.
For instance, even with sparse vectors, it may be that compact storage is more
CPU efficient without taking too much memory. Maybe there's some dimensionality
reduction work that can be done before or while creating the document vectors
(eliminating non-discriminative terms, or preprocessing the database to find
terms that are close to linearly dependent, for instance; there is a wealth of
options to look at here, though: I know that a previous GSoC project on
clustering looked at dimensionality reduction in optimising the previous
clustering branch).
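As an illustration of the compact-storage point (nothing Xapian-specific, just a sorted vector of (term id, weight) pairs with a merge-style dot product):

    // Compact sparse-vector storage: a sorted vector of (term id, weight)
    // pairs is contiguous in memory, so a merge-style dot product tends to
    // be more cache-friendly than walking a node-based map.
    #include <cstddef>
    #include <utility>
    #include <vector>

    typedef std::vector<std::pair<unsigned, double>> SparseVector;  // sorted by term id

    double dot(const SparseVector& a, const SparseVector& b)
    {
        double result = 0.0;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i].first < b[j].first) {
                ++i;
            } else if (b[j].first < a[i].first) {
                ++j;
            } else {
                result += a[i].second * b[j].second;
                ++i;
                ++j;
            }
        }
        return result;
    }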
> I would like to know what documentation I would need to write for the
current clustering API so that I can start with that too,
The classes need documenting in the header files, and you're partway there
already: at least every non-private member should be fully documented, and every
class should be given an overall documentation comment that explains its usage.
Anything which isn't part of the public API should not be in the external
header file, but still needs documenting so those working on the library in
future know what they're doing.
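For example, something along these lines (the class and member names here are placeholders, not the actual clustering API):

    // Illustrative Doxygen-style documentation: every public member gets a
    // comment, and the class gets an overall comment explaining its usage.
    #include <xapian.h>

    class ClusterSet;  // placeholder for whatever the API returns

    /** Cluster the documents of an MSet with the KMeans algorithm.
     *
     *  Typical use: construct with the desired number of clusters, then call
     *  cluster() on the MSet returned by Xapian::Enquire::get_mset().
     */
    class KMeansClusterer {
      public:
        /** Constructor.
         *
         *  @param k  Number of clusters to produce.
         */
        explicit KMeansClusterer(unsigned k);

        /** Cluster the documents in @a mset.
         *
         *  @param mset  The MSet whose documents are to be clustered.
         *  @return  The resulting clusters.
         */
        ClusterSet cluster(const Xapian::MSet& mset);
    };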
Then we need an introduction to using the clustering system in the user guide
(https://getting-started-with-xapian.readthedocs.io/en/latest/). That should be
a new document in the "how to" part of the guide.
> Also, since the final week for GSoC is starting now, should I list the API
that was merged into a different branch before as merged or as yet to merge? And
in case I am not able to complete PSO by 23rd August since I am behind the
timeline, would it be possible for me to continue it later on?
You should point to the work that was merged, and additionally to any pull
requests on top of that. As we merge those pull requests, it will still be
obvious to anyone used to GitHub what's going on.
I think in practice you should concentrate on everything other than PSO:
finishing the optimisations, writing the documentation, getting the PR in the
best possible shape ready for merge, and writing up the evaluations. You can of
course continue to work on PSO (or anything else!) after GSoC is ended.
We're always happy when former GSoC students stay a part of Xapian, and
it's one of our goals in doing it :-)
J