On 15 Aug 2016, at 22:05, Richhiey Thomas <richhiey.thomas at gmail.com>
wrote:
> I have tested this implementation of KMeans with a BBC news article
dataset.
You mentioned FIRE as another dataset to test with, but I don't think it has
any pre-clustered datasets. There may be something helpful in another dataset.
However if the BBC dataset can be fairly trivially clustered manually (which you
suggested it could) then that could be used to compute some external evaluation
measures. I don't know much about evaluating clustering, but maybe
Fowlkes-Mallows would be a good starting point?
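To make that concrete, here's a minimal sketch of computing the Fowlkes-Mallows index by pair counting; it doesn't use any Xapian types, it just assumes you have flat cluster-id assignments per document for both the manual ("gold") clustering and the one produced by KMeans:

    // Minimal sketch: Fowlkes-Mallows index from two flat label assignments
    // (gold[i] and predicted[i] are cluster ids for document i).
    #include <cmath>
    #include <cstddef>
    #include <vector>

    double fowlkes_mallows(const std::vector<int>& gold,
                           const std::vector<int>& predicted)
    {
        std::size_t n = gold.size();
        unsigned long long tp = 0, fp = 0, fn = 0;
        // Count pairs of documents that are together in the gold clustering,
        // the predicted clustering, or both.
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = i + 1; j < n; ++j) {
                bool same_gold = (gold[i] == gold[j]);
                bool same_pred = (predicted[i] == predicted[j]);
                if (same_gold && same_pred) ++tp;
                else if (!same_gold && same_pred) ++fp;
                else if (same_gold && !same_pred) ++fn;
            }
        }
        if (tp == 0) return 0.0;
        // FM = sqrt(pair precision * pair recall).
        return std::sqrt((double(tp) / (tp + fp)) * (double(tp) / (tp + fn)));
    }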
There are also some internal evaluations that might be worth looking at, like
the silhouette coefficient.
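Along the same lines, a rough sketch of the mean silhouette coefficient, assuming you've precomputed a pairwise distance matrix (e.g. cosine distance) and again have flat labels:

    // Rough sketch: mean silhouette coefficient over all documents, given a
    // precomputed pairwise distance matrix dist[i][j] and flat labels[i].
    // Assumes at least two clusters; singleton clusters contribute 0.
    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <map>
    #include <vector>

    double mean_silhouette(const std::vector<std::vector<double>>& dist,
                           const std::vector<int>& labels)
    {
        std::size_t n = labels.size();
        std::map<int, std::size_t> sizes;
        for (int l : labels) ++sizes[l];

        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            // Sum of distances from document i to each cluster.
            std::map<int, double> sum;
            for (std::size_t j = 0; j < n; ++j)
                if (j != i) sum[labels[j]] += dist[i][j];

            std::size_t own_size = sizes[labels[i]];
            if (own_size <= 1) continue;  // silhouette of a singleton taken as 0
            // a = mean distance to own cluster, b = mean distance to the
            // nearest other cluster.
            double a = sum[labels[i]] / (own_size - 1);
            double b = std::numeric_limits<double>::max();
            for (const auto& p : sum)
                if (p.first != labels[i])
                    b = std::min(b, p.second / sizes[p.first]);
            total += (b - a) / std::max(a, b);
        }
        return total / n;
    }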
In all cases, you need to compare against something for the numbers to mean
much. It's probably worth setting up a Google spreadsheet or similar to
track these: then you can build up different coefficients calculated for KMeans,
KM++ seeded KMeans, and round robin clustering to see how they behave.
(This becomes important in future when looking at clustering or pre-clustering
approaches that have threshold parameters, so we can evaluate performance as we
tweak the defaults.)
> When dealing with an MSet that contains around 200 - 300 documents,
clustering is efficient and usually gives sub-optimal clustering solutions
(average distance between document and centroid = 0.65 - 0.75 in cosine
similarity). For MSets containing more documents, the time seems to
increase steadily.
How long do 200-300 documents take to cluster? How does it grow as more
documents are included in the MSet? We'd expect an MSet of 1000 documents to
take longer to cluster than one with 100, but the important thing is _how_ the
time increases as the number of documents grows.
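A crude way to get those numbers is to time the clustering step over a range of MSet sizes and plot the results. Something like the sketch below, where cluster_mset() is just a stand-in for whatever the real entry point of your clustering code ends up being:

    // Rough timing harness: time the clustering step for increasing MSet
    // sizes and print the results, so the growth curve can be plotted.
    #include <chrono>
    #include <cstdio>
    #include <xapian.h>

    void cluster_mset(const Xapian::MSet& mset);  // stand-in for the real call

    void time_clustering(Xapian::Enquire& enquire)
    {
        for (Xapian::doccount size : {100, 200, 400, 800, 1600}) {
            Xapian::MSet mset = enquire.get_mset(0, size);
            auto start = std::chrono::steady_clock::now();
            cluster_mset(mset);
            auto stop = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                stop - start).count();
            std::printf("%u documents: %lld ms\n",
                        static_cast<unsigned>(mset.size()),
                        static_cast<long long>(ms));
        }
    }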
> KMeans++ results:
> KMeans++ provides slightly superior results due to spreading the cluster
centroids out. But this becomes a problem when the document vectors are not
spread uniformly and concentrate in a few places, because the cluster sizes end
up being out of proportion.
Surely that's the right behaviour for that kind of data? (Although AIUI
KMeans isn't supposed to be that good in that situation: is that what you mean?)
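For reference, the spreading behaviour comes straight from the D²-weighted seeding; a minimal sketch of it (dense vectors and a placeholder squared_distance(), purely for illustration, and assuming k <= docs.size()) looks like this:

    // Minimal sketch of KMeans++ seeding: each new centroid is chosen with
    // probability proportional to the squared distance from the nearest
    // centroid already picked, which is what spreads the seeds out.
    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    double squared_distance(const std::vector<double>& a,
                            const std::vector<double>& b);  // placeholder

    std::vector<std::size_t>
    kmeanspp_seeds(const std::vector<std::vector<double>>& docs,
                   std::size_t k, std::mt19937& rng)
    {
        std::vector<std::size_t> seeds;
        std::uniform_int_distribution<std::size_t> uniform(0, docs.size() - 1);
        seeds.push_back(uniform(rng));  // first seed chosen uniformly

        std::vector<double> d2(docs.size());
        while (seeds.size() < k) {
            // d2[i] = squared distance from doc i to its nearest seed so far.
            for (std::size_t i = 0; i < docs.size(); ++i) {
                double best = squared_distance(docs[i], docs[seeds[0]]);
                for (std::size_t s = 1; s < seeds.size(); ++s)
                    best = std::min(best,
                                    squared_distance(docs[i], docs[seeds[s]]));
                d2[i] = best;
            }
            // Sample the next seed with probability proportional to d2
            // (documents already chosen have d2 == 0, so can't be re-picked).
            std::discrete_distribution<std::size_t> pick(d2.begin(), d2.end());
            seeds.push_back(pick(rng));
        }
        return seeds;
    }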
> With respect to optimization, I'm currently using unordered_map wherever I
need to map values. It certainly works faster than std::map, but there are
various hashmap implementations that can work faster. Google dense hash map is
one of those, and it works way faster than unordered_map (to my knowledge). I
thought of using it to speed up value lookups, but would it affect the
portability of the code? Or will it be possible for me to use it?
There are a number of considerations in using other code or libraries:
1. What's the license? If it's incompatible with the GPL, then you can't
use it, period. Also, if it's code that we want to copy into the source tree
(rather than use as a library), under a license that is compatible with the GPL
but can't be relicensed under the MIT/X license, that's also a problem because
of the licensing direction we want to go in
(https://github.com/xapian/xapian/blob/master/xapian-core/HACKING#L1427).
2. If the idea is to copy code in, then we need to ensure that we respect
copyright notices as well as the license. We also generally don't want to
copy code in that is under active development, or we'll constantly be
updating our copy of it, which is both inefficient and inelegant.
3. If it's a library, then it needs to be easy to install on a wide range of
systems. If not, it may be worth considering whether we can use it optionally,
falling back to a slower STL-backed version when it isn't installed (see the
sketch after this list).
4. In all cases, there's a possible portability concern. Some projects will
give a good idea of their portability, either by stating their aims in that
respect, or with some sort of test matrix.
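As a sketch of the optional-dependency pattern from point 3 (HAVE_GOOGLE_DENSE_HASH_MAP is a hypothetical configure-generated macro here, and the header path may differ between sparsehash versions):

    // Use google::dense_hash_map when configure found it, otherwise fall
    // back to the STL. Note that dense_hash_map also requires
    // set_empty_key() before use, so in practice a thin wrapper may be
    // needed to make the two genuinely interchangeable.
    #include <string>
    #include <unordered_map>

    #ifdef HAVE_GOOGLE_DENSE_HASH_MAP
    # include <google/dense_hash_map>
    template<class K, class V>
    using TermWeightMap = google::dense_hash_map<K, V>;
    #else
    template<class K, class V>
    using TermWeightMap = std::unordered_map<K, V>;
    #endif

    // Callers just use TermWeightMap<std::string, double> and never need to
    // know which implementation is underneath.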
Before you dive in, I'd also suggest asking yourself explicitly what problem you are
trying to solve. There may be other ways of thinking about it that lead you to
different data structures or algorithms for working with them. It's not
always best to continually look for more efficient implementations of the same
data structure.
For instance, even with sparse vectors, it may be that compact storage is more
CPU efficient without taking too much memory. Maybe there's some dimensionality
reduction work that can be done before or while creating the document vectors
(eliminating non-discriminative terms, or preprocessing the database to find
terms that are close to linearly dependent, for instance; there is a wealth of
options to look at here, though: I know that a previous GSoC project on
clustering looked at dimensionality reduction in optimising the previous
clustering branch).
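As an illustration of the compact-storage point (nothing Xapian-specific, just a sorted vector of (term id, weight) pairs with a merge-style dot product):

    // Compact sparse-vector storage: a sorted vector of (term id, weight)
    // pairs is contiguous in memory, so a merge-style dot product tends to
    // be more cache-friendly than walking a node-based map.
    #include <cstddef>
    #include <utility>
    #include <vector>

    typedef std::vector<std::pair<unsigned, double>> SparseVector;  // sorted by term id

    double dot(const SparseVector& a, const SparseVector& b)
    {
        double result = 0.0;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i].first < b[j].first) {
                ++i;
            } else if (b[j].first < a[i].first) {
                ++j;
            } else {
                result += a[i].second * b[j].second;
                ++i;
                ++j;
            }
        }
        return result;
    }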
> I would like to know what documentation I would need to write for the
current clustering API so that I can start with that too,
The classes need documenting in the header files, and you're partway there
already: at least every non-private member should be fully documented, and every
class should be given an overall documentation comment that explains its usage.
Anything which isn't part of the public API should not be in the external
header file, but still needs documenting so those working on the library in
future know what they're doing.
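For example, something along these lines (the class and member names here are placeholders, not the actual clustering API):

    // Illustrative Doxygen-style documentation: every public member gets a
    // comment, and the class gets an overall comment explaining its usage.
    #include <xapian.h>

    class ClusterSet;  // placeholder for whatever the API returns

    /** Cluster the documents of an MSet with the KMeans algorithm.
     *
     *  Typical use: construct with the desired number of clusters, then call
     *  cluster() on the MSet returned by Xapian::Enquire::get_mset().
     */
    class KMeansClusterer {
      public:
        /** Constructor.
         *
         *  @param k  Number of clusters to produce.
         */
        explicit KMeansClusterer(unsigned k);

        /** Cluster the documents in @a mset.
         *
         *  @param mset  The MSet whose documents are to be clustered.
         *  @return  The resulting clusters.
         */
        ClusterSet cluster(const Xapian::MSet& mset);
    };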
Then we need an introduction to using the clustering system in the user guide
(https://getting-started-with-xapian.readthedocs.io/en/latest/). That should be
a new document in the "how to" part of the guide.
> Also, since the final week for GSoC is starting now, should I list the API
that was merged into a different branch before as merged or as yet to merge? And
in case I am not able to complete PSO by 23rd August since I am behind the
timeline, would it be possible for me to continue it later on?
You should point to the work that was merged, and additionally to any pull
requests on top of that. As we merge those pull requests, it will still be
obvious to anyone used to GitHub what's going on.
I think in practice you should concentrate on everything other than PSO:
finishing the optimisations, writing the documentation, getting the PR in the
best possible shape ready for merge, and writing up the evaluations. You can of
course continue to work on PSO (or anything else!) after GSoC is ended.
We're always happy when former GSoC students stay a part of Xapian, and
it's one of our goals in doing it :-)
J