Some notes on the evaluation of clustering.
It might be difficult to get hold of an adhoc retrieval dataset with
clustering metadata. So we should test it on clustering dataset.
There are a few publicly available, popular ones are
- 20-newsgroup: http://qwone.com/~jason/20Newsgroups/
- Reuters 21578:
https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
The evaluation framework can be to query each document and retrieve a set
of documents from the corpus and measure Purity, NMI and Rand Index from
the clusters.
P.
On Fri, Aug 19, 2016 at 2:28 PM, James Aylett <james-xapian at
tartarus.org>
wrote:
> On 18 Aug 2016, at 23:59, Richhiey Thomas <richhiey.thomas at
gmail.com>
> wrote:
>
> > I've currently added a few classes which don't really belong
to the
> public API (currently) into private headers and used PIMPL with the Cluster
> class.
>
> I'm having difficulty reading your changes, because you aren't
keeping to
> one complete change per commit. So for instance you've added a new
> Cluster::Internal class in one commit, but started using it in a different
> commit in which you also make a number of other changes. This makes it much
> harder to see whether there are problems in what you've done.
(Similarly,
> you add new similarity classes in one commit, and in the jumbo commit
> remove them from where they came from and start using the new ones,
> possibly.)
>
> You also still haven't integrated Clusterer fully; RoundRobin
doesn't
> inherit from it. There may be other similar issues left unaddressed, but I
> don't have time right now to work through the jumbo commit and match it
> against the work in the other commits.
>
> > The PR failure is because of the old tests which I had written for
> testing the old API. I'll have to write completely new tests because
the
> API has changed dramatically after mid terms. I'll get that fixed soon
by
> writing a few test cases for the newly implemented functionality.
>
> It's also because cluster.cc won't compile under clang because of
an
> equality / assignment error, and under gcc because there are warnings about
> members shadowed by local variables. You should have got one or the other
> while compiling yourself; what version or which compiler are you using?
>
> > Currently, the main classes which have data to hide are Cluster,
> ClusterSet and Clusterer subclass (currently KMeans). Thus, if we can use
> PIMPL with these classes, it could hide quite a lot of the non-public data.
> As a plus point, I'm having problems with shifting PointType, Point and
> Centroid classes to a private header because of forward referencing
> problems, so even these problems can be solved if KMeans is to go with
> PIMPL.
>
> As I think I've said before, you should concentrate on getting the PR
> building and in good shape before going further with PIMPL or similar. An
> API that has extra bits but works and is tested is much better than a
> tidier API that isn't feature complete.
>
> You have actually tidied up the KMeans API considerably just by moving
> some internal methods out of public visibility. Although in future we may
> be able to hide them completely in a private implementation, it's
neater
> now and more obvious how to use the class.
>
> > According to me, it doesn't even affect the interface that the
Clusterer
> provides because its just an interface. We can still hook up newer
> algorithms, which may not want to use PIMPL later. (I maybe wrong here)
>
> Because Clusterer doesn't use PIMPL, users are free to subclass it
freely.
> (And use PIMPL or not.) Olly's concern on IRC was that using PIMPL for
> KMeans would prevent subclassing that. At the moment that's probably
moot,
> because it's not currently structured to make subclassing it practical
> anyway.
>
> Even once you have a green PR that's ready to merge and are looking for
> further things to work on, I wouldn't worry too much about removing
> PointType, Point & Centroid from the public API. I have a feeling that
when
> we look at further ways of initialising the KMeans centroids we'll want
to
> make that a public API rather than a selector (mode="random",
> mode="kmeanspp") as it is now. That will involve exposing some
part of that
> set of classes to make that API. (There are a number of ways of
> initialising them in the literature already, and it would be nice to allow
> users to experiment with their own.)
>
> J
>
> --
> James Aylett
> devfort.com ? spacelog.org ? tartarus.org/james/
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20160819/c52eedb9/attachment-0001.html>