thr3ads.net - Xapian devel - KMeans - Evaluation Results [Aug 2016]

If this information is useful, please help other people find it:
Share via:

Ayush Tomar

2016-Aug-18 20:07 UTC

KMeans - Evaluation Results

>
>
>
> Actually, you're doing something slightly unusual there: making the
> internal member public. Protected would be better, and private is I think
> most usual; library clients aren't going to have access to the Internal
> class declaration, so they can't call things on it. This means it's
> actually difficult right now to subclass Feature.
>
> I don't think we should worry about this with LTR right now, but I
suspect
> a better approach is to pass the FeatureList (as a helper) to
> Feature::get_values(), or set it when we add the Feature to FeatureList,
> and use that to ask for things like the current query, current document,
> normalised lengths &c.
>
I see. What I've put in Feature::Internal can actually be part of
FeatureList class (maybe as FeatureList::Internal). For the time being,
I'll at least make the internal member in Feature as protected, and we can
take this up later when we are getting rid of Letor class and putting
everything under Ranker.
>
> It's possible that we could also do the thing Weight does, and have
each
> Feature report which statistics it needs, so FeatureList can calculate them
> each once in advance before calling get_values() on all its Feature
> objects. (Weight objects I think get the statistics poked onto them, but
>
This is a good idea.
(words after "but" are somehow missing, could you please resend?)

>
> J
>
> --Ayush
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20160818/95532217/attachment.html>

Richhiey Thomas

2016-Aug-18 22:59 UTC

head link

KMeans - Evaluation Results

Hello,

Ayush, thanks for that link to your code. It helped me understand how I
could implement things much quicker!

I've currently added a few classes which don't really belong to the
public
API (currently) into private headers and used PIMPL with the Cluster class.

The PR failure is because of the old tests which I had written for testing
the old API. I'll have to write completely new tests because the API has
changed dramatically after mid terms. I'll get that fixed soon by writing a
few test cases for the newly implemented functionality.

Currently, the main classes which have data to hide are Cluster, ClusterSet
and Clusterer subclass (currently KMeans). Thus, if we can use PIMPL with
these classes, it could hide quite a lot of the non-public data. As a plus
point, I'm having problems with shifting PointType, Point and Centroid
classes to a private header because of forward referencing problems, so
even these problems can be solved if KMeans is to go with PIMPL.

According to me, it doesn't even affect the interface that the Clusterer
provides because its just an interface. We can still hook up newer
algorithms, which may not want to use PIMPL later. (I maybe wrong here)

Current update - I'll have the PR in good shape by tomorrow by fixing the
travis build and adding in a little more PIMPL.

Thanks.

Regards,
Richhiey
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20160819/3c0a3889/attachment.html>

James Aylett

2016-Aug-19 08:17 UTC

head link

KMeans - Evaluation Results

On 18 Aug 2016, at 21:07, Ayush Tomar <ayushtomar at gmail.com> wrote:
>> It's possible that we could also do the thing Weight does, and have
each Feature report which statistics it needs, so FeatureList can calculate them
each once in advance before calling get_values() on all its Feature objects.
(Weight objects I think get the statistics poked onto them, but
> 
> This is a good idea.
> (words after "but" are somehow missing, could you please resend?)
Not sure I ever wrote them, and can't remember now what I wanted to say. Not
sure there is a "but"; that approach should be considered too.

J

-- 
 James Aylett, +44 7976 212023
 Principal Consultant, Kittens with Jetpacks Ltd

 Registered company no 6587395
 19 Cottrell Court, Southern Way, London SE10 0DW

James Aylett

2016-Aug-19 08:58 UTC

head link

KMeans - Evaluation Results

On 18 Aug 2016, at 23:59, Richhiey Thomas <richhiey.thomas at gmail.com>
wrote:
> I've currently added a few classes which don't really belong to the
public API (currently) into private headers and used PIMPL with the Cluster
class.
I'm having difficulty reading your changes, because you aren't keeping
to one complete change per commit. So for instance you've added a new
Cluster::Internal class in one commit, but started using it in a different
commit in which you also make a number of other changes. This makes it much
harder to see whether there are problems in what you've done. (Similarly,
you add new similarity classes in one commit, and in the jumbo commit remove
them from where they came from and start using the new ones, possibly.)

You also still haven't integrated Clusterer fully; RoundRobin doesn't
inherit from it. There may be other similar issues left unaddressed, but I
don't have time right now to work through the jumbo commit and match it
against the work in the other commits.
> The PR failure is because of the old tests which I had written for testing
the old API. I'll have to write completely new tests because the API has
changed dramatically after mid terms. I'll get that fixed soon by writing a
few test cases for the newly implemented functionality.
It's also because cluster.cc won't compile under clang because of an
equality / assignment error, and under gcc because there are warnings about
members shadowed by local variables. You should have got one or the other while
compiling yourself; what version or which compiler are you using?
> Currently, the main classes which have data to hide are Cluster, ClusterSet
and Clusterer subclass (currently KMeans). Thus, if we can use PIMPL with these
classes, it could hide quite a lot of the non-public data. As a plus point,
I'm having problems with shifting PointType, Point and Centroid classes to a
private header because of forward referencing problems, so even these problems
can be solved if KMeans is to go with PIMPL.
As I think I've said before, you should concentrate on getting the PR
building and in good shape before going further with PIMPL or similar. An API
that has extra bits but works and is tested is much better than a tidier API
that isn't feature complete.

You have actually tidied up the KMeans API considerably just by moving some
internal methods out of public visibility. Although in future we may be able to
hide them completely in a private implementation, it's neater now and more
obvious how to use the class.
> According to me, it doesn't even affect the interface that the
Clusterer provides because its just an interface. We can still hook up newer
algorithms, which may not want to use PIMPL later. (I maybe wrong here)
Because Clusterer doesn't use PIMPL, users are free to subclass it freely.
(And use PIMPL or not.) Olly's concern on IRC was that using PIMPL for
KMeans would prevent subclassing that. At the moment that's probably moot,
because it's not currently structured to make subclassing it practical
anyway.

Even once you have a green PR that's ready to merge and are looking for
further things to work on, I wouldn't worry too much about removing
PointType, Point & Centroid from the public API. I have a feeling that when
we look at further ways of initialising the KMeans centroids we'll want to
make that a public API rather than a selector (mode="random",
mode="kmeanspp") as it is now. That will involve exposing some part of
that set of classes to make that API. (There are a number of ways of
initialising them in the literature already, and it would be nice to allow users
to experiment with their own.)

J

-- 
 James Aylett
 devfort.com ? spacelog.org ? tartarus.org/james/

Seemingly Similar Threads

Search for more reasonably related threads

Xapian devel - Aug 2016 - KMeans - Evaluation Results

KMeans - Evaluation Results

KMeans - Evaluation Results

KMeans - Evaluation Results

KMeans - Evaluation Results

Seemingly Similar Threads