thr3ads.net - Xapian devel - KMeans - Going forward [Jul 2017]

Richhiey Thomas

2017-Jul-23 19:50 UTC

KMeans - Going forward

Hello,

Work on stopword removal and stemming is almost finished, and the run time
for KMeans is coming down (around 0.15 s for 100 documents, increasing to
around 1.2 s with 500 documents and 2.5 s with 1000 documents). I tried this
out on the BBC datasets with k=5, since there are 5 categories in the
dataset.
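
For a rough sense of how these times scale, the reported figures can be
divided by the document count. The snippet below is only arithmetic on the
three numbers quoted above; the helper name is made up for the illustration,
and three data points are too few to infer a complexity class.

```python
# Run times reported in the message above: documents -> seconds.
reported = {100: 0.15, 500: 1.2, 1000: 2.5}


def ms_per_document(n_docs, seconds):
    """Average clustering time per document, in milliseconds."""
    return seconds / n_docs * 1000.0


for n_docs, seconds in reported.items():
    print(f"{n_docs:>5} documents: {ms_per_document(n_docs, seconds):.2f} ms each")
# prints 1.50, 2.40 and 2.50 ms for 100, 500 and 1000 documents
```

The per-document cost rises between 100 and 500 documents and is then nearly
flat up to 1000, which suggests roughly linear growth over that range.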

Going forward, the next step to optimize KMeans is to use the faster
version of KMeans developed by Charles Elkan, which reduces the number of
distance computations. For this, I will give the user an option to specify
in the constructor whether they want the standard algorithm or Elkan's
algorithm, and write a method within KMeans to implement the triangle
inequality optimization. I will also be moving RoundRobin to the
testsuite.
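
To illustrate the kind of saving the triangle inequality gives, here is a
minimal Python sketch of one assignment step. It is not Xapian's KMeans code,
and it covers only the centre-to-centre pruning rule from Elkan's method (if
d(c, c') >= 2 d(x, c), then c' cannot be closer to x than c), not the
per-point upper and lower bounds that the full algorithm carries between
iterations. The function names, the 2-D toy data and the distance counter are
all assumptions made for the illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_naive(points, centres):
    """Standard assignment: every point is compared with every centre."""
    labels, computed = [], 0
    for p in points:
        dists = [euclidean(p, c) for c in centres]
        computed += len(centres)
        labels.append(dists.index(min(dists)))
    return labels, computed


def assign_pruned(points, centres):
    """Assignment that skips centres ruled out by the triangle inequality."""
    k = len(centres)
    computed = 0
    # Centre-to-centre distances, computed once per iteration.
    cc = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            cc[i][j] = cc[j][i] = euclidean(centres[i], centres[j])
            computed += 1
    labels = []
    for p in points:
        best, best_d = 0, euclidean(p, centres[0])
        computed += 1
        for j in range(1, k):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best), so when
            # d(c_best, c_j) >= 2 * d(p, c_best), c_j cannot be closer.
            if cc[best][j] >= 2.0 * best_d:
                continue
            d = euclidean(p, centres[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


def make_toy_data(per_cluster=100, seed=42):
    """Hypothetical 2-D data: Gaussian blobs around five fixed centres."""
    rng = random.Random(seed)
    centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 20.0)]
    points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
              for cx, cy in centres for _ in range(per_cluster)]
    return points, centres


if __name__ == "__main__":
    points, centres = make_toy_data()
    naive_labels, naive_count = assign_naive(points, centres)
    pruned_labels, pruned_count = assign_pruned(points, centres)
    print("same assignment:", naive_labels == pruned_labels)
    print("distance computations:", naive_count, "vs", pruned_count)
```

On well-separated toy data like this, most far-away centres are skipped. How
much that helps on real document vectors, which are high-dimensional and
sparse, would need measuring.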

Thanks.

James Aylett

2017-Jul-23 20:43 UTC

KMeans - Going forward

On 23 Jul 2017, at 20:50, Richhiey Thomas <richhiey.thomas at gmail.com>
wrote:
> Work on stopword removal and stemming is almost finished, and the run time
> for KMeans is coming down (around 0.15 s for 100 documents, increasing to
> around 1.2 s with 500 documents and 2.5 s with 1000 documents). I tried this
> out on the BBC datasets with k=5, since there are 5 categories in the
> dataset.
>
> Going forward, the next step to optimize KMeans is to use the faster
> version of KMeans developed by Charles Elkan, which reduces the number of
> distance computations. For this, I will give the user an option to specify
> in the constructor whether they want the standard algorithm or Elkan's
> algorithm, and write a method within KMeans to implement the triangle
> inequality optimization. I will also be moving RoundRobin to the
> testsuite.

Which of the Elkan algorithm and triangle inequality do you expect to have a
bigger impact on the runtime? Because it'd be great to do that one first.

(RoundRobin you should move in its own small PR.)

J

-- 
 James Aylett
 devfort.com — spacelog.org — tartarus.org/james/

Olly Betts

2017-Jul-24 01:26 UTC

KMeans - Going forward

On Sun, Jul 23, 2017 at 03:50:51PM -0400, Richhiey Thomas wrote:
> Going forward, the next step to optimize KMeans is to use the faster
> version of KMeans developed by Charles Elkan, which reduces the number of
> distance computations. For this, I will give the user an option to specify
> in the constructor whether they want the standard algorithm or Elkan's
> algorithm.

Is having a choice actually useful to the API user?

Unless the optimised algorithm is sometimes slower (and in a way that
you can usefully predict up front), or uses significantly more memory,
then this just seems like pointless clutter in the API and extra code in
the library.

Cheers,
    Olly
