Hello,

Work on stopword removal and stemming is almost finished, and the run time for KMeans is coming down: around 0.15 s for 100 documents, rising to around 1.2 s for 500 documents and 2.5 s for 1000 documents. I tried this out on the BBC datasets with k=5, since there were 5 categories in the dataset.

Going forward, the next step in optimizing KMeans is to use the faster version developed by Charles Elkan, which reduces the number of distance computations. For this, I will give the user an option in the constructor to choose between the standard algorithm and Elkan's algorithm, and write a method within KMeans to implement the triangle inequality optimization. I will also be moving RoundRobin to the testsuite.

Thanks.
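For readers unfamiliar with the triangle inequality optimization mentioned above, here is a minimal Python sketch of the core pruning idea, applied to a single assignment step. It is not Xapian code and not the full Elkan algorithm, which also carries upper and lower distance bounds from one iteration to the next. The function names (`dist`, `assign_naive`, `assign_pruned`, `demo`) and the synthetic 2-D data are assumptions made purely for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_naive(points, centers):
    """Standard assignment step: compare every point with every center."""
    labels, computed = [], 0
    for p in points:
        ds = [dist(p, c) for c in centers]
        computed += len(centers)
        labels.append(ds.index(min(ds)))
    return labels, computed


def assign_pruned(points, centers):
    """Assignment step that skips centers ruled out by the triangle inequality.

    If d(c_best, c_j) >= 2 * d(p, c_best), then d(p, c_j) >= d(p, c_best),
    so c_j cannot be strictly closer and d(p, c_j) need not be computed.
    """
    k = len(centers)
    # Center-to-center distances: k * k extra computations per iteration,
    # not included in the point-to-center count returned below.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, computed = [], 0
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        computed += 1
        for j in range(1, k):
            if cc[best][j] >= 2 * best_d:
                continue  # pruned: c_j cannot beat the current best
            d = dist(p, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


def demo(seed=0, per_cluster=20):
    """Run both assignment steps on synthetic, well-separated clusters."""
    rng = random.Random(seed)
    centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 20.0]]
    points = [[c[0] + rng.gauss(0, 1), c[1] + rng.gauss(0, 1)]
              for c in centers for _ in range(per_cluster)]
    return assign_naive(points, centers), assign_pruned(points, centers)
```

Both functions should return the same labels. Only the number of point-to-center distance computations differs, and the saving grows as the clusters become better separated.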
On 23 Jul 2017, at 20:50, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> Now work on stopword removal and stemming is almost ending and the run time for KMeans seem to be getting lesser (around 0.15 s for 100 documents and this increases to around 1.2 s with 500 documents and 2.5 s with 1000 documents). I tried this out on the BBC datasets available with a value k=5, since there were 5 categories in the dataset.
>
> Going forward, the next step to optimize KMeans is to use the faster optimized version of KMeans which reduces distance computations developed by Charles Elkan. For this, I will be providing the user an option to specify with the constructor whether they would want the standard algorithm or Elkans algorithm. and write a method within KMeans to implement the triangle inequality optmization. I will also be moving RoundRobin to the testsuite.

Which of the Elkan algorithm and triangle inequality do you expect to have a bigger impact on the runtime? Because it'd be great to do that one first.

(RoundRobin you should move in its own small PR.)

J

-- 
James Aylett
devfort.com — spacelog.org — tartarus.org/james/
On Sun, Jul 23, 2017 at 03:50:51PM -0400, Richhiey Thomas wrote:
> Going forward, the next step to optimize KMeans is to use the faster
> optimized version of KMeans which reduces distance computations developed
> by Charles Elkan. For this, I will be providing the user an option to
> specify with the constructor whether they would want the standard algorithm
> or Elkans algorithm.

Is having a choice actually useful to the API user? Unless the optimised algorithm is sometimes slower (and in a way that you can usefully predict up front), or uses significantly more memory, then this just seems like pointless clutter in the API and extra code in the library.

Cheers,
    Olly
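On the memory question, a rough back-of-the-envelope sketch may help. It assumes the usual bookkeeping for Elkan's algorithm: one lower bound per (point, center) pair, one upper bound per point, and a k-by-k table of center-to-center distances. The helper name `elkan_extra_floats` is hypothetical, and the count ignores any per-point flags or implementation overhead.

```python
def elkan_extra_floats(n_points, k):
    """Rough count of extra floating-point values kept by Elkan-style bounds.

    Assumes one lower bound per (point, center) pair, one upper bound per
    point, and a full k x k table of center-to-center distances.
    """
    lower_bounds = n_points * k
    upper_bounds = n_points
    center_distances = k * k
    return lower_bounds + upper_bounds + center_distances


# For the 1000-document, k=5 run mentioned in this thread:
print(elkan_extra_floats(1000, 5))  # 6025
```

The extra storage grows as n * k. Whether that counts as significant depends on how large the document vectors themselves are.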