Hello,
I found a severe bug in the function kmeanspp (k-means++ implementation) in the
LICORS package
(https://www.rdocumentation.org/packages/LICORS/versions/0.2.0/topics/kmeanspp)
and need some advice on how to handle this (therefore posting here).
?
Since LICORS has not been updated since 2013, I am not sure if there is an
active maintainer. The named maintainer works for Google since 2014 and may have
abandoned the package.
?
On the other?hand, this is one of the few implementations of k-means++ in R and
coming up first when searching.
?
The bug leads to inferior results. In its current form, the results were much
worse than those of Hartigan-Wong (the default for k-means in the stats package)
for all test problems I tried. However, after fixing the bug, kmeanspp found
better results than Hartigan-Wong for all those problems. So anyone comparing
those two algorithms based on the current implementations in R may have come to
the wrong conclusions. BTW: The Hartigan-Wong implementation (Fortran) is *much*
faster than LICOR/kmeanspp, which is written in pure R (before making one call
to stats/kmeans at the end), but that is not the point here.
?
The bug concerns a distance computation which should be a matrix of distances of
all data vectors and all current codebook vectors, but is not. The code and an
example illustrating the problem is shown below. Basically, to subtract a vector
from a matrix,?one has to convert the vector into a matrix where all rows are
just copies of the vector. Details are shown below. The fix is trivial.
?
I stumbled upon this because kmeanspp produced counterintuitive and very poor
results.
?
Is this of any interest? Should kmeanspp in LICORS be fixed? I have neither the
experience nor the time to take over an R-package. I could write a more formal
bug report, if required, but I initially would like to know if this is
considered relevant.
Suggestions/comments are welcome.
?
Kind Regards
Bernd Fritzke
?
The code?
? ? ? ? ? ? if (ndim == 1) {
? ? ? ? ? ? ? ? dists <- apply(cbind(data[center_ids, ]), 1,?
? ? ? ? ? ? ? ? ? function(center) {
? ? ? ? ? ? ? ? ? ? rowSums((data - center)^2)
? ? ? ? ? ? ? ? ? })
? ? ? ? ? ? }
? ? ? ? ? ? else {
? ? ? ? ? ? ? ? dists <- apply(data[center_ids, ], 1, function(center) {
? ? ? ? ? ? ? ? ? rowSums((data - center)^2)
? ? ? ? ? ? ? ? })
? ? ? ? ? ? }
?
should rather be (the two changed lines are marked as "fixed"):
?
? ? ? ? ? ? if (ndim == 1) {
? ? ? ? ? ? ? ? dists <- apply(cbind(data[center_ids, ]), 1,?
? ? ? ? ? ? ? ? ? function(center) {
? ? ? ? ? ? ? ? ? ? rowSums((data - rep(center,each=nrow(data) ) )^2) # fixed
? ? ? ? ? ? ? ? ? })
? ? ? ? ? ? }
? ? ? ? ? ? else {
? ? ? ? ? ? ? ? dists <- apply(data[center_ids, ], 1, function(center) {
? ? ? ? ? ? ? ? ? rowSums((data - rep(center,each=nrow(data) ) )^2) # fixed
? ? ? ? ? ? ? ? })
? ? ? ? ? ? }
?
Here is some example code illustrating the problem. The code should compute the
square distances between the six two-dimensional vectors in "data" and
three vectors which happen to be the elements 1, 2, and 4 in "data".
RSTUDIO
```{r}
data=cbind(1:6,rep(0,6))
print("data")
print(data)
center_ids=c(1,2,4)
print("codebook")
print(data[center_ids, ])
# fixed
dists <- apply(data[center_ids, ], 1,?
? ? ? ? ? ? ? ?function(center) {
? ? ? ? ? ? ? ? ? rowSums((data - rep(center,each=nrow(data)))^2)
? ? ? ? ? ? ? }
)
print("dists (correct)")
print(dists)
# buggy
dists <- apply(data[center_ids, ], 1,?
? ? ? ? ? ? ? ?function(center) {
? ? ? ? ? ? ? ? ? rowSums((data - center)^2)
? ? ? ? ? ? ? }
)
print("buggy dists from LICORS")
dists
```
Output:
[1] "data"
? ? ?[,1] [,2]
[1,] ? ?1 ? ?0
[2,] ? ?2 ? ?0
[3,] ? ?3 ? ?0
[4,] ? ?4 ? ?0
[5,] ? ?5 ? ?0
[6,] ? ?6 ? ?0
[1] "codebook"
? ? ?[,1] [,2]
[1,] ? ?1 ? ?0
[2,] ? ?2 ? ?0
[3,] ? ?4 ? ?0
[1] "dists (correct)"
? ? ?[,1] [,2] [,3]
[1,] ? ?0 ? ?1 ? ?9
[2,] ? ?1 ? ?0 ? ?4
[3,] ? ?4 ? ?1 ? ?1
[4,] ? ?9 ? ?4 ? ?0
[5,] ? 16 ? ?9 ? ?1
[6,] ? 25 ? 16 ? ?4
[1] "buggy dists from LICORS"
? ? ?[,1] [,2] [,3]
[1,] ? ?1 ? ?5 ? 25
[2,] ? ?4 ? ?4 ? ?4
[3,] ? ?5 ? ?5 ? 17
[4,] ? 16 ? 16 ? 16
[5,] ? 17 ? 13 ? 17
[6,] ? 36 ? 36 ? 36
?
--
Dr.-Ing. Bernd Fritzke