javier garcia - CEBAS
2004-Aug-12 10:53 UTC
[R] error using daisy() in library(cluster). Bug?
Hi,

I'm using the cluster library to examine multivariate data. The data come from a connection to a Postgres database, and I wrote a short R script to do the analysis. With the cluster version included in R 1.8.0, daisy() worked well for my data, but now, when I call daisy(), I get the following messages:

---------
Error in if (any(sx == 0)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
binary variable(s) 116 treated as interval scaled in: daisy(concentracion.data.frame, stand = TRUE)
---------

All the variables in my data frame are numeric, although some contain NA values. I've seen that if I run the analysis on a subset of the data frame, selecting just the columns with no NA, the result is fine. Could this be a bug?

Thanks, and best regards,
Javier
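For readers who want to see where a message like this can come from, here is a minimal base-R sketch. It assumes a numeric column containing NA and mimics the centring step and the `any(sx == 0)` test named in the error; the data frame `df` and its columns are hypothetical, not the poster's data, and the snippet does not call daisy() itself.

```r
# Hypothetical 5 x 3 data frame; column 'b' contains NA values
# (an assumption standing in for the real data, which is not shown).
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(NA, 1, 2, NA, 3),
                 c = c(2, 4, 6, 8, 10))

x  <- scale(as.matrix(df), center = TRUE, scale = FALSE)  # centre each column
sx <- colMeans(abs(x))     # no na.rm, so the entry for 'b' is NA

cond <- any(sx == 0)       # NA: no comparison is TRUE and one is NA

# if() cannot branch on NA, so the error is caught here and its message kept
res <- tryCatch(if (cond) "constant column" else "ok",
                error = function(e) conditionMessage(e))
print(res)
```

Dropping the columns that contain NA makes `cond` a plain FALSE, which matches the observation that the NA-free subset works.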
¡Hola Javier!

Since I am the maintainer of the cluster *package* (not "library"), I'm interested to find out more about this problem.

I assume you now use R 1.9.1. Can you give us an example we can reproduce? Give the exact R commands you use and maybe attach the save()d data file (*.rda) in a private e-mail? Or do this on R-help and give a URL where one can download the data (you can't attach such binary files on R-help).

Thank you,
Martin Maechler

>>>>> "javier" == javier garcia - CEBAS <rn001 at cebas.csic.es>
>>>>>     on Thu, 12 Aug 2004 12:53:28 +0200 writes:

    javier> Hi, I'm using the cluster library to examine
    javier> multivariate data. The data come from a connection
    javier> to a postgres database, and I did a short R script
    javier> to do the analysis. With the cluster version
    javier> included in R1.8.0, daisy worked well for my data,
    javier> but now, when I call daisy, I obtain the following
    javier> messages:
    javier> ---------
    javier> Error in if (any(sx == 0)) { :
    javier> missing value where TRUE/FALSE needed In addition:
    javier> Warning message: binary variable(s) 116 treated as
    javier> interval scaled in: daisy(concentracion.data.frame,
    javier> stand = TRUE)
    javier> ---------

    javier> All the variables in my dataframe are
    javier> numeric. Although I've got NA values, and I've seen
    javier> that if I do the analysis for a subset of the
    javier> dataframe, selecting just columns with no NA, the
    javier> result is good. Could this be a bug?

    javier> Thanks, and best regards
    javier> Javier
[Reverted back to R-help, after private exchange]

>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Thu, 12 Aug 2004 17:12:01 +0200 writes:

>>>>> "javier" == javier garcia - CEBAS <rn001 at cebas.csic.es>
>>>>>     on Thu, 12 Aug 2004 16:28:27 +0200 writes:

    javier> Martin; Yes I know that there are variables with all
    javier> five values 'NA'. I've left them as they are just
    javier> to save a couple of lines in the script,
    javier> and because I like to see that they are there,
    javier> although all values are 'NA'. I don't expect they
    javier> are used in the analysis, but are they the source of
    javier> the problem?

    MM> yes, but only because of "stand = TRUE".

    MM> Yes, one could imagine that it might be good if
    MM> standardizing these "all NA variables" would work

    MM> I'll think a bit more about it. Thank you for the
    MM> example.

Ok. I've thought (and looked at the R code) a bit longer, and also considered the fact (which you mentioned) that this worked in R 1.8.0. Hence, I'm considering the current behavior a bug.

Here is the patch (apply to cluster/R/daisy.q in the *source* or at the appropriate place in <cluster_installed>/R/cluster):

--- daisy.q     2004/06/25 16:17:47     1.17
+++ daisy.q     2004/08/12 15:23:26
@@ -78,8 +78,8 @@
     if(all(type2 == "I")) {
         if(stand) {
             x <- scale(x, center = TRUE, scale = FALSE) #-> 0-means
-            sx <- colMeans(abs(x))
-            if(any(sx == 0)) {
+            sx <- colMeans(abs(x), na.rm = TRUE)# can still have NA's
+            if(0 %in% sx) {
                 warning(sQuote("x"), " has constant columns ",
                         pColl(which(sx == 0)), "; these are standardized to 0")
                 sx[sx == 0] <- 1

Thank you for helping to find and fix this bug.

Martin Maechler, ETH Zurich, Switzerland

    javier> On Thu 12 Aug 2004 15:11, MM wrote:

    >>> Javier, I could well read your .RData and try your
    >>> script to produce the same error from daisy().
    >>>
    >>> Your dataframe is of dimension 5 x 180 and has many
    >>> variables that have all five values 'NA' (see below).
    >>>
    >>> You can't expect to use these, can you? Martin
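The effect of the two changed lines in the patch can be checked in isolation with base R. The sketch below is only an illustration under assumed data: the matrix `x` and its column names are hypothetical, and only the `scale()`, `colMeans(..., na.rm = TRUE)` and `0 %in% sx` steps are taken from the patch. The rest of daisy() is not reproduced.

```r
# Hypothetical data: 'b' is all NA, 'c' is constant, 'd' has a single NA.
x <- cbind(a = c(1, 2, 3, 4, 5),
           b = rep(NA_real_, 5),
           c = rep(7, 5),
           d = c(1, NA, 3, 5, 7))

x  <- scale(x, center = TRUE, scale = FALSE)  # centre columns, as in the patch
sx <- colMeans(abs(x), na.rm = TRUE)          # 'b' is NaN, the others are finite

has_const  <- 0 %in% sx       # %in% always yields TRUE or FALSE, never NA
const_cols <- which(sx == 0)  # which() ignores the NA produced by NaN == 0

print(sx)
print(has_const)
print(const_cols)
```

With `na.rm = TRUE`, a column with only some NA values gets a usable mean absolute deviation, while an all-NA column still yields NaN (the "can still have NA's" comment in the patch). Because `0 %in% sx` is based on matching rather than on `==`, the `if()` condition stays well defined even then.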