Vince Carey found this (thank you!). Since the fix to the problem is not entirely obvious, I post this to R-devel as RFC: help(dist) says: >> Missing values are allowed, and are excluded from all computations >> involving the rows within which they occur. If some columns are >> excluded in calculating a Euclidean, Manhattan or Canberra >> distance, the sum is scaled up proportionally to the number of >> columns used. If all pairs are excluded when calculating a >> particular distance, the value is `NA'. but the C code in ....../src/library/mva/src/distance.c, has, e.g. for the Euclidean distance : count= 0; dist = 0; for(j = 0 ; j < nc ; j++) { if(R_FINITE(x[i1]) && R_FINITE(x[i2])) { dev = (x[i1] - x[i2]); dist += dev * dev; count++; } i1 += nr; i2 += nr; } if(count == 0) return NA_REAL; if(count != nc) dist /= ((double)count/nc); return sqrt(dist); where it is clear that "R_FINITE(*)" should in principle be replaced by "!ISNAN(*)". Note however that "Inf - Inf -> NaN" and e.g the canberra metric has more ways to get "NaN". The current code drops all pairs with an +-Inf in it. I would be inclined to really replace if(R_FINITE(x[i1]) && R_FINITE(x[i2])) { by if(!ISNAN(x[i1]) && !ISNAN(x[i2])) { for all metrics -- for R-patched. But I'd also see reasons where we'd want to be smarter/different than that. One possibility would be to drop the pair (as currently) also when both are not finite, for "binary" to signal an error for +- Inf, and for "Canberra" : Of course, d = |x - y| / |x + y| will be 1 , when one of {x,y} is infinite. This could be considered the desired answser, or also we may give a warning in any case. Opinions? Martin Maechler <maechler@stat.math.ethz.ch> http://stat.ethz.ch/~maechler/ Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27 ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND phone: x-41-1-632-3408 fax: ...-1228 <>< -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
ripley@stats.ox.ac.uk
2002-Oct-21 16:44 UTC
dist() {"mva" package} bug: treats +/- Inf as NA
I think this is definitely better left as is (and definitely so for 1.6.1), perhaps documenting the fact. What is done seems quite sensible to me: you really can't have infinite values in an L_2 or L_1 space, and so Euclidean and L_1 (= Manhattan) distances are not defined. One could argue for just one infinity point, with zero distance between infinity points and infinite to all finite points, for example (and NA to missing values). It really isn't just one answer. On Mon, 21 Oct 2002, Martin Maechler wrote:> Vince Carey found this (thank you!). > Since the fix to the problem is not entirely obvious, I post > this to R-devel as RFC: > > help(dist) says: > > >> Missing values are allowed, and are excluded from all computations > >> involving the rows within which they occur. If some columns are > >> excluded in calculating a Euclidean, Manhattan or Canberra > >> distance, the sum is scaled up proportionally to the number of > >> columns used. If all pairs are excluded when calculating a > >> particular distance, the value is `NA'. > > but the C code in ....../src/library/mva/src/distance.c, > has, e.g. for the Euclidean distance : > > count= 0; > dist = 0; > for(j = 0 ; j < nc ; j++) { > if(R_FINITE(x[i1]) && R_FINITE(x[i2])) { > dev = (x[i1] - x[i2]); > dist += dev * dev; > count++; > } > i1 += nr; > i2 += nr; > } > if(count == 0) return NA_REAL; > if(count != nc) dist /= ((double)count/nc); > return sqrt(dist); > > where it is clear that "R_FINITE(*)" should in principle be > replaced by "!ISNAN(*)". > > Note however that "Inf - Inf -> NaN" and e.g the canberra metric > has more ways to get "NaN". > > The current code drops all pairs with an +-Inf in it. > I would be inclined to really replace > if(R_FINITE(x[i1]) && R_FINITE(x[i2])) { > by > if(!ISNAN(x[i1]) && !ISNAN(x[i2])) { > > for all metrics -- for R-patched. > > But I'd also see reasons where we'd want to be smarter/different > than that. One possibility would be to drop the pair (as > currently) also when both are not finite, > for "binary" to signal an error for +- Inf, > and for "Canberra" : > Of course, d = |x - y| / |x + y| > will be 1 , when one of {x,y} is infinite. > This could be considered the desired answser, or also > we may give a warning in any case. > > Opinions? > > > Martin Maechler <maechler@stat.math.ethz.ch> http://stat.ethz.ch/~maechler/ > Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27 > ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND > phone: x-41-1-632-3408 fax: ...-1228 <>< > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- > r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html > Send "info", "help", or "[un]subscribe" > (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._ >-- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272860 (secr) Oxford OX1 3TG, UK Fax: +44 1865 272595 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._