Hi, Summary: I ran into some unexpected behavior in approx() and tapply() that introduced NA's in "clean" data due to (?) numerical accuracy/round off. The culprit seems to be in match() that coerces it's arguments to character, loosing precision in the process. [R development version 1.2.0, 08 Nov 2000, on Linux] Example:> r > [1] 0.6931472 0.6931472 0.6931472 0.6931472 0.6931472 1.3373177 1.8052529 > [8] 3.3902202 4.5529401the 5th element is not exactly equal to the first 4 elements:> diff(r) > [1] 0.000000e+00 0.000000e+00 0.000000e+00 1.110223e-16 6.441705e-01 > [6] 4.679352e-01 1.584967e+00 1.162720e+00which agrees with unique(r)> unique(r) > [1] 0.6931472 0.6931472 1.3373177 1.8052529 3.3902202 4.5529401(note that .Machine$double.eps is [1] 2.220446e-16, twice the 4th diff.) Now the following call to approx() erroneously introduces an NA:> approx(r, rec.p, ties = min)Error in approx(r, rec.p, ties = "min") : approx(): attempted to interpolate NA values where the value of rec.p is> rec.p[1] 0.050 0.100 0.250 0.500 0.750 0.900 0.950 0.990 0.999 Looking at approx we see a call to tapply, which is producing the NA> tapply(rec.p, r, min)0.693147180559945 0.693147180559945 1.33731770200756 1.80525289211574 0.050 NA 0.900 0.950 3.39022015484073 4.55294011989587 0.990 0.999 Upon inspection, we see that tapply coerces its INDEX (second) argument to factors, which in turn does a match():> match(r, unique(r))[1] 1 1 1 1 1 3 4 5 6 which shows that match() incorrectly assigned the 5th entry in r to the first element of unique(r) -- not the 2nd one. (We also see that index 2 does not appear on the output of match() at all.) Thus match() seems to be introducing the round off as it coerces its arguments to character:> matchfunction (x, table, nomatch = NA) .Internal(match(as.character(x), as.character(table), nomatch)) I'm enclosing the "r" and "rec.p" vectors as binary R data. I wish I could send the ASCII versions, but then I would be introducing rounding and the example above could not be reproduced. By the way, I *do* realize that it's my (i.e., user) responsibility to deal myself with such accuracy/precision issues (which I wishfully thought I was by specifying "ties = min" to approx). Nevertheless, there seems to be some inconsistency somewhere. With regards, David A. James Statistics Research, Room 2C-253 Phone: (908) 582-3082 Bell Labs, Lucent Technologies Fax: (908) 582-3340 Murray Hill, NJ 09794-0636 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/octet-stream Size: 381 bytes Desc: tmp.Rdata Url : https://stat.ethz.ch/pipermail/r-help/attachments/20001120/8241c4c3/attachment.obj
David James <dj at research.bell-labs.com> writes:> > [1] 0.000000e+00 0.000000e+00 0.000000e+00 1.110223e-16 6.441705e-01 > > [6] 4.679352e-01 1.584967e+00 1.162720e+00 > > which agrees with unique(r) > > > unique(r) > > [1] 0.6931472 0.6931472 1.3373177 1.8052529 3.3902202 4.5529401 > > (note that .Machine$double.eps is [1] 2.220446e-16, twice the 4th diff.)Yep, but since the number is < 1 and double.eps is a relative quantity, it's probably still a last-bit representation issue in 64 bit doubles (i.e. it is not the infamous Intel extended precision playing tricks on us again.) -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
>>>>> "DJ" == David James <dj@research.bell-labs.com> writes:{diverted from R-help to R-devel } DJ> Summary: DJ> I ran into some unexpected behavior in approx() and tapply() that DJ> introduced NA's in "clean" data due to (?) numerical accuracy/round DJ> off. The culprit seems to be in match() that coerces it's DJ> arguments to character, loosing precision in the process. yes, definitely! match() shouldn't do the character coercion! ==> This is a bug; I'll send a report ........... DJ> Thus match() seems to be introducing the round off as it coerces its DJ> arguments to character: >> match DJ> function (x, table, nomatch = NA) DJ> .Internal(match(as.character(x), as.character(table), nomatch)) ....... If you look at R's source { do_match() in src/main/unique.c }, it becomes clear that the .Internal(match(.)) does not rely at all on character inputs, and match <- function (x, table, nomatch = NA) .Internal(match(x, table, nomatch)) *seems* to work alright. But, alas, it only *seems*, since the above change breaks "make check" more concretely, example(relevel) leads to an error {when the above changed match() function is used} which can be diagnosed to factor(.,.) not working anymore with the changed match() function. I've seen how to fix factor(.,.) as well, but this is coming involved. I think that the above change is *really* what we should use, but we the consequences must be taken care as well. Expect a fix for R 1.2, and thank you very much for your useful report! Martin Maechler <maechler@stat.math.ethz.ch> http://stat.ethz.ch/~maechler/ Seminar fuer Statistik, ETH-Zentrum LEO D10 Leonhardstr. 27 ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND phone: x-41-1-632-3408 fax: ...-1228 <>< -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Possibly Parallel Threads
- precision, incorrect(?) tapply() NA's
- How to remove 'NA's?
- Weird erratic error and illogical error message, could someone explain this?
- Feature request: non-dropping regmatches/strextract
- Argument "nomatch" matched by multiple actual arguments ... %in% -> match?!?