Petr Savicky
2009-May-23 07:44 UTC
[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Function factor() in the current development version (2009-05-22) guarantees that levels are different character strings. However, they may represent the same decimal number. The following example is derived from a posting by Stavros Macrakis in thread "Match .3 in a sequence" in March nums <- 0.3 + 2e-16 * c(-2,-1,1,2) f <- factor(nums) levels(f) # [1] "0.300000000000000" "0.3" The levels differ in trailing zeros, but represent the same decimal number. Besides that this is not really meaningful, it may cause a problem, when using as.numeric(levels(f)). In the above case, as.numeric() works fine and maps the two levels to the same number. However, there are cases, where the difference in trailing zeros implies different values in as.numeric(levels(f)) and these values may even form a decreasing sequence although levels were constructed from an increasing sequence of numbers. Examples are platform dependent, but may be found by the following code. Tested on Intel under Linux (both with and without SSE) and also under Windows with an older version of R. for (i in 1:100000) { x <- 10^(floor(runif(1, 61, 63)) + runif(1)/2) x <- as.numeric(sprintf("%.14g", x)) eps <- 2^(floor(log2(x)) - 52) k <- round(x * c(5e-16, 1e-15) / eps) if (x > 1e62) { k <- rev( - k) } y <- x + k[1]:k[2] * eps ind <- which(diff(as.numeric(as.character(y))) < 0) for (j in ind) { u1 <- y[c(j, j+1)] u2 <- factor(u1) print(levels(u2)) print(diff(as.numeric(levels(u2)))) aux <- readline("next") } } An example of the output is [1] "1.2296427920313e+61" "1.22964279203130e+61" [1] -1.427248e+45 next [1] "1.82328862326830e+62" "1.8232886232683e+62" [1] -2.283596e+46 next The negative number in diff(as.numeric(levels(u2))) demonstrates cases, when as.numeric(levels(u2)) is decreasing. We can also see that the reason is that the two strings in levels(u2) differ in the trailing zeros. I did quite intensive search for such examples for all possible exponents (not only 61 and 62 and a week of CPU on three processors) and all the obtained examples were caused by a difference in trailing zeros. So, i believe that removing trailing zeros from the output of as.character(x) solves the problem with the reversed order in as.numeric(levels(factor(x))) entirely. A patch against R-devel_2009-05-22, which eliminates trailing zeros from as.character(x), but makes no other changes to as.character(x), is in an attachment. Using the patch, we obtain a better result also in the following. nums <- 0.3 + 2e-16 * c(-2,-1,1,2) factor(nums) # [1] 0.3 0.3 0.3 0.3 # Levels: 0.3 Petr. -------------- next part -------------- --- R-devel/src/main/coerce.c 2009-04-17 17:53:35.000000000 +0200 +++ R-devel-elim-trailing/src/main/coerce.c 2009-05-23 08:39:03.914774176 +0200 @@ -294,12 +294,33 @@ else return mkChar(EncodeInteger(x, w)); } +const char *elim_trailing(const char *s, char cdec) +{ + const char *p; + char *replace; + for (p = s; *p; p++) { + if (*p == cdec) { + replace = (char *) p++; + while ('0' <= *p & *p <= '9') { + if (*(p++) != '0') { + replace = (char *) p; + } + } + while (*(replace++) = *(p++)) { + ; + } + break; + } + } + return s; +} + SEXP attribute_hidden StringFromReal(double x, int *warn) { int w, d, e; formatReal(&x, 1, &w, &d, &e, 0); if (ISNA(x)) return NA_STRING; - else return mkChar(EncodeReal(x, w, d, e, OutDec)); + else return mkChar(elim_trailing(EncodeReal(x, w, d, e, OutDec), OutDec)); } SEXP attribute_hidden StringFromComplex(Rcomplex x, int *warn)
Martin Maechler
2009-May-27 20:51 UTC
[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz> >>>>> on Sat, 23 May 2009 09:44:54 +0200 writes:PS> Function factor() in the current development version PS> (2009-05-22) guarantees that levels are different PS> character strings. However, they may represent the same PS> decimal number. The following example is derived from a PS> posting by Stavros Macrakis in thread "Match .3 in a PS> sequence" in March PS> nums <- 0.3 + 2e-16 * c(-2,-1,1,2) f <- factor(nums) PS> levels(f) # [1] "0.300000000000000" "0.3" PS> The levels differ in trailing zeros, but represent the PS> same decimal number. Besides that this is not really PS> meaningful, it may cause a problem, when using PS> as.numeric(levels(f)). PS> In the above case, as.numeric() works fine and maps the PS> two levels to the same number. However, there are cases, PS> where the difference in trailing zeros implies different PS> values in as.numeric(levels(f)) and these values may PS> even form a decreasing sequence although levels were PS> constructed from an increasing sequence of numbers. PS> Examples are platform dependent, but may be found by the PS> following code. Tested on Intel under Linux (both with PS> and without SSE) and also under Windows with an older PS> version of R. PS> for (i in 1:100000) { PS> x <- 10^(floor(runif(1, 61, 63)) + runif(1)/2) PS> x <- as.numeric(sprintf("%.14g", x)) PS> eps <- 2^(floor(log2(x)) - 52) PS> k <- round(x * c(5e-16, 1e-15) / eps) PS> if (x > 1e62) { k <- rev( - k) } PS> y <- x + k[1]:k[2] * eps PS> ind <- which(diff(as.numeric(as.character(y))) < 0) PS> for (j in ind) { PS> u1 <- y[c(j, j+1)] PS> u2 <- factor(u1) PS> print(levels(u2)) PS> print(diff(as.numeric(levels(u2)))) PS> aux <- readline("next") PS> } PS> } PS> An example of the output is PS> [1] "1.2296427920313e+61" "1.22964279203130e+61" PS> [1] -1.427248e+45 PS> next PS> [1] "1.82328862326830e+62" "1.8232886232683e+62" PS> [1] -2.283596e+46 PS> next PS> The negative number in diff(as.numeric(levels(u2))) demonstrates cases, PS> when as.numeric(levels(u2)) is decreasing. We can also see that the reason PS> is that the two strings in levels(u2) differ in the trailing zeros. PS> I did quite intensive search for such examples for all possible exponents PS> (not only 61 and 62 and a week of CPU on three processors) and all the obtained PS> examples were caused by a difference in trailing zeros. So, i believe that PS> removing trailing zeros from the output of as.character(x) solves the problem PS> with the reversed order in as.numeric(levels(factor(x))) entirely. PS> A patch against R-devel_2009-05-22, which eliminates trailing zeros PS> from as.character(x), but makes no other changes to as.character(x), PS> is in an attachment. Using the patch, we obtain a better result also PS> in the following. PS> nums <- 0.3 + 2e-16 * c(-2,-1,1,2) PS> factor(nums) PS> # [1] 0.3 0.3 0.3 0.3 PS> # Levels: 0.3 yes, of course. PS> Petr. [.....] I have very slightly modified the changes (to get rid of -Wall warnings) and also exported the function as Rf_dropTrailing0(), and tested the result with 'make check-all' . As the change seems reasonable and consequent, and as it seems not to produce any problems in our tests, I'm hereby proposing to commit it (my version of it), [to R-devel only] within a few days, unless someone speaks up. Martin Maechler, ETH Zurich PS> --- R-devel/src/main/coerce.c 2009-04-17 17:53:35.000000000 +0200 PS> +++ R-devel-elim-trailing/src/main/coerce.c 2009-05-23 08:39:03.914774176 +0200 PS> @@ -294,12 +294,33 @@ PS> else return mkChar(EncodeInteger(x, w)); PS> } PS> +const char *elim_trailing(const char *s, char cdec) PS> +{ PS> + const char *p; PS> + char *replace; PS> + for (p = s; *p; p++) { PS> + if (*p == cdec) { PS> + replace = (char *) p++; PS> + while ('0' <= *p & *p <= '9') { PS> + if (*(p++) != '0') { PS> + replace = (char *) p; PS> + } PS> + } PS> + while (*(replace++) = *(p++)) { PS> + ; PS> + } PS> + break; PS> + } PS> + } PS> + return s; PS> +} PS> + PS> SEXP attribute_hidden StringFromReal(double x, int *warn) PS> { PS> int w, d, e; PS> formatReal(&x, 1, &w, &d, &e, 0); PS> if (ISNA(x)) return NA_STRING; PS> - else return mkChar(EncodeReal(x, w, d, e, OutDec)); PS> + else return mkChar(elim_trailing(EncodeReal(x, w, d, e, OutDec), OutDec)); PS> } PS> SEXP attribute_hidden StringFromComplex(Rcomplex x, int *warn)
Wacek Kusnierczyk
2009-Jun-01 14:30 UTC
[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler wrote:>>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz> >>>>>> on Sun, 31 May 2009 10:29:41 +0200 writes: >>>>>> > > [............] > > PS> I appreciate the current version, which contains static > PS> const char* dropTrailing0(char *s, char cdec) ... > PS> mkChar(dropTrailing0((char *)EncodeReal(x, w, d, e, > PS> OutDec), ... > > PS> Here, is better visible that the cast (char *) is used > PS> than if it was hidden inside dropTrailing0(). Also, it > PS> makes dropTrailing0() more consistent. > > PS> I would like to recall the already discussed > PS> modification if (replace != p) while((*(replace++) > PS> *(p++))) ; which saves a few instructions in the more > PS> frequent case that there are no trailing zeros. > > Yes, thank you. This already was in my working version, > and I had managed to lose it again. > Will put i back > > .... still hoping this topic would be closed now ... >i would rather hope for the EncodeReal flaw to be repaired... vQ