Martin Keller-Ressel
2020-Sep-23 10:32 UTC
[R] jitter-bug? problematic behaviour of the jitter function
Dear all,

I have noticed some strange behaviour in the "jitter" function in R. The help page for jitter states that

"The result, say r, is r <- x + runif(n, -a, a) where n <- length(x) and a is the amount argument (if specified)."

and

"If amount is NULL (default), we set a <- factor * d/5 where d is the smallest difference between adjacent unique (apart from fuzz) x values."

This works fine as long as there is no (very) large outlier:

> jitter(c(1,2,10^4)) # desired behaviour
[1] 1.083243 1.851571 9999.942716

But for very large outliers the added noise suddenly "jumps" to a much larger scale:

> jitter(c(1,2,10^5)) # bad behaviour
[1] -19535.649 9578.702 115693.854
# Noise should be of order (2-1)/5 = 0.2 but is of much larger order.

This probably does not matter much when jitter is used for plotting, but it can cause problems when jitter is used to break ties.

Best regards,
Martin

--------------------------------
Martin Keller-Ressel
Professor für Stochastische Analysis und Finanzmathematik
Technische Universität Dresden
Institut für Mathematische Stochastik
Willersbau B 316, Zellescher Weg 12-14
01062 Dresden
--------------------------------
Duncan Murdoch
2020-Sep-23 14:57 UTC
[R] jitter-bug? problematic behaviour of the jitter function
On 23/09/2020 6:32 a.m., Martin Keller-Ressel wrote:
> [...]
> But for very large outliers the added noise suddenly "jumps" to a much larger scale:
>
>> jitter(c(1,2,10^5)) # bad behaviour
> [1] -19535.649 9578.702 115693.854
> # Noise should be of order (2-1)/5 = 0.2 but is of much larger order.
> [...]

I think this is kind of documented: "apart from fuzz" is what counts. If you look at the code for jitter, you'll see this important line:

    d <- diff(xx <- unique(sort.int(round(x, 3 - floor(log10(z))))))

By the time you get here, z is the length of the range of the data, so it's 99999 in your example. The rounding changes your values to 0, 0, 1e5, so the smallest difference is 1e5.

Duncan Murdoch
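The arithmetic behind that line can be traced step by step. What follows is a minimal sketch, assuming finite input with a non-zero range. It mirrors only the line quoted above and the help-page formula a <- factor * d/5, not the full source of jitter, and the names digits, rounded and amount are chosen here purely for illustration.

```r
x <- c(1, 2, 10^5)

z <- diff(range(x))              # length of the data range: 99999
digits <- 3 - floor(log10(z))    # 3 - 4 = -1, i.e. round to the nearest ten
rounded <- round(x, digits)      # 0, 0, 1e5: the values 1 and 2 collapse to 0
xx <- unique(sort.int(rounded))  # only 0 and 1e5 remain
d <- min(diff(xx))               # smallest remaining gap: 1e5
amount <- 1 * d / 5              # help-page formula with factor = 1: 20000
```

With amount at 20000, the term runif(n, -a, a) produces noise of up to 20000 in either direction, which matches the scale of the output reported in the original post.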
Rui Barradas
2020-Sep-23 19:32 UTC
[R] jitter-bug? problematic behaviour of the jitter function
Hello,

R 4.0.2 on Ubuntu 20.04, sessionInfo at end.

This came up in r-help. I'm answering the OP and also posting to r-devel, since I believe it is more appropriate there.

I can confirm this. The first and last instructions below are the original ones, but the error shows up even with smaller numbers.

set.seed(2020)
jitter(c(1,2,10^4)) # desired behaviour
#[1] 1.058761 1.957690 10000.047401
jitter(c(0,1,10^4)) # bad behaviour
#[1] -92.43546 -1454.61126 8269.53754
jitter(c(-1,0,10^4)) # bad behaviour
#[1] -1484.3895 -427.5283 8010.3308
jitter(c(1,2,10^5)) # bad behaviour
#[1] 4809.238 10578.561 109753.430

To the OP: I am cc-ing this to r-devel at r-project.org. Questions like this are about R itself and should be posted there.

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=pt_PT.UTF-8 LC_NUMERIC=C
 [3] LC_TIME=pt_PT.UTF-8 LC_COLLATE=pt_PT.UTF-8
 [5] LC_MONETARY=pt_PT.UTF-8 LC_MESSAGES=pt_PT.UTF-8
 [7] LC_PAPER=pt_PT.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.0.2

Hope this helps,

Rui Barradas
Rui Barradas
2020-Sep-23 20:03 UTC
[R] jitter-bug? problematic behaviour of the jitter function
Hello,

I believe that, though Duncan's explanation is right, it does not explain the value of the digits argument. round makes the first 2 numbers 0, but why?

The function below prints the digits argument and then outputs d. The code is taken from jitter.

f <- function(x){
  z <- diff(r <- range(x[is.finite(x)]))
  cat("digits:", 3 - floor(log10(z)), "\n")
  diff(xx <- unique(sort.int(round(x, 3 - floor(log10(z))))))
}

Now see what cat outputs for 'digits'.

f(c(1,2,10^4)) # desired behaviour
#digits: 0
#[1] 1 9998
f(c(0,1,10^4)) # bad behaviour
#digits: -1
#[1] 10000
f(c(-1,0,10^4)) # bad behaviour
#digits: -1
#[1] 10000
f(c(1,2,10^5)) # bad behaviour
#digits: -1
#[1] 1e+05

And according to the documentation of ?round, negative digits are allowed:

    Rounding to a negative number of digits means rounding to a power of ten, so for example round(x, digits = -2) rounds to the nearest hundred.

But in this case two of the numbers are closer to 0 than they are to 10. unique then keeps only 0 and the largest value, so the diff is big.

round(c(1,2,10^4),0) # desired behaviour
#[1] 1 2 10000
round(c(0,1,10^4),-1) # bad behaviour
#[1] 0 0 10000
round(c(-1,0,10^4),-1) # bad behaviour
#[1] 0 0 10000
round(c(1,2,10^5),-1) # bad behaviour
#[1] 0e+00 0e+00 1e+05

Isn't it still a bug?

Rui Barradas
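For the tie-breaking use case raised in the original post, one possible workaround is to bypass the rounding step by passing amount explicitly. The sketch below assumes that the desired noise scale is one fifth of the smallest gap between distinct raw values, as the help-page formula suggests. jitter_raw_gap is a hypothetical helper, not part of R, and it deliberately skips the "apart from fuzz" rounding, so values that differ only by floating-point noise would yield a very small amount.

```r
# Hypothetical helper, not part of R: jitter with an amount derived from the
# smallest gap between distinct raw values, without any prior rounding.
jitter_raw_gap <- function(x, factor = 1) {
  gaps <- diff(sort(unique(x[is.finite(x)])))
  if (length(gaps) == 0L) {
    # Fewer than two distinct finite values: fall back to the default rule.
    return(jitter(x, factor = factor))
  }
  # An explicit, non-zero amount makes jitter() add runif(n, -amount, amount).
  jitter(x, amount = factor * min(gaps) / 5)
}

set.seed(2020)
x <- c(1, 2, 10^5)
y <- jitter_raw_gap(x)
max(abs(y - x))  # stays within 0.2, regardless of the outlier
```

Whether skipping the fuzz tolerance is acceptable depends on the data; for integer-valued or otherwise exactly represented inputs it should be harmless.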