Here are some data frames:

df3.2 <- data.frame(1:3, 7:9)
df4.2 <- data.frame(1:4, 7:10)
df3.3 <- data.frame(1:3, 7:9, 10:12)
df4.3 <- data.frame(1:4, 7:10, 10:13)
df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)

Now here are some commands and their answers:

> median(df3.2)
[1] 2 8
> median(df4.2)
[1] 2.5 8.5
> median(df3.3)
  NA
1  7
2  8
3  9
> median(df4.3)
  NA
1  7
2  8
3  9
4 10
> median(df3.4)
[1]  8 11
> median(df4.4)
[1]  8.5 11.5
> median(df3.2[c(1,2,3),])
[1] 2 8
> median(df3.2[c(1,3,2),])
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

The sessionInfo is below, but it looks
to me like the present behavior started
in 2.10.0.

Sometimes it gets the right answer. I'd
be grateful to hear how it does that -- I
can't figure it out.

Under the current regime we can get numbers
that are correct, partially correct, or sort
of random (given the intention).

I claim that much better behavior would be
to always get exactly one of the following:

* a numeric answer (that is consistently correct)
* an error

I would think a method in analogy to
'mean.data.frame' would be a logical choice.
But I'm presuming there might be an argument
against that or 'median.data.frame' would already
exist.

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] graphics  grDevices utils     datasets  stats     methods   base

other attached packages:
[1] xts_0.8-0 zoo_1.6-5

loaded via a namespace (and not attached):
[1] grid_2.13.0     lattice_0.19-23 tools_2.13.0

--
Patrick Burns
pburns at pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')
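The "numeric answer or error" requirement stated above could be sketched roughly as follows. This is only an illustration: the name strict_median_df is hypothetical (it is not part of base R or of any proposal in this thread), and the choice to stop on any non-numeric column is an assumption about what "an error" should mean here.

```r
# Hypothetical helper (not part of base R): a columnwise median that
# either returns a numeric vector or stops with an error.
strict_median_df <- function(x, na.rm = FALSE) {
  stopifnot(is.data.frame(x))
  # Identify columns that are not numeric (factors, characters, ...).
  ok <- vapply(x, is.numeric, logical(1))
  if (!all(ok)) {
    stop("non-numeric column(s): ", paste(names(x)[!ok], collapse = ", "))
  }
  # Each column is passed to median() as a plain vector.
  vapply(x, stats::median, numeric(1), na.rm = na.rm)
}

df3.2 <- data.frame(1:3, 7:9)
strict_median_df(df3.2[c(1, 3, 2), ])
```

With this sketch a data frame containing a factor column produces an error rather than a partially correct result.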
On Apr 27, 2011, at 19:44 , Patrick Burns wrote:

> I would think a method in analogy to
> 'mean.data.frame' would be a logical choice.
> But I'm presuming there might be an argument
> against that or 'median.data.frame' would already
> exist.

Only if someone had a better plan. As you are probably well aware, what you are currently seeing is a rather exquisite mashup of methods getting applied to objects they shouldn't be applied to. Some curious effects are revealed, e.g. this little beauty:

> sort(df3.3)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
  undefined columns selected
> names(df3.3)<-NULL
> sort(df3.3)
    NA   NA   NA   NA   NA   NA   NA   NA   NA
1    1    7   10 NULL NULL NULL NULL NULL NULL
2    2    8   11 <NA> <NA> <NA> <NA> <NA> <NA>
3    3    9   12 <NA> <NA> <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns <pburns at pburns.seanet.com> wrote:

> Here are some data frames:
>
> df3.2 <- data.frame(1:3, 7:9)
> df4.2 <- data.frame(1:4, 7:10)
> df3.3 <- data.frame(1:3, 7:9, 10:12)
> df4.3 <- data.frame(1:4, 7:10, 10:13)
> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>
> Now here are some commands and their answers:
>
>> median(df4.4)
> [1]  8.5 11.5
>> median(df3.2[c(1,2,3),])
> [1] 2 8
>> median(df3.2[c(1,3,2),])
> [1]  2 NA
> Warning message:
> In mean.default(X[[2L]], ...) :
>   argument is not numeric or logical: returning NA
>
> The sessionInfo is below, but it looks
> to me like the present behavior started
> in 2.10.0.
>
> Sometimes it gets the right answer. I'd
> be grateful to hear how it does that -- I
> can't figure it out.
>

Hello, Pat.

Nice poetry there! I think I have an actual answer, as opposed to the usual crap I spew.

I would agree if you said median.data.frame ought to be written to work columnwise, similar to mean.data.frame.

apply and sapply always give the correct answer

> apply(df3.3, 2, median)
  X1.3   X7.9 X10.12
     2      8     11
> apply(df3.2, 2, median)
X1.3 X7.9
   2    8
> apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9
   2    8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

median.data.frame <- function(x,...) sapply(x, median, ...)

It works, see:

> median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9
   2    8

Would our next step be to enter that somewhere in R bugzilla? (I'm not joking--I'm that naive).

I think I can explain why the current median works intermittently in those cases you mention. Give it a small set of pre-sorted data, all is well. median.default uses a sort function, and it is confused when it is given a data.frame object rather than just a vector.
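As a small check of the columnwise suggestion above, the sketch below applies median to each column as a plain vector and confirms that the result does not depend on row order. It reuses the data frames from the original post; the variable names m_sorted and m_shuffled are made up for this illustration.

```r
df3.2 <- data.frame(1:3, 7:9)
df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)

# Columnwise medians, computed one column (vector) at a time.
m_sorted   <- sapply(df3.2, median)
m_shuffled <- sapply(df3.2[c(1, 3, 2), ], median)

# Row order does not matter when each column is handled as a vector.
stopifnot(identical(m_sorted, m_shuffled))

# For comparison: the columnwise medians of df4.4 are
# 2.5, 8.5, 11.5 and 16.5, i.e. four values, not the two
# values that median(df4.4) returned in the original post.
sapply(df4.4, median)
```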
I put a browser() at the top of median.default

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at <tmp>#4: if (is.factor(x)) stop("need numeric data")
Browse[2]> n
debug at <tmp>#4: NULL
Browse[2]> n
debug at <tmp>#6: if (length(names(x))) names(x) <- NULL
Browse[2]> n
debug at <tmp>#6: names(x) <- NULL
Browse[2]> n
debug at <tmp>#8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#8: if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#8: NULL
Browse[2]> n
debug at <tmp>#12: n <- length(x)
Browse[2]> n
debug at <tmp>#13: if (n == 0L) return(x[FALSE][NA])
Browse[2]> n
debug at <tmp>#13: NULL
Browse[2]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

Note the sort there in step 16. I think that's what is killing us.

If you are lucky and give it a small data frame that is in order, like df3.2, the sort doesn't produce gibberish. When I get to that point, I will show you the sort's effect.

First, the case that "works".
I moved the browser() down, because I got tired of looking at the same old not-yet-erroneous output.

> median(df3.2)
Called from: median.default(df3.2)
Browse[1]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively, type

Browse[2]> sort(x, partial = half + 0L:1L)
    NA   NA   NA   NA   NA   NA
1    1    7 NULL NULL NULL NULL
2    2    8 <NA> <NA> <NA> <NA>
3    3    9 <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

But it still gives you a "right" answer:

Browse[2]> n
[1] 2 8

But if you give it data out of order, the second column turns to NA, and that causes doom.

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at <tmp>#15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively:

Browse[2]> sort(x, partial = half + 0L:1L)
    NA   NA   NA   NA   NA   NA
1    1 NULL    7 NULL NULL NULL
3    3 <NA>    9 <NA> <NA> <NA>
2    2 <NA>    8 <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

Here's a larger test case.
Note columns 1 and 3 turn to NULL

> df8.8 <- data.frame(a=2:8, b=1:7)
> median(df8.8)
debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> sort(x, partial = half + 0L:1L)
    NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
1 NULL    2 NULL    1 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
2 <NA>    3 <NA>    2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA>    4 <NA>    3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA>    5 <NA>    4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA>    6 <NA>    5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 <NA>    7 <NA>    6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
7 <NA>    8 <NA>    7 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Run ?sort and you see it was not intended for data.frames.

In conclusion, I think median applied to a data.frame causes undefined behavior because median is not intended to deal with several columns at once.

I don't see any changes in median.default that would explain the changes you see.
Compare:

From R-2.13, src/library/stats/R/median.R

median.default <- function(x, na.rm = FALSE)
{
    if(is.factor(x)) stop("need numeric data")
    ## all other objects only need sort() & mean() to be working
    if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_)
    if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x))) return(x[FALSE][NA])
    n <- length(x)
    if (n == 0L) return(x[FALSE][NA])
    half <- (n + 1L) %/% 2L
    if(n %% 2L == 1L) sort(x, partial = half)[half]
    else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
}

From R-2.9.2

median.default <- function(x, na.rm = FALSE)
{
    if(is.factor(x)) stop("need numeric data")
    ## all other objects only need sort() & mean() to be working
    if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_)
    if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x))) return(x[FALSE][NA])
    n <- length(x)
    if (n == 0L) return(x[FALSE][NA])
    half <- (n + 1L) %/% 2L
    if(n %% 2L == 1L) sort(x, partial = half)[half]
    else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
}

pj

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
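One possible reading of the median.default source quoted above, offered as a hedged sketch and not as something any poster in the thread stated: length() of a data frame is its number of columns, and single-bracket indexing selects columns, so n, half and x[half + 0L:1L] all operate on columns, not on rows. The code below replays only that arithmetic on df3.4. It deliberately does not call median() or sort() on a data frame, because what those calls do depends on the R version and is the subject of the thread. The variable names idx and picked are illustrative.

```r
df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)

# length() of a data frame is its number of columns, not rows.
n <- length(df3.4)          # 4 columns
half <- (n + 1L) %/% 2L     # 2
idx <- half + 0L:1L         # 2 3

# Single-bracket indexing on a data frame selects columns,
# so this picks columns 2 and 3.
picked <- df3.4[idx]

# Column means of the picked columns are 8 and 11, the same
# numbers the original post reports for median(df3.4).
colMeans(picked)
```

The same arithmetic gives 2 and 8 for df3.2 (two columns, both picked), which happen to equal the true medians because the columns are evenly spaced. This sketch does not account for what sort() does to the data frame along the way, so it says nothing about the NA seen with the shuffled rows.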
This seems trivially fixable using something like

median.data.frame <- function(x, na.rm=FALSE) {
    sapply(x, function(y, na.rm=FALSE)
               if(is.factor(y)) NA else median(y, na.rm=na.rm),
           na.rm=na.rm)
}

>>> Paul Johnson <pauljohn32 at gmail.com> 28/04/2011 06:20 >>>

> I think we would suggest this for medians:
>
> median.data.frame <- function(x,...) sapply(x, median, ...)