atp@piskorski.com
2005-Oct-09 19:04 UTC
[Rd] [ subscripting sometimes loses names (PR#8192)
--rwEMma7ioTxnRzrJ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline R, like recent versions of S-Plus, sometimes - but not always - loses names when subscripting objects with "[". (Earlier versions of S and S-Plus had the correct, name-preserving behavior.) This seems bad, it would be better to remove names only by explicit request, not as an accidental side-effect of some (but not all) subscripting operations. This issue was also discusses back in 2001 on the S-News list: http://www.biostat.wustl.edu/archives/html/s-news/2001-09/msg00020.html The attached file, "fix-names.s", is also available here: http://www.piskorski.com/R/patches/fix-names.s It includes: 1. The function dtk.test.brace.names(), which demonstrates name losing problem, and can automatically report which test cases pass/fail, etc. 2. Wrappers for the "[" and "[.data.frame" functions which fix the losing names problem for all the cases I've tried. Note that dtk.test.brace.names(T) will always run all its test cases and return their output for human inspection. However, its checks to see whether each test passes or fails only work correctly with the patched all.equal() in PR#8191. My coworkers and I have been using these wrapper functions for ALL code we run for many months now, with no problems so far. However, there are probably some cases we don't use, like objects with S4 classes, which don't work right with these wrappers. I assume the R core team would NOT want to use these wrapper functions, but would instead prefer to change the underlying code directly. However, I offer them as an example of one way to achieve what we believe to be the correct name-preserving behavior in R. I would appreciate any suggestions on how to better implement this name-preserving behavior for all R subscripting operations. -- Andrew Piskorski <atp at piskorski.com> http://www.piskorski.com/ --rwEMma7ioTxnRzrJ Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="fix-names.s" dtk.null <- function(...) {} # # Fix loss of dimnames when subcripting with "[": # # According to Gary Sabot <gary at sabot.com>, S-Plus originally had the # correct name-preserving behavior we want. Then in 1996 Insightful # broke that in Splus 3.4, which Gary fixed for his own use. In 2001, # Splus 6.0 broke something in Gary's fix, he posted questions to the # s-news list, generating discussion, including some shock that anyone # could think that arbitrarily losing dimnames made sense: # # http://www.biostat.wustl.edu/archives/html/s-news/2001-09/msg00020.html # # Unfortunately R seems to mimic much of the more recent buggy S-Plus # behavior! Thus our patches to both S-Plus and R below. To test # that they do the right thing, run dtk.test.brace.names(). # # Note that currently these patches wrap the stock subscripting # functions, they do NOT replace them. TODO: Investigate fixing the # stock implementation instead, especially for R. # # --atp at piskorski.com, 2005/09/27 17:03 EDT # if (!.R.) { # For S-Plus: # First make sure that if you run this twice, you still get the # real original function: data.frame.original.fcn <- get("[.data.frame", where="splus") # For S-Plus (at least version 6.2.1) we do not need to override # the "[" function, it already does the right thing. # --atp at piskorski.com, 2005/07/01 10:11 EDT "[.data.frame" <- function(x ,... ,drop=T) { # TODO: Problem: New lm() is dying with my patch, because it indexes something # and produces something that looks like it has 2 cols, but ncol returns 1. # I can detect/avoid this case since it has class = c("model.frame", "data.frame") # see test case below where I figured out this issue in case it needs further work. class.x <- class(x) caller <- sys.call(sys.parent())[[2]] if (length(class.x)==1 && class.x=="data.frame" && (mode(caller) != "name" || (caller != "value"))) { # If caller is a name and it is "value", then it is the # lhs case that we just want the original fcn to handle: result <- data.frame.original.fcn(x, ..., drop=F) if (drop && length(ncol(result) > 0) && ncol(result)==1) { save.names <- dimnames(result)[[1]] #this approach works for factors too result <- result[[1]] names(result) <- save.names # TODO: Unfortunately still broken for objects with new # style classes, since it does not distinguish among # methods that have or do not have a getnames method. # library(missing) is an example: The multiple imputations # on an object get lost if subscripted with this function. } else { if (!missing(drop) && drop && length(nrow(result)) > 0 && nrow(result)==1) { #replicate documented behavior of [.data.frame: drop=T acts #differently then missing drop arg for this case! result <- as.list(result) } } } else { result <- data.frame.original.fcn(x, ..., drop=F) } result } } else { # For R: # First make sure that if you run this twice, you still get the # real original function: # Also remove the obnoxious "drop argument will be ignored" warning # entirely from the function. I would like to regsub out the whole # warning() call, but I can't seem to get that to work. So, just # replace the first warning() call with a call to our dtk.null() # function which does nothing. Fortunately, the warning() call we # want to get rid of is indeed the first (actually the only) one: # --atp at piskorski.com, 2005/07/01 17:53 EDT brace.original.fcn <- get("[",pos="package:base") data.frame.original.fcn.0 <- get("[.data.frame",pos="package:base") data.frame.original.fcn <- data.frame.original.fcn.0 body(data.frame.original.fcn) <- parse(text=sub('warning(..?drop argument will be ignored..?)' ,'dtk.null()' ,deparse(body(data.frame.original.fcn.0)) ,ignore.case=T)) # For R (at least version 2.1.0) we need to override BOTH the # "[.data.frame" and "[" functions. # --atp at piskorski.com, 2005/07/01 10:11 EDT "[.data.frame" <- function(x ,i ,j ,... ,drop=T) { # The stock R default value for the drop arg is: # drop=(if(missing(i)) TRUE else length(names(x)) == 1) # However, that DOES cause certain differences from S-Plus, so # we do NOT use it. --atp at piskorski.com, 2005/07/01 13:18 EDT # TODO: Does above S-Plus problem with lm() also apply here? # --atp at piskorski.com, 2005/07/01 10:51 EDT class.x <- class(x) caller <- sys.call(sys.parent())[[2]] if (length(class.x)==1 && class.x=="data.frame" && (mode(caller) != "name" || (caller != "value"))) { # If caller is a name and it is "value", then it is the # lhs case that we just want the original fcn to handle: code <- 'data.frame.original.fcn(x,' if (!missing(i)) code <- paste(code ,'i' ,sep="") if (length(dim(x)) > 1 && (missing(i) || length(dim(i)) <= 1)) code <- paste(code ,',' ,sep="") if (!missing(j)) code <- paste(code ,'j' ,sep="") if (!missing(...)) code <- paste(code ,',...' ,sep="") code <- paste(code ,',drop=F' ,sep="") code <- paste(code ,')' ,sep="") #cat("Debug: code to eval: ") ; print(code) result <- eval(parse(text=code)) if (drop && length(ncol(result) > 0) && ncol(result)==1) { save.names <- dimnames(result)[[1]] #this approach works for factors too result <- result[[1]] names(result) <- save.names # TODO: Unfortunately still broken for objects with new # style classes, since it does not distinguish among # methods that have or do not have a getnames method. # library(missing) is an example: The multiple imputations # on an object get lost if subscripted with this function. } else { if (!missing(drop) && drop && length(nrow(result)) > 0 && nrow(result)==1) { #replicate documented behavior of [.data.frame: drop=T acts #differently then missing drop arg for this case! result <- as.list(result) } } } else { if (missing(i)) result <- data.frame.original.fcn(x ,... ,drop=F) else if (missing(j)) result <- data.frame.original.fcn(x ,i ,... ,drop=F) else result <- data.frame.original.fcn(x ,i ,j ,... ,drop=F) } result } # R has this problem with NA names: # # # S-Plus 6.2.1: # > foo <- c("a"=1,"b"=2,"c"=3) # > foo[c("a","c","atp")] # a c atp # 1 3 NA # # # R 2.0.0, or 2.1.0: # > foo <- c("a"=1,"b"=2,"c"=3) # > foo[c("a","c","atp")] # a c <NA> # 1 3 NA # # This is very very bad, it causes soft "you just get different # results" bugs when running our nag.optimize() from R, and # probably in many other places in our code as well. # --atp at piskorski.com, 2005/06/30 17:40 EDT "[" <- function(x ,i ,j ,... ,drop=TRUE) { # We MUST be able to blank positional arguments, and it seems # that do.call() gives us no way to do that, so use eval(): # --atp at piskorski.com, 2005/07/01 16:00 EDT code <- 'brace.original.fcn(x,' if (!missing(i)) code <- paste(code ,'i' ,sep="") if (length(dim(x)) > 1 && (missing(i) || length(dim(i)) <= 1)) code <- paste(code ,',' ,sep="") if (!missing(j)) code <- paste(code ,'j' ,sep="") if (!missing(...)) code <- paste(code ,',...' ,sep="") if (!missing(drop)) code <- paste(code ,',drop=drop' ,sep="") code <- paste(code ,')' ,sep="") #cat("Debug: code to eval: ") ; print(code) result <- eval(parse(text=code)) # This fix is being really specific, it handles this: # x[i] # But does not try to handle this: # x[i, j, ... , drop = TRUE] # --atp at piskorski.com, 2005/06/30 17:40 EDT if (is.null(attributes(x)$class) && length(list(...)) == 0) { # Just a simple index operation on a vector or list: new.names <- names(result) if (length(new.names) > 0) { # Names are present: bad.names <- is.na(new.names) if (any(bad.names)) { # Some names are NA, so fix them: names(result)[bad.names] <- brace.original.fcn(i ,bad.names) } } } result } } dtk.test.brace.names <- function (return.results.p=F ,only="all") { # Some simple test cases for our patched "[.data.frame" and "[" functions: # Note that return.results.p=T will always work correctly, but # automatically deciding if those results are CORRECT will ONLY # work if using our patched all.equal() from PR#8191, which is # available here: # http://r-bugs.biostat.ku.dk/cgi-bin/R/incoming?id=8191 # --atp at piskorski.com, 2005/10/09 13:23 EDT ### Examples of correct output for the cases below: # > (c("a"=1,"b"=2,"c"=3)[c("a","c","no")]) # a c no # 1 3 NA # > (cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6))[c(1,2),2]) # a b # 4 5 # > (cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6))[c(3,2,1),2]) # c b a # 6 5 4 # > (cbind(col1=c(a=1,b=2) ,col2=c(3:4))[,2]) # a b # 3 4 # > (cbind(col1=c(a=1,b=2) ,col2=c(3:4))[2,]) # col1 col2 # 2 4 # > (cbind(col1=c(a=1,b=2) ,col2=c(3:4))[2,,drop=T]) # col1 col2 # 2 4 # > (cbind(col1=c(a=1,b=2) ,col2=c(3:4))[2,,drop=F]) # col1 col2 # b 2 4 mm <- matrix(1:25 ,nrow=5 ,ncol=5 ,dimnames=list(letters[1:5],letters[6:10])) idx <- 1:(dim(mm)[1]) mm.df <- as.data.frame(mm) results <- list( "vec.1"=list((c("a"=1,"b"=2,"c"=3)[c("a","c","no")]) ,c(a=1 ,c=3 ,no=NA) ) ,"diag.1"=list( mm[cbind(idx,idx)] ,c(1,7,13,19,25) ) ,"diag.2"=list(mm.df[cbind(idx,idx)] ,c(1,7,13,19,25) ) ,"df.a.1"=list(( cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6)) [c(1,2),2]) ,c(a=4,b=5) ) ,"df.b.1"=list((data.frame(cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6)))[c(1,2),2]) ,c(a=4,b=5) ) ,"df.a.2"=list(( cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6)) [c(3,2,1),2]) ,c(c=6,b=5,a=4) ) ,"df.b.2"=list((data.frame(cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6)))[c(3,2,1),2]) ,c(c=6,b=5,a=4) ) ,"df.a.3"=list(( cbind(col1=c(a=1,b=2) ,col2=c(3:4)) [,2]) ,c(a=3,b=4) ) ,"df.b.3"=list((data.frame(cbind(col1=c(a=1,b=2) ,col2=c(3:4)))[,2]) ,c(a=3,b=4) ) ,"df.a.4"=list(( cbind(col1=c(a=1,b=2) ,col2=c(3:4)) [2,]) ,c(col1=2,col2=4) ) ,"df.b.4"=list((data.frame(cbind(col1=c(a=1,b=2) ,col2=c(3:4)))[2,]) ,data.frame(col1=c(b=2) ,col2=c(b=4)) ) ,"df.a.5"=list(( cbind(col1=c(a=1,b=2) ,col2=c(3:4)) [2,,drop=T]) ,c( col1=2,col2=4) ) ,"df.b.5"=list((data.frame(cbind(col1=c(a=1,b=2) ,col2=c(3:4)))[2,,drop=T]) ,list(col1=2,col2=4) ) ,"df.a.6"=list(( cbind(col1=c(a=1,b=2) ,col2=c(3:4)) [2,,drop=F]) ,cbind(col1=c(b=2) ,col2=c(4)) ) ,"df.b.6"=list((data.frame(cbind(col1=c(a=1,b=2) ,col2=c(3:4)))[2,,drop=F]) ,data.frame(cbind(col1=c(b=2) ,col2=c(4))) ) ) # TODO: I don't know how to check these cases: # --atp at piskorski.com, 2005/07/01 10:34 EDT # > (data.frame(a=letters[1:3] ,b=2:4)[,1]) # [1] a b c # > structure(.Data = c(1, 2, 3) ,levels = c("a", "b", "c") ,class = "factor" ,names = c("1", "2", "3")) # [1] a b c # > dput(data.frame(a=letters[1:3], b=2:4)[,1]) # structure(.Data = c(1, 2, 3) # , levels = c("a", "b", "c") # , class = "factor" # , names = c("1", "2", "3") # ) # > library("missing") # > data.frame.original.fcn(cholesterolImpExample,,3) # [1] NA NA NA NA NA NA NA NA NA 156 242 256 142 216 248 168 236 200 264 264 # [21] 188 182 294 214 198 256 280 204 # # miVariable object with 5 sets of multiple imputations # 1 2 3 4 5 # 2.chol14 190.7459 209.3937 220.1499 213.2871 218.0670 # 4.chol14 156.6425 101.6117 173.6432 129.3747 140.9941 # 5.chol14 176.8996 257.0628 157.2997 227.3769 173.9429 # 10.chol14 255.8360 275.2017 284.1347 257.6721 289.8643 # 13.chol14 252.9045 209.9261 257.6222 228.4710 270.7223 # 16.chol14 298.6389 252.4810 259.4355 332.3279 287.4087 # 18.chol14 180.9732 204.0033 194.7192 199.9087 200.9603 # 23.chol14 219.6375 216.8488 125.2673 213.3161 263.5186 # 25.chol14 268.8816 289.2164 229.5734 273.3548 240.9536 # > # > cholesterolImpExample[,3] # 2 4 5 10 13 16 18 23 25 1 3 6 7 8 9 11 12 14 15 17 19 20 21 # NA NA NA NA NA NA NA NA NA 156 242 256 142 216 248 168 236 200 264 264 188 182 294 # # 22 24 26 27 28 # 214 198 256 280 204 # # miVariable object with 5 sets of multiple imputations # 1 2 3 4 5 # 2.chol14 190.7459 209.3937 220.1499 213.2871 218.0670 # 4.chol14 156.6425 101.6117 173.6432 129.3747 140.9941 # 5.chol14 176.8996 257.0628 157.2997 227.3769 173.9429 # 10.chol14 255.8360 275.2017 284.1347 257.6721 289.8643 # 13.chol14 252.9045 209.9261 257.6222 228.4710 270.7223 # 16.chol14 298.6389 252.4810 259.4355 332.3279 287.4087 # 18.chol14 180.9732 204.0033 194.7192 199.9087 200.9603 # 23.chol14 219.6375 216.8488 125.2673 213.3161 263.5186 # 25.chol14 268.8816 289.2164 229.5734 273.3548 240.9536 # > tt <- rep(F ,times=length(results)) ; names(tt) <- names(results) for (jj in 1:length(results)) { tmp <- all.equal(results[[jj]][[1]] ,results[[jj]][[2]]) tt[jj] <- (length(tmp) == 1 && tmp == T) results[[jj]][3] <- list("pass"=as.vector(tt[jj])) } if (return.results.p) { if (only == "bad") results[!tt] else if (only == "good") results[tt] else results } else if (all(tt)) T else F } --rwEMma7ioTxnRzrJ--
maechler@stat.math.ethz.ch
2005-Oct-19 12:34 UTC
[Rd] [ subscripting sometimes loses names (PR#8192)
Andy, that's interesting, but honestly your posting only *talked* about your perceptions of bogous behavior of R and gave link to a quite extensive S source file --- which re-defines basic functions so it's not a file I'd just want to source into my R session. Proper R bug reports provide short "cut & paste" executable example code {i.e. no prompt, no output} or at least the transcript of such code {transcript : input (+ prompt) + output}. Also your script is for R and S-plus and at least in some places it seems you think R has a bug because it behaves differently than S or S-plus. Now I'm sure you know from the R-FAQ that there are quite a few intentional differences between the two dialects of S, and dealing with data frames is definitely one situation where we have tried to do better than "the prototype", so we would say the bug is with S(-plus). In spite of all the above, I'd well expect that you still know about problematic or even bogous behavior of "[" subscripting, but we'd rather see small reproducible code snippets rather than scripts that redefine "[" and "[.data.frame" and further assume a patched all.equal().. Best regards, Martin Maechler
atp@piskorski.com
2005-Oct-19 19:46 UTC
[Rd] [ subscripting sometimes loses names (PR#8192)
On Wed, Oct 19, 2005 at 02:33:50PM +0200, Martin Maechler wrote:> Proper R bug reports provide short "cut & paste" executable > example code {i.e. no prompt, no output} or at least the > transcript of such code {transcript : input (+ prompt) + output}.My patch includes the function dtk.test.brace.names() which demonstrates the problem. If you source just that function into a completely stock R, you can see the losing names problem by running: dtk.test.brace.names(return.results.p=T ,only="all") To make it easier to see just what the problem is, I'll send example output in my next email.> Also your script is for R and S-plus and at least in some places > it seems you think R has a bug because it behaves differently > than S or S-plus.No, I don't think that. If comments in my code give that impression then that's a bug in my comments, it was not my intention. My coworkers and I originally fixed the name losing problem in S-Plus, then later did so in R, so in some places I might have sloppily said, "R is different than S-Plus" when what I REALLY meant was, "Stock R is different than our fixed/patched S-Plus where we've already solved these name-losing problems." Stock S-Plus and R both suffer from losing names when they shouldn't. Since I use both dialects, I've included (ugly) fixes for both. Of course you probably only care about the R part, but I didn't think it would hurt to include both.> Now I'm sure you know from the R-FAQ that there are quite a few > intentional differences between the two dialects of S,Yes, I'm aware of that FAQ. I also just finished porting a large body of code from S-Plus to R a few months ago, so I have a very concrete appreciation of the MANY little S-Plus vs. R differences, many more than are mentioned in that FAQ. Some of those differences are simply arbitrary or accidental, but others are places where S-Plus was basically doing something dumb and the R behavior is better. I have no complaints about this. :) (The converse, where R's behavior is definitely inferior to that of S-Plus, seems to be a lot less common, and are usually more minor.) -- Andrew Piskorski <atp at piskorski.com> http://www.piskorski.com/
atp@piskorski.com
2005-Oct-19 19:50 UTC
[Rd] [ subscripting sometimes loses names (PR#8192)
Here is an example of the losing names problem in stock R 2.2.0. Note that below, only stock R packages are loaded, and then I manually source in just my dtk.test.brace.names() testing function, nothing else. Since the list-of-lists output of dtk.test.brace.names() is very lengthy, I've manually cut-and-pasted it into a tabular format to save space and make inspection easier. As you can see, out of its 15 test cases, stock R 2.2.0 fails 4 of them while the other 12 are Ok. Too see what these simple subscripting tests actually DO, please refer to the body of dtk.test.brace.names() from my previous emails above. R : Copyright 2005, The R Foundation for Statistical Computing Version 2.2.0 (2005-10-06 r35749)> search()[1] ".GlobalEnv" "package:methods" "package:graphics" [4] "package:grDevices" "package:datasets" "package:utils" [7] "package:stats" "Autoloads" "package:base"> dtk.test.brace.names(return.results.p=T ,only="all")Ok? Actual Result Desired Result --- ------------------ ------------------ $vec.1 BAD $vec.1[[1]] $vec.1[[2]] a c <NA> a c no 1 3 NA 1 3 NA $diag.1 Ok $diag.1[[1]] $diag.1[[2]] [1] 1 7 13 19 25 [1] 1 7 13 19 25 $diag.2 Ok $diag.2[[1]] $diag.2[[2]] [1] 1 7 13 19 25 [1] 1 7 13 19 25 $df.a.1 Ok $df.a.1[[1]] $df.a.1[[2]] a b a b 4 5 4 5 $df.b.1 BAD $df.b.1[[1]] $df.b.1[[2]] [1] 4 5 a b 4 5 $df.a.2 Ok $df.a.2[[1]] $df.a.2[[2]] c b a c b a 6 5 4 6 5 4 $df.b.2 BAD $df.b.2[[1]] $df.b.2[[2]] [1] 6 5 4 c b a 6 5 4 $df.a.3 Ok $df.a.3[[1]] $df.a.3[[2]] a b a b 3 4 3 4 $df.b.3 BAD $df.b.3[[1]] $df.b.3[[2]] [1] 3 4 a b 3 4 $df.a.4 Ok $df.a.4[[1]] $df.a.4[[2]] col1 col2 col1 col2 2 4 2 4 $df.b.4 Ok $df.b.4[[1]] $df.b.4[[2]] col1 col2 col1 col2 b 2 4 b 2 4 $df.a.5 Ok $df.a.5[[1]] $df.a.5[[2]] col1 col2 col1 col2 2 4 2 4 $df.b.5 Ok $df.b.5[[1]] $df.b.5[[2]] $df.b.5[[1]]$col1 $df.b.5[[2]]$col1 [1] 2 [1] 2 $df.b.5[[1]]$col2 $df.b.5[[2]]$col2 [1] 4 [1] 4 $df.a.6 Ok $df.a.6[[1]] $df.a.6[[2]] col1 col2 col1 col2 b 2 4 b 2 4 $df.b.6 Ok $df.b.6[[1]] $df.b.6[[2]] col1 col2 col1 col2 b 2 4 b 2 4 -- Andrew Piskorski <atp at piskorski.com> http://www.piskorski.com/