Hi! I came across this unexpected behaviour in R. First I thought it was a bug in the assignment operator <- but now I think it's maybe a bug in the way data frames are being printed. What do you think?

Using R 4.3.1:

> x <- data.frame(A = 1, B = 2, C = 3)
> y <- data.frame(A = 1)
> x
  A B C
1 1 2 3
> x$B <- y$A # works as expected
> x
  A B C
1 1 1 3
> x$C <- y[1] # makes C disappear
> x
  A B A
1 1 1 1
> str(x)
'data.frame': 1 obs. of 3 variables:
 $ A: num 1
 $ B: num 1
 $ C:'data.frame': 1 obs. of 1 variable:
  ..$ A: num 1

Why does the print(x) not show "C" as the name of the third element? I did mess up the data frame (and this was a mistake on my part), but finding the bug was harder because print(x) didn't show the C any longer.

Thanks. With best wishes -

. . . Christian
I would say this is not an error, but I think what you wrote isn't what you intended to do anyway.

y[1] is a data.frame which contains only the first column of y, which you assign to x$C, so now x$C is a data.frame. R allows the columns of a data.frame to be plain vectors as well as matrices and data.frames, basically anything as long as it has the correct length or nrow.

When the data.frame is formatted for printing, each column (including C) is formatted and the results are then column-bound into another data.frame using as.data.frame.list, so the third column takes the name A because that's the name of the column from y.

I think what you meant to do is

x$C <- y[[1]] ## double brackets instead of single
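To make the difference between the two subsetting operators concrete, here is a minimal sketch that reuses the x and y from the original post. It only relies on base R; the comments state what each call returns.

```r
x <- data.frame(A = 1, B = 2, C = 3)
y <- data.frame(A = 1)

# Single brackets keep the list structure: the result is still a data.frame
class(y[1])    # "data.frame"

# Double brackets extract the column itself: a plain numeric vector
class(y[[1]])  # "numeric"

# Assigning the extracted vector keeps C an ordinary column
x$C <- y[[1]]
names(x)        # "A" "B" "C"
is.numeric(x$C) # TRUE
```

With this assignment print(x) shows the header A B C again, because no column is itself a data.frame.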
On 25/10/2023 2:18 a.m., Christian Asseburg wrote:
> [...]
> Why does the print(x) not show "C" as the name of the third element? I did mess up the data frame (and this was a mistake on my part), but finding the bug was harder because print(x) didn't show the C any longer.

y[1] is a dataframe with one column, i.e. it is identical to y. To get the result you expected, you should have used y[[1]], to extract column 1.

Since dataframes are lists, you can assign them as columns of other dataframes, and you'll create a single column in the result whose rows are the columns of the dataframe you're assigning. This means that x$C <- y[1] replaces the C column of x with a dataframe. It retains the name C (you can see this if you print names(x)), but since the column contains a dataframe, it chooses to use the column name of y when printing.

If you try x$D <- x you'll see it generate new names when printing, but the names within x remain as A, B, C, D.

This is a situation where tibbles do a better job than dataframes: if you created x and y as tibbles instead of dataframes and executed your code, you'd see this:

library(tibble)
x <- tibble(A = 1, B = 2, C = 3)
y <- tibble(A = 1)
x$C <- y[1]
x
#> # A tibble: 1 × 3
#>       A     B   C$A
#>   <dbl> <dbl> <dbl>
#> 1     1     2     1

Duncan Murdoch
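The point that the name C is retained even though print() hides it can be checked directly. The following is a small sketch in base R; the vapply() call is one possible way to spot columns that are themselves data frames and is not something proposed in the thread.

```r
x <- data.frame(A = 1, B = 2, C = 3)
y <- data.frame(A = 1)
x$C <- y[1]

# The outer names are untouched by the assignment
names(x)  # "A" "B" "C"

# Flag columns that are themselves data frames
nested <- vapply(x, is.data.frame, logical(1))
names(x)[nested]  # "C"
```

Running a check like this after building a data frame is a cheap way to catch an accidental y[1] before the printed header becomes confusing.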
Hello,

To expand on the good answers already given, I will present two other example data sets.

Example 1.

Imagine that instead of assigning just one column from y to x$C you assign two columns. The result is a data.frame column. See what is displayed as the column names.

And unlike what happens with `[`, the operator `[[` doesn't work when you try to extract columns 1:2 for the assignment. You will have to extract the columns y$A and y$B one by one.

x <- data.frame(A = 1, B = 2, C = 3)
y <- data.frame(A = 1, B = 4)
str(y)
#> 'data.frame': 1 obs. of 2 variables:
#>  $ A: num 1
#>  $ B: num 4

x$C <- y[1:2]
x
#>   A B C.A C.B
#> 1 1 2   1   4

str(x)
#> 'data.frame': 1 obs. of 3 variables:
#>  $ A: num 1
#>  $ B: num 2
#>  $ C:'data.frame': 1 obs. of 2 variables:
#>   ..$ A: num 1
#>   ..$ B: num 4

y[[1:2]] # doesn't work
#> Error in .subset2(x, i, exact = exact): subscript out of bounds

Example 2.

Sometimes it is useful to get a result like this first and then correct the resulting df, for instance when computing more than one summary statistic.

str(agg) below shows that the result of the summary statistics is a matrix, so you have a matrix column. And once again the displayed names reflect that.

The trick to make the result a df is to extract all but the last column as a sub-df, extract the last column's values as a matrix (which it is) and then cbind the two together.

cbind is a generic function. Since the first argument to cbind is a sub-df, the method called is cbind.data.frame and the result is a df.

df1 <- data.frame(A = rep(c("a", "b", "c"), 5L), X = 1:30)

# the anonymous function computes more than one summary statistic
# note that it returns a named vector
agg <- aggregate(X ~ A, df1, \(x) c(Mean = mean(x), S = sd(x)))
agg
#>   A    X.Mean       X.S
#> 1 a 14.500000  9.082951
#> 2 b 15.500000  9.082951
#> 3 c 16.500000  9.082951

# similar effect as in the OP. The difference is that the last
# column is a matrix, not a data.frame
str(agg)
#> 'data.frame': 3 obs. of 2 variables:
#>  $ A: chr "a" "b" "c"
#>  $ X: num [1:3, 1:2] 14.5 15.5 16.5 9.08 9.08 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:2] "Mean" "S"

# nc is just a convenience, avoids repeated calls to ncol
nc <- ncol(agg)
cbind(agg[-nc], agg[[nc]])
#>   A Mean        S
#> 1 a 14.5 9.082951
#> 2 b 15.5 9.082951
#> 3 c 16.5 9.082951

# all is well
cbind(agg[-nc], agg[[nc]]) |> str()
#> 'data.frame': 3 obs. of 3 variables:
#>  $ A   : chr "a" "b" "c"
#>  $ Mean: num 14.5 15.5 16.5
#>  $ S   : num 9.08 9.08 9.08

If the anonymous function hadn't returned a named vector, the new column names would have been "1", "2", try it.

Hope this helps,

Rui Barradas
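As a side note to Example 2, another commonly used way to flatten a matrix column is to pass the data frame's columns back through data.frame(). This is a sketch of that alternative, not the method used in the message above; it rebuilds df1 and agg so that it runs on its own, and uses function(x) instead of \(x) so it also works on R versions older than 4.1.

```r
df1 <- data.frame(A = rep(c("a", "b", "c"), 5L), X = 1:30)
agg <- aggregate(X ~ A, df1, function(x) c(Mean = mean(x), S = sd(x)))

# agg has two columns, the second of which is a 3 x 2 matrix
ncol(agg)         # 2
is.matrix(agg$X)  # TRUE

# data.frame() splits the matrix column into one column per matrix column,
# prefixing the matrix column names with the outer name X
flat <- do.call(data.frame, agg)
names(flat)       # "A" "X.Mean" "X.S"
```

The difference from the cbind() approach is the naming: here the statistics keep the X. prefix, which may be preferable when several variables are aggregated at once.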
On Wed, 25 Oct 2023 09:18:26 +0300, "Christian Asseburg" <rhelp at moin.fi> wrote:

> > str(x)
> 'data.frame': 1 obs. of 3 variables:
>  $ A: num 1
>  $ B: num 1
>  $ C:'data.frame': 1 obs. of 1 variable:
>   ..$ A: num 1
>
> Why does the print(x) not show "C" as the name of the third element?

Interesting problem. print.data.frame() calls format.data.frame() to prepare its argument for printing, which in turn calls as.data.frame.list() to reconstruct a data.frame from the formatted arguments, which in turn uses data.frame() to actually construct the object.

data.frame() is able to return combined column names, but only if the inner data.frame has more than one column:

names(data.frame(A = 1:3, B = data.frame(C = 4:6, D = 7:9)))
# [1] "A" "B.C" "B.D"
names(data.frame(A = 1:3, B = data.frame(C = 4:6)))
# [1] "A" "C"

This matches the behaviour documented in ?data.frame:

>> For a named or unnamed matrix/list/data frame argument that contains
>> a single column, the column name in the result is the column name in
>> the argument.

Still, changing the presentational code like print.data.frame() or format.data.frame() could be safe.
I've tried writing a patch for format.data.frame(), but it looks clumsy and breaks regression tests (that do actually check capture.output()):

--- src/library/base/R/format.R (revision 85459)
+++ src/library/base/R/format.R (working copy)
@@ -243,8 +243,16 @@
     if(!nc) return(x) # 0 columns: evade problems, notably for nrow() > 0
     nr <- .row_names_info(x, 2L)
     rval <- vector("list", nc)
-    for(i in seq_len(nc))
+    for(i in seq_len(nc)) {
         rval[[i]] <- format(x[[i]], ..., justify = justify)
+        # avoid data.frame(foo = data.frame(bar = ...)) overwriting
+        # the single column name
+        if (
+            identical(ncol(rval[[i]]), 1L) &&
+            !is.null(colnames(rval[[i]])) &&
+            colnames(rval[[i]]) != ''
+        ) colnames(rval[[i]]) <- paste(names(x)[[i]], colnames(rval[[i]]), sep = '.')
+    }
     lens <- vapply(rval, NROW, 1)
     if(any(lens != nr)) { # corrupt data frame, must have at least one column
         warning("corrupt data frame: columns will be truncated or padded with NAs")

Is it worth changing the behaviour of {print,format}.data.frame() (and fixing the regression tests to accept the new behaviour), or would that break too much?

--
Best regards,
Ivan
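Until something like the patch above is adopted, a similar effect can be approximated in user code without touching base R. The following is only a sketch: qualify_nested_names() is a hypothetical helper invented for this illustration, and it modifies the inner column names of a copy of the data rather than changing how format.data.frame() works.

```r
# Hypothetical helper (not part of base R): for every column that is a
# single-column data frame, prefix the inner column name with the outer one,
# mirroring the paste(..., sep = ".") step of the proposed patch.
qualify_nested_names <- function(df) {
  for (i in seq_along(df)) {
    col <- df[[i]]
    if (is.data.frame(col) && ncol(col) == 1L) {
      names(col) <- paste(names(df)[i], names(col), sep = ".")
      df[[i]] <- col
    }
  }
  df
}

x <- data.frame(A = 1, B = 2, C = 3)
y <- data.frame(A = 1)
x$C <- y[1]

x2 <- qualify_nested_names(x)
names(x2)    # "A" "B" "C"  (outer names are unchanged)
names(x2$C)  # "C.A"        (inner name now carries the prefix)
```

Because printing takes the header of a single-column nested data frame from its inner name, print(x2) should then show the qualified name instead of a second A. The helper is meant for inspection only, since it alters the nested column names in the returned copy.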