thr3ads.net - R help - [R] Bug in print for data frames? [Oct 2023]

Christian Asseburg

2023-Oct-25 06:18 UTC

[R] Bug in print for data frames?

Hi! I came across this unexpected behaviour in R. First I thought it was a bug
in the assignment operator <- but now I think it's maybe a bug in the way
data frames are being printed. What do you think?

Using R 4.3.1:

> x <- data.frame(A = 1, B = 2, C = 3)
> y <- data.frame(A = 1)
> x
  A B C
1 1 2 3
> x$B <- y$A # works as expected
> x
  A B C
1 1 1 3
> x$C <- y[1] # makes C disappear
> x
  A B A
1 1 1 1
> str(x)
'data.frame':   1 obs. of  3 variables:
 $ A: num 1
 $ B: num 1
 $ C:'data.frame':      1 obs. of  1 variable:
  ..$ A: num 1

Why does the print(x) not show "C" as the name of the third element? I
did mess up the data frame (and this was a mistake on my part), but finding the
bug was harder because print(x) didn't show the C any longer.

Thanks. With best wishes -

. . . Christian

Iris Simmons

2023-Oct-26 08:46 UTC

I would say this is not an error, but I think what you wrote isn't
what you intended to do anyway.

y[1] is a data.frame which contains only the first column of y, which
you assign to x$C, so now x$C is a data.frame.

R allows the columns of a data.frame to be plain vectors as well as
matrices and data.frames, basically anything as long as it has the
correct length or nrow.

When the data.frame is formatted for printing, each column is
formatted and the results are column-bound into another data.frame
using as.data.frame.list, so the third column takes the name A because
that's the name of the column from y.

I think what you meant to do is x$C <- y[[1]]  ## double brackets
instead of single
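
A minimal sketch of the difference, using the x and y from the original
post and base R only (the comments state what each call returns):

```r
x <- data.frame(A = 1, B = 2, C = 3)
y <- data.frame(A = 1)

class(y[1])     # "data.frame": single brackets keep the data.frame wrapper
class(y[[1]])   # "numeric": double brackets extract the column itself

x$C <- y[[1]]   # assigns a plain numeric vector, not a nested data.frame
names(x)        # "A" "B" "C"
str(x)
```

With the double brackets, str(x) should list three plain numeric
columns and the printed header should keep the name C.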


Duncan Murdoch

2023-Oct-26 08:55 UTC

y[1] is a dataframe with one column, i.e. it is identical to y.  To get 
the result you expected, you should have used y[[1]], to extract column 1.

Since dataframes are lists, you can assign them as columns of other 
dataframes, and you'll create a single column in the result whose 
entries are the rows of the dataframe you're assigning (each entry 
spans all of its columns).  This means that

  x$C <- y[1]

replaces the C column of x with a dataframe.  It retains the name C (you 
can see this if you print names(x) ), but since the column contains a 
dataframe, it chooses to use the column name of y when printing.

If you try

  x$D <- x

you'll see it generate new names when printing, but the names within x 
remain as A, B, C, D.
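
A small sketch of how to check this without relying on the printed
header. It uses only base R; the variable name `nested` is just for
illustration:

```r
x <- data.frame(A = 1, B = 2, C = 3)
y <- data.frame(A = 1)
x$C <- y[1]

names(x)                                    # stored names: "A" "B" "C"

# Flag the columns that are themselves data frames
nested <- vapply(x, is.data.frame, logical(1))
names(x)[nested]                            # "C"

x$D <- x                                    # nest a copy of x as column D
names(x)                                    # "A" "B" "C" "D"
```

Checking names(x) or str(x) shows what is actually stored, which can
differ from the header that print(x) displays.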

This is a situation where tibbles do a better job than dataframes:  if 
you created x and y as tibbles instead of dataframes and executed your 
code, you'd see this:

   library(tibble)
   x <- tibble(A = 1, B = 2, C = 3)
   y <- tibble(A = 1)
   x$C <- y[1]
   x
   #> # A tibble: 1 × 3
   #>       A     B   C$A
   #>   <dbl> <dbl> <dbl>
   #> 1     1     2     1

Duncan Murdoch

Rui Barradas

2023-Oct-26 10:42 UTC

Hello,

To expand on the good answers already given, I will present two other 
example data sets.

Example 1. Imagine that instead of assigning just one column from y to 
x$C you assign two columns. The result is a data.frame column. See what 
is displayed as the column names.
Unlike `[`, the operator `[[` doesn't work with the index 1:2, because a 
vector index makes `[[` index recursively. You would have to extract the 
columns y$A and y$B one by one.



x <- data.frame(A = 1, B = 2, C = 3)
y <- data.frame(A = 1, B = 4)
str(y)
#> 'data.frame':    1 obs. of  2 variables:
#>  $ A: num 1
#>  $ B: num 4

x$C <- y[1:2]
x
#>   A B C.A C.B
#> 1 1 2   1   4

str(x)
#> 'data.frame':    1 obs. of  3 variables:
#>  $ A: num 1
#>  $ B: num 2
#>  $ C:'data.frame':   1 obs. of  2 variables:
#>   ..$ A: num 1
#>   ..$ B: num 4

x[[1:2]]  # doesn't work
#> Error in .subset2(x, i, exact = exact): subscript out of bounds



Example 2. Sometimes it is useful to get a result like this first and 
then correct the resulting df, for instance when computing more than 
one summary statistic.

str(agg) below shows that the summary statistics are returned as a 
matrix, so you have a matrix column. Once again the displayed names 
reflect that.

The trick to make the result a df is to extract all but the last column 
as a sub-df, extract the last column's values as a matrix (which it is) 
and then cbind the two together.

cbind is a generic function. Since the first argument to cbind is a 
sub-df, the method called is cbind.data.frame and the result is a df.



df1 <- data.frame(A = rep(c("a", "b", "c"), 5L), X = 1:30)

# the anonymous function computes more than one summary statistic
# note that it returns a named vector
agg <- aggregate(X ~ A, df1, \(x) c(Mean = mean(x), S = sd(x)))
agg
#>   A    X.Mean       X.S
#> 1 a 14.500000  9.082951
#> 2 b 15.500000  9.082951
#> 3 c 16.500000  9.082951

# similar effect as in the OP; the difference is that the last
# column is a matrix, not a data.frame
str(agg)
#> 'data.frame':    3 obs. of  2 variables:
#>  $ A: chr  "a" "b" "c"
#>  $ X: num [1:3, 1:2] 14.5 15.5 16.5 9.08 9.08 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:2] "Mean" "S"

# nc is just a convenience, avoids repeated calls to ncol
nc <- ncol(agg)
cbind(agg[-nc], agg[[nc]])
#>   A Mean        S
#> 1 a 14.5 9.082951
#> 2 b 15.5 9.082951
#> 3 c 16.5 9.082951

# all is well
cbind(agg[-nc], agg[[nc]]) |> str()
#> 'data.frame':    3 obs. of  3 variables:
#>  $ A   : chr  "a" "b" "c"
#>  $ Mean: num  14.5 15.5 16.5
#>  $ S   : num  9.08 9.08 9.08



If the anonymous function hadn't returned a named vector, the new column 
names would have been "1", "2". Try it.
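
A sketch of that experiment, reusing the data from Example 2. The names 
agg2 and res are introduced here for illustration only, and no expected 
names are written in the comments so that you can check the claim 
yourself:

```r
df1 <- data.frame(A = rep(c("a", "b", "c"), 5L), X = 1:30)

# Same aggregation as above, but the function returns an unnamed vector
agg2 <- aggregate(X ~ A, df1, \(x) c(mean(x), sd(x)))

is.matrix(agg2$X)    # the last column is still a matrix
colnames(agg2$X)     # but it has no column names now

nc <- ncol(agg2)
res <- cbind(agg2[-nc], agg2[[nc]])
names(res)           # inspect the generated column names
```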


Hope this helps,

Rui Barradas


Ivan Krylov

2023-Nov-02 11:27 UTC

On Wed, 25 Oct 2023 09:18:26 +0300,
"Christian Asseburg" <rhelp at moin.fi> wrote:

> > str(x)
> 'data.frame':   1 obs. of  3 variables:
>  $ A: num 1
>  $ B: num 1
>  $ C:'data.frame':      1 obs. of  1 variable:
>   ..$ A: num 1
>
> Why does the print(x) not show "C" as the name of the third
> element?

Interesting problem.

print.data.frame() calls format.data.frame() to prepare its argument
for printing, which in turn calls as.data.frame.list() to reconstruct a
data.frame from the formatted arguments, which in turn uses
data.frame() to actually construct the object.
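
A rough way to observe this from the console, assuming only base R. The 
variable fx is introduced here for illustration; the header reported in 
this thread for R 4.3.1 was "A B A", and other versions may differ:

```r
x <- data.frame(A = 1, B = 2, C = 3)
x$C <- data.frame(A = 1)

names(x)           # "A" "B" "C": the stored names are intact

# format() dispatches to format.data.frame(), the step that
# print.data.frame() relies on before printing
fx <- format(x)
names(fx)          # the names of the formatted copy become the header
```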

data.frame() is able to return combined column names, but only if the
inner data.frame has more than one column:

names(data.frame(A = 1:3, B = data.frame(C = 4:6, D = 7:9)))
# [1] "A"   "B.C" "B.D"
names(data.frame(A = 1:3, B = data.frame(C = 4:6)))
# [1] "A" "C"

This matches the behaviour documented in ?data.frame:

>> For a named or unnamed matrix/list/data frame argument that contains
>> a single column, the column name in the result is the column name in
>> the argument.

Still, changing the presentational code like print.data.frame() or
format.data.frame() could be safe. I've tried writing a patch for
format.data.frame(), but it looks clumsy and breaks regression tests
(that do actually check capture.output()):

--- src/library/base/R/format.R (revision 85459)
+++ src/library/base/R/format.R (working copy)
@@ -243,8 +243,16 @@
     if(!nc) return(x) # 0 columns: evade problems, notably for nrow() > 0
     nr <- .row_names_info(x, 2L)
     rval <- vector("list", nc)
-    for(i in seq_len(nc))
+    for(i in seq_len(nc)) {
        rval[[i]] <- format(x[[i]], ..., justify = justify)
+       # avoid data.frame(foo = data.frame(bar = ...)) overwriting
+       # the single column name
+       if (
+           identical(ncol(rval[[i]]), 1L) &&
+           !is.null(colnames(rval[[i]])) &&
+           colnames(rval[[i]]) != ''
+       ) colnames(rval[[i]]) <- paste(names(x)[[i]], colnames(rval[[i]]), sep = '.')
+    }
     lens <- vapply(rval, NROW, 1)
     if(any(lens != nr)) { # corrupt data frame, must have at least one column
        warning("corrupt data frame: columns will be truncated or padded with NAs")

Is it worth changing the behaviour of {print,format}.data.frame() (and
fixing the regression tests to accept the new behaviour), or would that
break too much?
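
Without touching base R, a similar effect can be sketched at user 
level. The helper below is hypothetical (prefix_nested_names is not part 
of any package): it renames the inner column of every single-column 
nested data.frame before printing, using the same "outer.inner" 
convention as the patch.

```r
# Hypothetical helper (not part of base R): prefix the name of every
# single-column data.frame nested inside df with the outer column name.
prefix_nested_names <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.data.frame(col) && ncol(col) == 1L) {
      names(col) <- paste(nm, names(col), sep = ".")
      df[[nm]] <- col
    }
  }
  df
}

x <- data.frame(A = 1, B = 2, C = 3)
x$C <- data.frame(A = 1)

x2 <- prefix_nested_names(x)
names(x2)      # outer names are unchanged: "A" "B" "C"
names(x2$C)    # inner name now carries the prefix: "C.A"
print(x2)
```

This only changes a copy used for display and leaves the original x 
untouched.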

-- 
Best regards,
Ivan
