Davis Vaughan
2021-Feb-14 14:50 UTC
[Rd] Corrupt internal row names when creating a data.frame with `attributes<-`
Hi all,

I believe that the internal row names object created at this line in
`row_names_gets()` should be using `-n`, not `n`.
https://github.com/wch/r-source/blob/b30641d3f58703bbeafee101f983b6b263b7f27d/src/main/attrib.c#L71

This can currently generate corrupt internal row names when using
`attributes<-` or `structure()`, which calls `attributes<-`.

# internal row names are typically `c(NA, -n)`
df <- data.frame(x = 1:3)
.row_names_info(df, type = 0L)
#> [1] NA -3

# using `attributes()` materializes their non-internal form
attrs <- attributes(df)
attrs
#> $names
#> [1] "x"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2 3

# let's make a data frame from scratch with `attributes<-`
data <- list(x = 1:3)
attributes(data) <- attrs

# oh no!
.row_names_info(data, type = 0L)
#> [1] NA 3

# Note: Must have `nrow(df) > 2` to demonstrate this bug, as otherwise
# internal row names are not attempted to be created in the C level
# `row_names_gets()`

Thanks,
Davis
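The report names `structure()` as a second route to the same code path but only demonstrates `attributes<-`. A minimal sketch of the `structure()` route might look like the following. The variable names `via_structure` and `info` are illustrative and not from the original message, and the sign of the second element is deliberately left unasserted, since it depends on the R version in use.

```r
# Build a data frame from a bare list via structure(), which the report
# says sets attributes through `attributes<-`.
via_structure <- structure(
  list(x = 1:3),
  class = "data.frame",
  row.names = 1:3
)

# Inspect the internal representation. The report observed `NA 3` for the
# `attributes<-` route; the sign seen here may differ between R versions.
info <- .row_names_info(via_structure, type = 0L)
print(info)

# The expanded, user-visible row names do not depend on that sign.
print(attr(via_structure, "row.names"))
```

Only the magnitude of the stored count affects what `attr()` returns here, so the difference is visible solely through `.row_names_info()`.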
Kevin Ushey
2021-Feb-16 19:05 UTC
[Rd] Corrupt internal row names when creating a data.frame with `attributes<-`
Strictly speaking, I don't think this is a "corrupt" representation, given that any APIs used to access that internal representation will call abs() on the row count encoded within. At least, as far as I can tell, there aren't any adverse downstream effects from having the row names attribute encoded with this particular internal representation.

On the other hand, the documentation in ?.row_names_info states, for the 'type' argument:

    integer. Currently type = 0 returns the internal "row.names" attribute (possibly NULL), type = 2 the number of rows implied by the attribute, and type = 1 the latter with a negative sign for 'automatic' row names.

so one could argue that it's incorrect in light of that documentation (the row names are "automatic", but the row count is not marked with a negative sign). Or perhaps this is a different "type" of internal automatic row name, since it was generated from an already-existing integer sequence rather than "automatically" in a call to data.frame().

Kevin
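To make the three `type` values in the quoted documentation concrete, here is a small sketch that queries each one on an ordinary data frame and then checks that the sign-independent accessors agree for the object rebuilt with `attributes<-`. It reuses the names `df` and `data` from the original report. It does not assert the sign stored for `data`, which is the point under discussion and may vary by R version.

```r
df <- data.frame(x = 1:3)

# type = 0L: the raw internal attribute, c(NA, -n) for automatic row names
raw_attr <- .row_names_info(df, type = 0L)

# type = 1L: the row count, negative when the row names are automatic
signed_n <- .row_names_info(df, type = 1L)

# type = 2L: the row count implied by the attribute, without the sign
plain_n <- .row_names_info(df, type = 2L)

# Rebuild through `attributes<-`, as in the original report
data <- list(x = 1:3)
attributes(data) <- attributes(df)

# Accessors that ignore the sign give the same answers for both objects
same_nrow <- nrow(data) == nrow(df)
same_names <- identical(rownames(data), rownames(df))
print(c(same_nrow, same_names))
```

Only `type = 0L` and `type = 1L` can expose a difference between the two objects, which is consistent with the observation that no downstream effects were found.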