Hi,

I'm using data.frame(..., check.names=FALSE), because I want to create
a data frame with duplicated column names (in real life you can get such a
data frame as the result of an SQL query):

  > df <- data.frame(aa=1:5, aa=9:5, check.names=FALSE)
  > df
    aa aa
  1  1  9
  2  2  8
  3  3  7
  4  4  6
  5  5  5

Why is [.data.frame changing my column names?

  > df[1:3, ]
    aa aa.1
  1  1    9
  2  2    8
  3  3    7

How can this be avoided? Thanks!

H.
On Mon, 2007-05-14 at 23:59 -0700, Herve Pages wrote:
> Hi,
>
> I'm using data.frame(..., check.names=FALSE), because I want to create
> a data frame with duplicated column names (in real life you can get such a
> data frame as the result of an SQL query):
>
> > df <- data.frame(aa=1:5, aa=9:5, check.names=FALSE)
> > df
>   aa aa
> 1  1  9
> 2  2  8
> 3  3  7
> 4  4  6
> 5  5  5
>
> Why is [.data.frame changing my column names?
>
> > df[1:3, ]
>   aa aa.1
> 1  1    9
> 2  2    8
> 3  3    7
>
> How can this be avoided? Thanks!
>
> H.

Herve,

I had not seen a reply to your post, but you can review the code for
"[.data.frame" by using:

  getAnywhere("[.data.frame")

and see where there are checks for duplicate column names in the
function.

That is going to be the default behavior for data frame
subsetting/extraction and in fact is noted in the 'ONEWS' file for R
version 1.8.0:

  - Subsetting a data frame can no longer produce duplicate
    column names.

So it has been around for some time (October of 2003).

In terms of avoiding it, I suspect that you would have to create your
own version of the function, perhaps with an additional argument that
enables/disables those duplicate column name checks.

I have not, however, considered the broader functional implications of
doing so, so be vewwy vewwy careful here.

HTH,

Marc Schwartz
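A less invasive alternative to writing a replacement for "[.data.frame" is to subset first and put the names back afterwards. The following is only a sketch, not Marc's suggestion as such: the helper name subset_rows_keep_names is hypothetical, and it covers only the case where rows are selected and every column is kept, so that the original names still line up one to one.

```r
# Hypothetical helper (not part of base R): select rows, then restore
# the original column names, duplicates included.
subset_rows_keep_names <- function(x, i) {
  stopifnot(is.data.frame(x))
  out <- x[i, , drop = FALSE]  # this step may rename duplicated columns
  names(out) <- names(x)       # all columns were kept, so lengths match
  out
}

df <- data.frame(aa = 1:5, aa = 9:5, check.names = FALSE)
res <- subset_rows_keep_names(df, 1:3)
names(res)  # "aa" "aa"
```

Because the names are reassigned after the call to "[", the result does not depend on whether the installed R version renames the columns during row subsetting.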
hpages at fhcrc.org
2007-May-18 02:34 UTC
[Rd] Unexpected alteration of data frame column names
Hi,

Thanks to both for your answers!

Quoting Marc Schwartz <marc_schwartz at comcast.net>:

> On Thu, 2007-05-17 at 10:54 +0100, Prof Brian Ripley wrote:
> > To add to Marc's detective work. ?"[.data.frame" does say
> >
> >   If '[' returns a data frame it will have unique (and non-missing)
> >   row names, if necessary transforming the row names using
> >   'make.unique'. Similarly, column names will be transformed (if
> >   columns are selected more than once).
> >
> > Now, an 'e.g.' in the parenthetical remark might make this clearer (since
> > added), but I don't see why this was 'unexpected' (or why this is an issue

It all depends whether you care about consistency or not. Personally I do.
Yes, documenting inconsistencies is better than nothing, but it is not always
enough to make the language predictable (see below).

So, according to ?"[.data.frame", column names will be transformed (if
columns are selected more than once). OK.

Personally, I can see only 2 reasonable semantics for 'df[ ]' or 'df[ , ]':

  (1) either it makes an exact copy of your data frame (and this is not
      only true for data frames: unless documented otherwise one can
      expect x[] to be the same as x),

  (2) or you consider that it is equivalent to 'df[names(df)]' for the
      former and to 'df[ , names(df)]' for the latter.

So it seems that for 'df[ ]', we have semantic (1):

  > df=data.frame(aa=LETTERS[1:3],bb=3:5,aa=7:5,check.names=FALSE)
  > df
    aa bb aa
  1  A  3  7
  2  B  4  6
  3  C  5  5
  > df[]
    aa bb aa
  1  A  3  7
  2  B  4  6
  3  C  5  5

Since we have duplicated colnames, 'df[names(df)]' will select the first
column twice and rename it (as documented):

  > df[names(df)]
    aa bb aa.1
  1  A  3    A
  2  B  4    B
  3  C  5    C

Good! Now with 'df[ , ]', I still maintain that this is unexpected:

  > df[ , ]
    aa bb aa.1
  1  A  3    7
  2  B  4    6
  3  C  5    5

This is a mix of semantic (1) and semantic (2): the 3rd column has been
renamed but its data are the _original_ data.
With semantic (2), you would get this:

  > df[ , names(df)]
    aa bb aa.1
  1  A  3    A
  2  B  4    B
  3  C  5    C

Also the fact that 'df[something]' doesn't behave like 'df[,something]' is
IMHO another inconsistency...

Hope you don't mind if I put this back on R-devel, which is probably the
right place to discuss the language semantics.

Cheers,
H.

> > for R-devel).
> >
> > On Tue, 15 May 2007, Marc Schwartz wrote:
> >
> > > On Mon, 2007-05-14 at 23:59 -0700, Herve Pages wrote:
> > >> Hi,
> > >>
> > >> I'm using data.frame(..., check.names=FALSE), because I want to create
> > >> a data frame with duplicated column names (in real life you can get
> > >> such a data frame as the result of an SQL query):
> >
> > That depends on the interface you are using.
> >
> > >> > df <- data.frame(aa=1:5, aa=9:5, check.names=FALSE)
> > >> > df
> > >>   aa aa
> > >> 1  1  9
> > >> 2  2  8
> > >> 3  3  7
> > >> 4  4  6
> > >> 5  5  5
> > >>
> > >> Why is [.data.frame changing my column names?
> > >>
> > >> > df[1:3, ]
> > >>   aa aa.1
> > >> 1  1    9
> > >> 2  2    8
> > >> 3  3    7
> > >>
> > >> How can this be avoided? Thanks!
> > >>
> > >> H.
> > >
> > > Herve,
> > >
> > > I had not seen a reply to your post, but you can review the code for
> > > "[.data.frame" by using:
> > >
> > >   getAnywhere("[.data.frame")
> > >
> > > and see where there are checks for duplicate column names in the
> > > function.
> > >
> > > That is going to be the default behavior for data frame
> > > subsetting/extraction and in fact is noted in the 'ONEWS' file for R
> > > version 1.8.0:
> > >
> > >   - Subsetting a data frame can no longer produce duplicate
> > >     column names.
> > >
> > > So it has been around for some time (October of 2003).
> > >
> > > In terms of avoiding it, I suspect that you would have to create your
> > > own version of the function, perhaps with an additional argument that
> > > enables/disables those duplicate column name checks.
> > >
> > > I have not, however, considered the broader functional implications of
> > > doing so, so be vewwy vewwy careful here.
> >
> > Namespace issues would mean that your version would hardly ever be used.
>
> I suspected that namespaces might be an issue here, but had not pursued
> that line of thinking beyond an initial 'gut feel'.
>
> Thanks,
>
> Marc
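For readers who want to reproduce the two ingredients of the renaming discussed in this thread, here is a minimal sketch using base R only. It assumes nothing beyond make.unique(), which the help page quoted above names, and match(), which is used here purely to illustrate why indexing by a duplicated name reaches the first matching column. Which subsetting calls apply the renaming may differ between R versions, so the sketch checks only the selected data and not the resulting column names.

```r
# Column names from the three-column example in the thread.
nms <- c("aa", "bb", "aa")

# make.unique() appends a numeric suffix to repeated values.
make.unique(nms)  # "aa" "bb" "aa.1"

# A name resolves to its first match, so the second "aa" points at
# position 1 again rather than position 3.
match(nms, nms)   # 1 2 1

df <- data.frame(aa = LETTERS[1:3], bb = 3:5, aa = 7:5, check.names = FALSE)
by_name <- df[names(df)]

# With list-style indexing by name, the third column of the result is a
# copy of the first column; the original third column (7:5) is not selected.
identical(by_name[[3]], df[[1]])  # TRUE
```

This corresponds to what the thread calls semantic (2): the data follow the first match for each name, and only afterwards are the names made unique.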