thr3ads.net - R devel - [Rd] Unexpected alteration of data frame column names [May 2007]

Herve Pages

2007-May-15 06:59 UTC

[Rd] Unexpected alteration of data frame column names

Hi,

I'm using data.frame(..., check.names=FALSE), because I want to create
a data frame with duplicated column names (in the real life you can get such
data frame as the result of an SQL query):

  > df <- data.frame(aa=1:5, aa=9:5, check.names=FALSE)
  > df
    aa aa
  1  1  9
  2  2  8
  3  3  7
  4  4  6
  5  5  5

Why is [.data.frame changing my column names?

  > df[1:3, ]
    aa aa.1
  1  1    9
  2  2    8
  3  3    7

How can this be avoided? Thanks!

H.

Marc Schwartz

2007-May-15 18:25 UTC

On Mon, 2007-05-14 at 23:59 -0700, Herve Pages wrote:
> Hi,
>
> I'm using data.frame(..., check.names=FALSE), because I want to create
> a data frame with duplicated column names (in the real life you can get such
> data frame as the result of an SQL query):
> 
>   > df <- data.frame(aa=1:5, aa=9:5, check.names=FALSE)
>   > df
>     aa aa
>   1  1  9
>   2  2  8
>   3  3  7
>   4  4  6
>   5  5  5
> 
> Why is [.data.frame changing my column names?
> 
>   > df[1:3, ]
>     aa aa.1
>   1  1    9
>   2  2    8
>   3  3    7
> 
> How can this be avoided? Thanks!
> 
> H.

Herve,

I had not seen a reply to your post, but you can review the code for
"[.data.frame" by using:

  getAnywhere("[.data.frame")

and see where there are checks for duplicate column names in the
function.

That is going to be the default behavior for data frame
subsetting/extraction and in fact is noted in the 'ONEWS' file for R
version 1.8.0:

 - Subsetting a data frame can no longer produce duplicate
   column names.

So it has been around for some time (October of 2003).

In terms of avoiding it, I suspect that you would have to create your
own version of the function, perhaps with an additional argument that
enables/disables the duplicate column name checks.

I have not, however, considered the broader functional implications of
doing so, so be vewwy vewwy careful here.
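
As a rough illustration of that idea (a sketch only, not code from this
thread): rather than replacing "[.data.frame" itself, a small wrapper can
subset by position and then put the original names back. The name
'subset_keep_names' and its arguments are hypothetical, and the sketch
only handles positive numeric column indices.

```r
# Hypothetical helper (not part of base R): subset rows and columns by
# position, then restore the original, possibly duplicated, column names.
subset_keep_names <- function(x, i = seq_len(nrow(x)), j = seq_along(x)) {
  stopifnot(is.data.frame(x), is.numeric(j), all(j >= 1))
  out <- x[i, j, drop = FALSE]   # "[.data.frame" may make the names unique here
  names(out) <- names(x)[j]      # put back the names of the selected columns
  out
}

df  <- data.frame(aa = 1:5, aa = 9:5, check.names = FALSE)
sub <- subset_keep_names(df, 1:3)
names(sub)
```

Because the names are reassigned after the subsetting call, the result
does not depend on whether a given R version renames the columns in that
call. Any later subsetting of 'sub' goes through "[.data.frame" again, so
the same caveat applies there.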

HTH,

Marc Schwartz

hpages at fhcrc.org

2007-May-18 02:34 UTC

Hi,

Thanks to both for your answers!

Quoting Marc Schwartz <marc_schwartz at comcast.net>:
> On Thu, 2007-05-17 at 10:54 +0100, Prof Brian Ripley wrote:
> > To add to Marc's detective work. ?"[.data.frame" does say
> >
> >       If '[' returns a data frame it will have unique (and non-missing)
> >       row names, if necessary transforming the row names using
> >       'make.unique'.  Similarly, column names will be transformed (if
> >       columns are selected more than once).
> >
> > Now, an 'e.g.' in the parenthetical remark might make this clearer (since
> > added), but I don't see why this was 'unexpected' (or why this is an issue

It all depends on whether you care about consistency or not. Personally,
I do. Yes, documenting inconsistencies is better than nothing, but it is
not always enough to make the language predictable (see below).

So, according to ?"[.data.frame", column names will be transformed (if
columns are selected more than once). OK.
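
For reference, the transformation that the help text refers to can be
seen by calling 'make.unique' directly on a vector of names. This is only
a small sketch of that one function, not of what "[.data.frame" does
internally:

```r
# make.unique() appends ".1", ".2", ... to repeated elements and leaves
# the first occurrence of each name untouched.
nms <- c("aa", "bb", "aa")
make.unique(nms)
# [1] "aa"   "bb"   "aa.1"
```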

Personally, I can see only 2 reasonable semantics for 'df[ ]' or
'df[ , ]':

  (1) either it makes an exact copy of your data frame (and this is not
      only true for data frames: unless documented otherwise one can
      expect x[] to be the same as x),

  (2) or you consider that it is equivalent to 'df[names(df)]' for
      the former and to 'df[ , names(df)]' for the latter.

So it seems that for 'df[ ]', we have semantics (1):

  > df=data.frame(aa=LETTERS[1:3],bb=3:5,aa=7:5,check.names=FALSE)
  > df
    aa bb aa
  1  A  3  7
  2  B  4  6
  3  C  5  5
  > df[]
    aa bb aa
  1  A  3  7
  2  B  4  6
  3  C  5  5

Since we have duplicated colnames, 'df[names(df)]' will select
the first column twice and rename it (as documented):

  > df[names(df)]
    aa bb aa.1
  1  A  3    A
  2  B  4    B
  3  C  5    C

Good!
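
A quick way to confirm that the third column really is a copy of the
first, and not the original third column, is to compare the columns by
position. This is just a sketch of such a check; the variable name 'res'
is made up for the example:

```r
df  <- data.frame(aa = LETTERS[1:3], bb = 3:5, aa = 7:5, check.names = FALSE)
res <- df[names(df)]            # character indexing matches the first "aa" twice

names(res)                      # "aa" "bb" "aa.1"
identical(res[[3]], df[[1]])    # TRUE: third column is a copy of the first
identical(res[[3]], df[[3]])    # FALSE: the original third column is not selected
```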

Now with 'df[ , ]', I still maintain that this is unexpected:

  > df[ , ]
    aa bb aa.1
  1  A  3    7
  2  B  4    6
  3  C  5    5

This is a mix of semantics (1) and semantics (2): the 3rd column has been
renamed but its data are the _original_ data. With semantics (2), you
would get this:

  > df[ , names(df)]
    aa bb aa.1
  1  A  3    A
  2  B  4    B
  3  C  5    C

Also the fact that 'df[something]' doesn't behave like 'df[,something]'
is IMHO another inconsistency...

Hope you don't mind if I put this back on R-devel, which is probably
the right place to discuss language semantics.

Cheers,
H.
> >
> > for R-devel).
> > 
> > On Tue, 15 May 2007, Marc Schwartz wrote:
> > 
> > > On Mon, 2007-05-14 at 23:59 -0700, Herve Pages wrote:
> > >> Hi,
> > >>
> > >> I'm using data.frame(..., check.names=FALSE), because I want to create
> > >> a data frame with duplicated column names (in the real life you can get such
> > >> data frame as the result of an SQL query):
> > 
> > That depends on the interface you are using.
> > 
> > >>  > df <- data.frame(aa=1:5, aa=9:5, check.names=FALSE)
> > >>  > df
> > >>     aa aa
> > >>   1  1  9
> > >>   2  2  8
> > >>   3  3  7
> > >>   4  4  6
> > >>   5  5  5
> > >>
> > >> Why is [.data.frame changing my column names?
> > >>
> > >>  > df[1:3, ]
> > >>     aa aa.1
> > >>   1  1    9
> > >>   2  2    8
> > >>   3  3    7
> > >>
> > >> How can this be avoided? Thanks!
> > >>
> > >> H.
> > >
> > > Herve,
> > >
> > > I had not seen a reply to your post, but you can review the code for
> > > "[.data.frame" by using:
> > >
> > >  getAnywhere("[.data.frame")
> > >
> > > and see where there are checks for duplicate column names in the
> > > function.
> > >
> > > That is going to be the default behavior for data frame
> > > subsetting/extraction and in fact is noted in the 'ONEWS' file for R
> > > version 1.8.0:
> > >
> > > - Subsetting a data frame can no longer produce duplicate
> > >   column names.
> > >
> > > So it has been around for some time (October of 2003).
> > >
> > > In terms of avoiding it, I suspect that you would have to create your
> > > own version of the function, perhaps with an additional argument that
> > > enables/disables that duplicate column name checks.
> > >
> > > I have not however considered the broader functional implications of
> > > doing so however, so be vewwy vewwy careful here.
> > 
> > Namespace issues would mean that your version would hardly ever be used.
> 
> I suspected that namespaces might be an issue here, but had not pursued
> that line of thinking beyond an initial 'gut feel'.
> 
> Thanks,
> 
> Marc
> 
> 
>
