thr3ads.net - R devel - [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y [Feb 2018]

If this information is useful, please help other people find it:
Share via:

Scott Ritchie

2018-Feb-18 20:19 UTC

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Thanks Gabriel,

I think your suggested approach is 100% backwards compatible

Currently in the case of duplicate column names only the first can be
indexed by its name. This will always be the column appearing in by.x,
meaning the column in y with the same name cannot be accessed. Appending
".y" (suffixes[2L]) to this column means it can now be accessed, while
keeping the current behaviour of making the key columns always accessible
by using the names provided to by.x.

I've attached a new patch that has this behaviour.

Best,

Scott

On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at ucdavis.edu>
wrote:
> It seems like there is a way that is backwards compatible-ish in the sense
> mentioned and still has the (arguably, but a good argument I think) better
> behavior:
>
> if by.x is 'name', (AND by.y is not also 'name'), then
x's 'name' column
> is called name and y's 'name' column (not used int he merge) is
changed to
> name.y.
>
> Now of course this would still change output, but it would change it to
> something I think would be better, while retaining the 'merge columns
> retain their exact names' mechanic as documented.
>
> ~G
>
> On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at
gmail.com>
> wrote:
>
>> Thanks Duncan and Frederick,
>>
>> I suspected as much - there doesn't appear to be any reason why
conflicts
>> between by.x and names(y) shouldn't and cannot be checked, but I
can see
>> how this might be more trouble than its worth given it potentially may
>> break downstream packages (i.e. any cases where this occurs but they
>> expect
>> the name of the key column(s) to remain the same).
>>
>> Best,
>>
>> Scott
>>
>> On 18 February 2018 at 11:48, Duncan Murdoch <murdoch.duncan at
gmail.com>
>> wrote:
>>
>> > On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
>> >
>> >> Hi Scott,
>> >>
>> >> Thanks for the patch. I'm not really involved in R
development; it
>> >> will be up to someone in the R core team to apply it. I would
hazard
>> >> to say that even if correct (I haven't checked), it will
not be
>> >> applied because the change might break existing code. For
example it
>> >> seems like reasonable code might easily assume that a column
with the
>> >> same name as "by.x" exists in the output of
'merge'. That's just my
>> >> best guess... I don't participate on here often.
>> >>
>> >
>> >
>> > I think you're right.  If I were still a member of R Core, I
would want
>> to
>> > test this against all packages on CRAN and Bioconductor, and since
that
>> > test takes a couple of days to run on my laptop, I'd probably
never get
>> > around to it.
>> >
>> > There are lots of cases where "I would have done that
differently", but
>> > most of them are far too much trouble to change now that R is more
than
>> 20
>> > years old.  And in many cases it will turn out that the way R does
it
>> > actually does make more sense than the way I would have done it.
>> >
>> > Duncan Murdoch
>> >
>> >
>> >
>> >> Cheers,
>> >>
>> >> Frederick
>> >>
>> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
>> >>
>> >>> The attached patch.diff will make merge.data.frame()
append the
>> suffixes
>> >>> to
>> >>> columns with common names between by.x and names(y).
>> >>>
>> >>> Best,
>> >>>
>> >>> Scott Ritchie
>> >>>
>> >>> On 17 February 2018 at 11:15, Scott Ritchie
<s.ritchie73 at gmail.com>
>> >>> wrote:
>> >>>
>> >>> Hi Frederick,
>> >>>>
>> >>>> I would expect that any duplicate names in the
resulting data.frame
>> >>>> would
>> >>>> have the suffixes appended to them, regardless of
whether or not they
>> >>>> are
>> >>>> used as the join key. So in my example I would expect
"names.x" and
>> >>>> "names.y" to indicate their source
data.frame.
>> >>>>
>> >>>> While careful reading of the documentation reveals
this is not the
>> >>>> case, I
>> >>>> would argue the intent of the suffixes functionality
should equally
>> be
>> >>>> applied to this type of case.
>> >>>>
>> >>>> If you agree this would be useful, I'm happy to
write a patch for
>> >>>> merge.data.frame that will add suffixes in this case -
I intend to do
>> >>>> the
>> >>>> same for merge.data.table in the data.table package
where I initially
>> >>>> encountered the edge case.
>> >>>>
>> >>>> Best,
>> >>>>
>> >>>> Scott
>> >>>>
>> >>>> On 17 February 2018 at 03:53, <frederik at
ofb.net> wrote:
>> >>>>
>> >>>> Hi Scott,
>> >>>>>
>> >>>>> It seems like reasonable behavior to me. What
result would you
>> expect?
>> >>>>> That the second "name" should be called
"name.y"?
>> >>>>>
>> >>>>> The "merge" documentation says:
>> >>>>>
>> >>>>>      If the columns in the data frames not used in
merging have any
>> >>>>>      common names, these have ?suffixes?
(?".x"? and ?".y"? by
>> default)
>> >>>>>      appended to try to make the names of the
result unique.
>> >>>>>
>> >>>>> Since the first "name" column was used
in merging, leaving both
>> >>>>> without a suffix seems consistent with the
documentation...
>> >>>>>
>> >>>>> Frederick
>> >>>>>
>> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott
Ritchie wrote:
>> >>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I was unable to find a bug report for this
with a cursory search,
>> but
>> >>>>>>
>> >>>>> would
>> >>>>>
>> >>>>>> like clarification if this is intended or
unavoidable behaviour:
>> >>>>>>
>> >>>>>> ```{r}
>> >>>>>> # Create example data.frames
>> >>>>>> parents <-
data.frame(name=c("Sarah", "Max", "Qin",
"Lex"),
>> >>>>>>                        sex=c("F",
"M", "F", "M"),
>> >>>>>>                        age=c(41, 43, 36, 51))
>> >>>>>> children <-
data.frame(parent=c("Sarah", "Max", "Qin"),
>> >>>>>>                        
name=c("Oliver", "Sebastian", "Kai-lee"),
>> >>>>>>                         sex=c("M",
"M", "F"),
>> >>>>>>                         age=c(5,8,7))
>> >>>>>>
>> >>>>>> # Merge() creates a duplicated
"name" column:
>> >>>>>> merge(parents, children, by.x =
"name", by.y = "parent")
>> >>>>>> ```
>> >>>>>>
>> >>>>>> Output:
>> >>>>>> ```
>> >>>>>>     name sex.x age.x      name sex.y age.y
>> >>>>>> 1   Max     M    43 Sebastian     M     8
>> >>>>>> 2   Qin     F    36   Kai-lee     F     7
>> >>>>>> 3 Sarah     F    41    Oliver     M     5
>> >>>>>> Warning message:
>> >>>>>> In merge.data.frame(parents, children, by.x =
"name", by.y >> >>>>>> "parent") :
>> >>>>>>    column name ?name? is duplicated in the
result
>> >>>>>> ```
>> >>>>>>
>> >>>>>> Kind Regards,
>> >>>>>>
>> >>>>>> Scott Ritchie
>> >>>>>>
>> >>>>>>        [[alternative HTML version deleted]]
>> >>>>>>
>> >>>>>> ______________________________________________
>> >>>>>> R-devel at r-project.org mailing list
>> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >> Index: src/library/base/R/merge.R
>> >>>
==================================================================>>
>>> --- src/library/base/R/merge.R  (revision 74264)
>> >>> +++ src/library/base/R/merge.R  (working copy)
>> >>> @@ -157,6 +157,15 @@
>> >>>           }
>> >>>             if(has.common.nms) names(y) <- nm.y
>> >>> +        ## If by.x %in% names(y) then duplicate column
names still
>> >>> arise,
>> >>> +        ## apply suffixes to these
>> >>> +        dupe.keyx <- intersect(nm.by, names(y))
>> >>> +        if(length(dupe.keyx)) {
>> >>> +          if(nzchar(suffixes[1L]))
>> >>> +            names(x)[match(dupe.keyx, names(x), 0L)]
<-
>> >>> paste(dupe.keyx, suffixes[1L], sep="")
>> >>> +          if(nzchar(suffixes[2L]))
>> >>> +            names(y)[match(dupe.keyx, names(y), 0L)]
<-
>> >>> paste(dupe.keyx, suffixes[2L], sep="")
>> >>> +        }
>> >>>           nm <- c(names(x), names(y))
>> >>>           if(any(d <- duplicated(nm)))
>> >>>               if(sum(d) > 1L)
>> >>>
>> >>
>> >> ______________________________________________
>> >> R-devel at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >>
>> >>
>> >
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
>
> --
> Gabriel Becker, PhD
> Scientist (Bioinformatics)
> Genentech Research
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.diff
URL:
<https://stat.ethz.ch/pipermail/r-devel/attachments/20180219/6c467cb5/attachment.ksh>

frederik at ofb.net

2018-Feb-20 21:23 UTC

head link

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Hi Scott,

I think that's a good idea and I tried your patch on my copy of the
repository. But it looks to me like the recent patch is identical to
the previous one, can you confirm this?

Frederick

On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie
wrote:> Thanks Gabriel,
> 
> I think your suggested approach is 100% backwards compatible
> 
> Currently in the case of duplicate column names only the first can be
> indexed by its name. This will always be the column appearing in by.x,
> meaning the column in y with the same name cannot be accessed. Appending
> ".y" (suffixes[2L]) to this column means it can now be accessed,
while
> keeping the current behaviour of making the key columns always accessible
> by using the names provided to by.x.
> 
> I've attached a new patch that has this behaviour.
> 
> Best,
> 
> Scott
> 
> On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at
ucdavis.edu> wrote:
> 
> > It seems like there is a way that is backwards compatible-ish in the
sense
> > mentioned and still has the (arguably, but a good argument I think)
better
> > behavior:
> >
> > if by.x is 'name', (AND by.y is not also 'name'), then
x's 'name' column
> > is called name and y's 'name' column (not used int he
merge) is changed to
> > name.y.
> >
> > Now of course this would still change output, but it would change it
to
> > something I think would be better, while retaining the 'merge
columns
> > retain their exact names' mechanic as documented.
> >
> > ~G
> >
> > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at
gmail.com>
> > wrote:
> >
> >> Thanks Duncan and Frederick,
> >>
> >> I suspected as much - there doesn't appear to be any reason
why conflicts
> >> between by.x and names(y) shouldn't and cannot be checked, but
I can see
> >> how this might be more trouble than its worth given it potentially
may
> >> break downstream packages (i.e. any cases where this occurs but
they
> >> expect
> >> the name of the key column(s) to remain the same).
> >>
> >> Best,
> >>
> >> Scott
> >>
> >> On 18 February 2018 at 11:48, Duncan Murdoch <murdoch.duncan at
gmail.com>
> >> wrote:
> >>
> >> > On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
> >> >
> >> >> Hi Scott,
> >> >>
> >> >> Thanks for the patch. I'm not really involved in R
development; it
> >> >> will be up to someone in the R core team to apply it. I
would hazard
> >> >> to say that even if correct (I haven't checked), it
will not be
> >> >> applied because the change might break existing code. For
example it
> >> >> seems like reasonable code might easily assume that a
column with the
> >> >> same name as "by.x" exists in the output of
'merge'. That's just my
> >> >> best guess... I don't participate on here often.
> >> >>
> >> >
> >> >
> >> > I think you're right.  If I were still a member of R
Core, I would want
> >> to
> >> > test this against all packages on CRAN and Bioconductor, and
since that
> >> > test takes a couple of days to run on my laptop, I'd
probably never get
> >> > around to it.
> >> >
> >> > There are lots of cases where "I would have done that
differently", but
> >> > most of them are far too much trouble to change now that R is
more than
> >> 20
> >> > years old.  And in many cases it will turn out that the way R
does it
> >> > actually does make more sense than the way I would have done
it.
> >> >
> >> > Duncan Murdoch
> >> >
> >> >
> >> >
> >> >> Cheers,
> >> >>
> >> >> Frederick
> >> >>
> >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie
wrote:
> >> >>
> >> >>> The attached patch.diff will make merge.data.frame()
append the
> >> suffixes
> >> >>> to
> >> >>> columns with common names between by.x and names(y).
> >> >>>
> >> >>> Best,
> >> >>>
> >> >>> Scott Ritchie
> >> >>>
> >> >>> On 17 February 2018 at 11:15, Scott Ritchie
<s.ritchie73 at gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>> Hi Frederick,
> >> >>>>
> >> >>>> I would expect that any duplicate names in the
resulting data.frame
> >> >>>> would
> >> >>>> have the suffixes appended to them, regardless of
whether or not they
> >> >>>> are
> >> >>>> used as the join key. So in my example I would
expect "names.x" and
> >> >>>> "names.y" to indicate their source
data.frame.
> >> >>>>
> >> >>>> While careful reading of the documentation
reveals this is not the
> >> >>>> case, I
> >> >>>> would argue the intent of the suffixes
functionality should equally
> >> be
> >> >>>> applied to this type of case.
> >> >>>>
> >> >>>> If you agree this would be useful, I'm happy
to write a patch for
> >> >>>> merge.data.frame that will add suffixes in this
case - I intend to do
> >> >>>> the
> >> >>>> same for merge.data.table in the data.table
package where I initially
> >> >>>> encountered the edge case.
> >> >>>>
> >> >>>> Best,
> >> >>>>
> >> >>>> Scott
> >> >>>>
> >> >>>> On 17 February 2018 at 03:53, <frederik at
ofb.net> wrote:
> >> >>>>
> >> >>>> Hi Scott,
> >> >>>>>
> >> >>>>> It seems like reasonable behavior to me. What
result would you
> >> expect?
> >> >>>>> That the second "name" should be
called "name.y"?
> >> >>>>>
> >> >>>>> The "merge" documentation says:
> >> >>>>>
> >> >>>>>      If the columns in the data frames not
used in merging have any
> >> >>>>>      common names, these have ?suffixes?
(?".x"? and ?".y"? by
> >> default)
> >> >>>>>      appended to try to make the names of the
result unique.
> >> >>>>>
> >> >>>>> Since the first "name" column was
used in merging, leaving both
> >> >>>>> without a suffix seems consistent with the
documentation...
> >> >>>>>
> >> >>>>> Frederick
> >> >>>>>
> >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100,
Scott Ritchie wrote:
> >> >>>>>
> >> >>>>>> Hi,
> >> >>>>>>
> >> >>>>>> I was unable to find a bug report for
this with a cursory search,
> >> but
> >> >>>>>>
> >> >>>>> would
> >> >>>>>
> >> >>>>>> like clarification if this is intended or
unavoidable behaviour:
> >> >>>>>>
> >> >>>>>> ```{r}
> >> >>>>>> # Create example data.frames
> >> >>>>>> parents <-
data.frame(name=c("Sarah", "Max", "Qin",
"Lex"),
> >> >>>>>>                       
sex=c("F", "M", "F", "M"),
> >> >>>>>>                        age=c(41, 43, 36,
51))
> >> >>>>>> children <-
data.frame(parent=c("Sarah", "Max", "Qin"),
> >> >>>>>>                        
name=c("Oliver", "Sebastian", "Kai-lee"),
> >> >>>>>>                        
sex=c("M", "M", "F"),
> >> >>>>>>                         age=c(5,8,7))
> >> >>>>>>
> >> >>>>>> # Merge() creates a duplicated
"name" column:
> >> >>>>>> merge(parents, children, by.x =
"name", by.y = "parent")
> >> >>>>>> ```
> >> >>>>>>
> >> >>>>>> Output:
> >> >>>>>> ```
> >> >>>>>>     name sex.x age.x      name sex.y
age.y
> >> >>>>>> 1   Max     M    43 Sebastian     M     8
> >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
> >> >>>>>> 3 Sarah     F    41    Oliver     M     5
> >> >>>>>> Warning message:
> >> >>>>>> In merge.data.frame(parents, children,
by.x = "name", by.y > >> >>>>>>
"parent") :
> >> >>>>>>    column name ?name? is duplicated in
the result
> >> >>>>>> ```
> >> >>>>>>
> >> >>>>>> Kind Regards,
> >> >>>>>>
> >> >>>>>> Scott Ritchie
> >> >>>>>>
> >> >>>>>>        [[alternative HTML version
deleted]]
> >> >>>>>>
> >> >>>>>>
______________________________________________
> >> >>>>>> R-devel at r-project.org mailing list
> >> >>>>>>
https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >> Index: src/library/base/R/merge.R
> >> >>>
==================================================================> >>
>>> --- src/library/base/R/merge.R  (revision 74264)
> >> >>> +++ src/library/base/R/merge.R  (working copy)
> >> >>> @@ -157,6 +157,15 @@
> >> >>>           }
> >> >>>             if(has.common.nms) names(y) <- nm.y
> >> >>> +        ## If by.x %in% names(y) then duplicate
column names still
> >> >>> arise,
> >> >>> +        ## apply suffixes to these
> >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
> >> >>> +        if(length(dupe.keyx)) {
> >> >>> +          if(nzchar(suffixes[1L]))
> >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)]
<-
> >> >>> paste(dupe.keyx, suffixes[1L], sep="")
> >> >>> +          if(nzchar(suffixes[2L]))
> >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)]
<-
> >> >>> paste(dupe.keyx, suffixes[2L], sep="")
> >> >>> +        }
> >> >>>           nm <- c(names(x), names(y))
> >> >>>           if(any(d <- duplicated(nm)))
> >> >>>               if(sum(d) > 1L)
> >> >>>
> >> >>
> >> >> ______________________________________________
> >> >> R-devel at r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >>
> >> >>
> >> >
> >>
> >>         [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >
> >
> >
> > --
> > Gabriel Becker, PhD
> > Scientist (Bioinformatics)
> > Genentech Research
> >
> Index: src/library/base/R/merge.R
> ==================================================================> ---
src/library/base/R/merge.R	(revision 74264)
> +++ src/library/base/R/merge.R	(working copy)
> @@ -157,6 +157,15 @@
>          }
>  
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to these
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[1L]))
> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
paste(dupe.keyx, suffixes[1L], sep="")
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx, suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Scott Ritchie

2018-Feb-20 23:06 UTC

head link

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Hi Frederick,

It looks like I didn't overwrite the patch.diff file after the last edits.
Here's the correct patch (attached and copied below):

Index: src/library/base/R/merge.R
==================================================================---
src/library/base/R/merge.R (revision 74280)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,14 @@
         }

         if(has.common.nms) names(y) <- nm.y
+        ## If by.x %in% names(y) then duplicate column names still arise,
+        ## apply suffixes to just y - this keeps backwards compatibility
+        ## when referring to by.x in the resulting data.frame
+        dupe.keyx <- intersect(nm.by, names(y))
+        if(length(dupe.keyx)) {
+          if(nzchar(suffixes[2L]))
+            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
suffixes[2L], sep="")
+        }
         nm <- c(names(x), names(y))
         if(any(d <- duplicated(nm)))
             if(sum(d) > 1L)

Best,

Scott

On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:
> Hi Scott,
>
> I think that's a good idea and I tried your patch on my copy of the
> repository. But it looks to me like the recent patch is identical to
> the previous one, can you confirm this?
>
> Frederick
>
> On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
> > Thanks Gabriel,
> >
> > I think your suggested approach is 100% backwards compatible
> >
> > Currently in the case of duplicate column names only the first can be
> > indexed by its name. This will always be the column appearing in by.x,
> > meaning the column in y with the same name cannot be accessed.
Appending
> > ".y" (suffixes[2L]) to this column means it can now be
accessed, while
> > keeping the current behaviour of making the key columns always
accessible
> > by using the names provided to by.x.
> >
> > I've attached a new patch that has this behaviour.
> >
> > Best,
> >
> > Scott
> >
> > On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at
ucdavis.edu>
> wrote:
> >
> > > It seems like there is a way that is backwards compatible-ish in
the
> sense
> > > mentioned and still has the (arguably, but a good argument I
think)
> better
> > > behavior:
> > >
> > > if by.x is 'name', (AND by.y is not also 'name'),
then x's 'name'
> column
> > > is called name and y's 'name' column (not used int he
merge) is
> changed to
> > > name.y.
> > >
> > > Now of course this would still change output, but it would change
it to
> > > something I think would be better, while retaining the 'merge
columns
> > > retain their exact names' mechanic as documented.
> > >
> > > ~G
> > >
> > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at
gmail.com>
> > > wrote:
> > >
> > >> Thanks Duncan and Frederick,
> > >>
> > >> I suspected as much - there doesn't appear to be any
reason why
> conflicts
> > >> between by.x and names(y) shouldn't and cannot be
checked, but I can
> see
> > >> how this might be more trouble than its worth given it
potentially may
> > >> break downstream packages (i.e. any cases where this occurs
but they
> > >> expect
> > >> the name of the key column(s) to remain the same).
> > >>
> > >> Best,
> > >>
> > >> Scott
> > >>
> > >> On 18 February 2018 at 11:48, Duncan Murdoch <
> murdoch.duncan at gmail.com>
> > >> wrote:
> > >>
> > >> > On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
> > >> >
> > >> >> Hi Scott,
> > >> >>
> > >> >> Thanks for the patch. I'm not really involved in
R development; it
> > >> >> will be up to someone in the R core team to apply
it. I would
> hazard
> > >> >> to say that even if correct (I haven't checked),
it will not be
> > >> >> applied because the change might break existing
code. For example
> it
> > >> >> seems like reasonable code might easily assume that
a column with
> the
> > >> >> same name as "by.x" exists in the output
of 'merge'. That's just my
> > >> >> best guess... I don't participate on here often.
> > >> >>
> > >> >
> > >> >
> > >> > I think you're right.  If I were still a member of R
Core, I would
> want
> > >> to
> > >> > test this against all packages on CRAN and Bioconductor,
and since
> that
> > >> > test takes a couple of days to run on my laptop, I'd
probably never
> get
> > >> > around to it.
> > >> >
> > >> > There are lots of cases where "I would have done
that differently",
> but
> > >> > most of them are far too much trouble to change now that
R is more
> than
> > >> 20
> > >> > years old.  And in many cases it will turn out that the
way R does
> it
> > >> > actually does make more sense than the way I would have
done it.
> > >> >
> > >> > Duncan Murdoch
> > >> >
> > >> >
> > >> >
> > >> >> Cheers,
> > >> >>
> > >> >> Frederick
> > >> >>
> > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott
Ritchie wrote:
> > >> >>
> > >> >>> The attached patch.diff will make
merge.data.frame() append the
> > >> suffixes
> > >> >>> to
> > >> >>> columns with common names between by.x and
names(y).
> > >> >>>
> > >> >>> Best,
> > >> >>>
> > >> >>> Scott Ritchie
> > >> >>>
> > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
> s.ritchie73 at gmail.com>
> > >> >>> wrote:
> > >> >>>
> > >> >>> Hi Frederick,
> > >> >>>>
> > >> >>>> I would expect that any duplicate names in
the resulting
> data.frame
> > >> >>>> would
> > >> >>>> have the suffixes appended to them,
regardless of whether or not
> they
> > >> >>>> are
> > >> >>>> used as the join key. So in my example I
would expect "names.x"
> and
> > >> >>>> "names.y" to indicate their source
data.frame.
> > >> >>>>
> > >> >>>> While careful reading of the documentation
reveals this is not
> the
> > >> >>>> case, I
> > >> >>>> would argue the intent of the suffixes
functionality should
> equally
> > >> be
> > >> >>>> applied to this type of case.
> > >> >>>>
> > >> >>>> If you agree this would be useful, I'm
happy to write a patch for
> > >> >>>> merge.data.frame that will add suffixes in
this case - I intend
> to do
> > >> >>>> the
> > >> >>>> same for merge.data.table in the data.table
package where I
> initially
> > >> >>>> encountered the edge case.
> > >> >>>>
> > >> >>>> Best,
> > >> >>>>
> > >> >>>> Scott
> > >> >>>>
> > >> >>>> On 17 February 2018 at 03:53, <frederik
at ofb.net> wrote:
> > >> >>>>
> > >> >>>> Hi Scott,
> > >> >>>>>
> > >> >>>>> It seems like reasonable behavior to me.
What result would you
> > >> expect?
> > >> >>>>> That the second "name" should
be called "name.y"?
> > >> >>>>>
> > >> >>>>> The "merge" documentation
says:
> > >> >>>>>
> > >> >>>>>      If the columns in the data frames
not used in merging have
> any
> > >> >>>>>      common names, these have ?suffixes?
(?".x"? and ?".y"? by
> > >> default)
> > >> >>>>>      appended to try to make the names
of the result unique.
> > >> >>>>>
> > >> >>>>> Since the first "name" column
was used in merging, leaving both
> > >> >>>>> without a suffix seems consistent with
the documentation...
> > >> >>>>>
> > >> >>>>> Frederick
> > >> >>>>>
> > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM
+1100, Scott Ritchie wrote:
> > >> >>>>>
> > >> >>>>>> Hi,
> > >> >>>>>>
> > >> >>>>>> I was unable to find a bug report
for this with a cursory
> search,
> > >> but
> > >> >>>>>>
> > >> >>>>> would
> > >> >>>>>
> > >> >>>>>> like clarification if this is
intended or unavoidable
> behaviour:
> > >> >>>>>>
> > >> >>>>>> ```{r}
> > >> >>>>>> # Create example data.frames
> > >> >>>>>> parents <-
data.frame(name=c("Sarah", "Max", "Qin",
"Lex"),
> > >> >>>>>>                       
sex=c("F", "M", "F", "M"),
> > >> >>>>>>                        age=c(41, 43,
36, 51))
> > >> >>>>>> children <-
data.frame(parent=c("Sarah", "Max", "Qin"),
> > >> >>>>>>                        
name=c("Oliver", "Sebastian",
> "Kai-lee"),
> > >> >>>>>>                        
sex=c("M", "M", "F"),
> > >> >>>>>>                        
age=c(5,8,7))
> > >> >>>>>>
> > >> >>>>>> # Merge() creates a duplicated
"name" column:
> > >> >>>>>> merge(parents, children, by.x =
"name", by.y = "parent")
> > >> >>>>>> ```
> > >> >>>>>>
> > >> >>>>>> Output:
> > >> >>>>>> ```
> > >> >>>>>>     name sex.x age.x      name sex.y
age.y
> > >> >>>>>> 1   Max     M    43 Sebastian     M 
8
> > >> >>>>>> 2   Qin     F    36   Kai-lee     F 
7
> > >> >>>>>> 3 Sarah     F    41    Oliver     M 
5
> > >> >>>>>> Warning message:
> > >> >>>>>> In merge.data.frame(parents,
children, by.x = "name", by.y > > >>
>>>>>> "parent") :
> > >> >>>>>>    column name ?name? is duplicated
in the result
> > >> >>>>>> ```
> > >> >>>>>>
> > >> >>>>>> Kind Regards,
> > >> >>>>>>
> > >> >>>>>> Scott Ritchie
> > >> >>>>>>
> > >> >>>>>>        [[alternative HTML version
deleted]]
> > >> >>>>>>
> > >> >>>>>>
______________________________________________
> > >> >>>>>> R-devel at r-project.org mailing
list
> > >> >>>>>>
https://stat.ethz.ch/mailman/listinfo/r-devel
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>
> > >> >>>>
> > >> >> Index: src/library/base/R/merge.R
> > >> >>>
===========================================================> ======> >
>> >>> --- src/library/base/R/merge.R  (revision 74264)
> > >> >>> +++ src/library/base/R/merge.R  (working copy)
> > >> >>> @@ -157,6 +157,15 @@
> > >> >>>           }
> > >> >>>             if(has.common.nms) names(y) <-
nm.y
> > >> >>> +        ## If by.x %in% names(y) then duplicate
column names
> still
> > >> >>> arise,
> > >> >>> +        ## apply suffixes to these
> > >> >>> +        dupe.keyx <- intersect(nm.by,
names(y))
> > >> >>> +        if(length(dupe.keyx)) {
> > >> >>> +          if(nzchar(suffixes[1L]))
> > >> >>> +            names(x)[match(dupe.keyx, names(x),
0L)] <-
> > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
> > >> >>> +          if(nzchar(suffixes[2L]))
> > >> >>> +            names(y)[match(dupe.keyx, names(y),
0L)] <-
> > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
> > >> >>> +        }
> > >> >>>           nm <- c(names(x), names(y))
> > >> >>>           if(any(d <- duplicated(nm)))
> > >> >>>               if(sum(d) > 1L)
> > >> >>>
> > >> >>
> > >> >> ______________________________________________
> > >> >> R-devel at r-project.org mailing list
> > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >> >>
> > >> >>
> > >> >
> > >>
> > >>         [[alternative HTML version deleted]]
> > >>
> > >> ______________________________________________
> > >> R-devel at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >>
> > >
> > >
> > >
> > > --
> > > Gabriel Becker, PhD
> > > Scientist (Bioinformatics)
> > > Genentech Research
> > >
>
> > Index: src/library/base/R/merge.R
> > ==================================================================>
> --- src/library/base/R/merge.R        (revision 74264)
> > +++ src/library/base/R/merge.R        (working copy)
> > @@ -157,6 +157,15 @@
> >          }
> >
> >          if(has.common.nms) names(y) <- nm.y
> > +        ## If by.x %in% names(y) then duplicate column names still
> arise,
> > +        ## apply suffixes to these
> > +        dupe.keyx <- intersect(nm.by, names(y))
> > +        if(length(dupe.keyx)) {
> > +          if(nzchar(suffixes[1L]))
> > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> paste(dupe.keyx, suffixes[1L], sep="")
> > +          if(nzchar(suffixes[2L]))
> > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> paste(dupe.keyx, suffixes[2L], sep="")
> > +        }
> >          nm <- c(names(x), names(y))
> >          if(any(d <- duplicated(nm)))
> >              if(sum(d) > 1L)
>
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.diff
URL:
<https://stat.ethz.ch/pipermail/r-devel/attachments/20180221/7e77cbd6/attachment.ksh>

Seemingly Similar Threads

Search for more possibly parallel threads

R devel - Feb 2018 - Duplicate column names created by base::merge() when by.x has the same name as a column in y

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Seemingly Similar Threads