thr3ads.net - R devel - [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y [Feb 2018]

If this information is useful, please help other people find it:
Share via:

Martin Maechler

2018-Feb-22 11:31 UTC

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

>>>>> Gabriel Becker <gmbecker at ucdavis.edu>
>>>>>     on Wed, 21 Feb 2018 07:11:44 -0800 writes:
    > Hi all,
    > For the record this approach isnt 100% backwards compatible, because
    > names(mergeddf) will e incompatibly different. Thatx why i claimed
    > bakcwards compatable-ish

exactly.

    > That said its still worth considering imho because of the reasons
stated
    > (and honestly one particular simple reading of the docs might suggest
that
    > this was thr intended behavior all along). Im not a member of Rcore
through
    > so i cant do said considering myself.

I agree with Scott, Frederik and you that this changes seems
worth considering.
As Duncan Murdoch has mentioned, this alone may not be
sufficient.

In addition to your proposed patch (which I have simplified, not
using intersection() but working with underlying  match()
directly), it is little work to introduce an extra argument, I'm
calling  'no.dups = TRUE'  which when set to false would mirror
current R's behavior... and documenting it, then also documents the
new behavior (to some extent).

My plan is to commit it soonish ;-)
Martin

    > Best,
    > ~G

    > On Feb 20, 2018 7:15 PM, <frederik at ofb.net> wrote:

    > Hi Scott,

    > I tried the new patch and can confirm that it has the advertised
    > behavior on a couple of test cases. I think it makes sense to apply
    > it, because any existing code which refers to a second duplicate
    > data.frame column by name is already broken, while if the reference is
    > by numerical index then changing the column name shouldn't break
it.

    > I don't know if you need to update the documentation as part of
your
    > patch, or if whoever applies it would be happy to do that. Somebody
    > from R core want to weigh in on this?

    > I attach a file with the test example from your original email as well
    > as a second test case I added with two "by" columns.

    > Thanks,

    > Frederick

    > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
    >> Hi Frederick,
    >> 
    >> It looks like I didn't overwrite the patch.diff file after the
last edits.
    >> Here's the correct patch (attached and copied below):
    >> 
    >> Index: src/library/base/R/merge.R
    >> ================================================================== 
>> --- src/library/base/R/merge.R (revision 74280)
    >> +++ src/library/base/R/merge.R (working copy)
    >> @@ -157,6 +157,14 @@
    >> }
    >> 
    >> if(has.common.nms) names(y) <- nm.y
    >> +        ## If by.x %in% names(y) then duplicate column names still
arise,
    >> +        ## apply suffixes to just y - this keeps backwards
compatibility
    >> +        ## when referring to by.x in the resulting data.frame
    >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> +        if(length(dupe.keyx)) {
    >> +          if(nzchar(suffixes[2L]))
    >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx,
    >> suffixes[2L], sep="")
    >> +        }
    >> nm <- c(names(x), names(y))
    >> if(any(d <- duplicated(nm)))
    >> if(sum(d) > 1L)
    >> 
    >> Best,
    >> 
    >> Scott
    >> 
    >> On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:
    >> 
    >> > Hi Scott,
    >> >
    >> > I think that's a good idea and I tried your patch on my
copy of the
    >> > repository. But it looks to me like the recent patch is
identical to
    >> > the previous one, can you confirm this?
    >> >
    >> > Frederick
    >> >
    >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
    >> > > Thanks Gabriel,
    >> > >
    >> > > I think your suggested approach is 100% backwards
compatible
    >> > >
    >> > > Currently in the case of duplicate column names only the
first can be
    >> > > indexed by its name. This will always be the column
appearing in by.x,
    >> > > meaning the column in y with the same name cannot be
accessed.
    > Appending
    >> > > ".y" (suffixes[2L]) to this column means it can
now be accessed, while
    >> > > keeping the current behaviour of making the key columns
always
    > accessible
    >> > > by using the names provided to by.x.
    >> > >
    >> > > I've attached a new patch that has this behaviour.
    >> > >
    >> > > Best,
    >> > >
    >> > > Scott
    >> > >
    >> > > On 19 February 2018 at 05:08, Gabriel Becker <gmbecker
at ucdavis.edu>
    >> > wrote:
    >> > >
    >> > > > It seems like there is a way that is backwards
compatible-ish in the
    >> > sense
    >> > > > mentioned and still has the (arguably, but a good
argument I think)
    >> > better
    >> > > > behavior:
    >> > > >
    >> > > > if by.x is 'name', (AND by.y is not also
'name'), then x's 'name'
    >> > column
    >> > > > is called name and y's 'name' column
(not used int he merge) is
    >> > changed to
    >> > > > name.y.
    >> > > >
    >> > > > Now of course this would still change output, but it
would change
    > it to
    >> > > > something I think would be better, while retaining
the 'merge
    > columns
    >> > > > retain their exact names' mechanic as
documented.
    >> > > >
    >> > > > ~G
    >> > > >
    >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
    > s.ritchie73 at gmail.com>
    >> > > > wrote:
    >> > > >
    >> > > >> Thanks Duncan and Frederick,
    >> > > >>
    >> > > >> I suspected as much - there doesn't appear
to be any reason why
    >> > conflicts
    >> > > >> between by.x and names(y) shouldn't and
cannot be checked, but I
    > can
    >> > see
    >> > > >> how this might be more trouble than its worth
given it potentially
    > may
    >> > > >> break downstream packages (i.e. any cases where
this occurs but
    > they
    >> > > >> expect
    >> > > >> the name of the key column(s) to remain the
same).
    >> > > >>
    >> > > >> Best,
    >> > > >>
    >> > > >> Scott
    >> > > >>
    >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch
<
    >> > murdoch.duncan at gmail.com>
    >> > > >> wrote:
    >> > > >>
    >> > > >> > On 17/02/2018 6:36 PM, frederik at ofb.net
wrote:
    >> > > >> >
    >> > > >> >> Hi Scott,
    >> > > >> >>
    >> > > >> >> Thanks for the patch. I'm not
really involved in R development;
    > it
    >> > > >> >> will be up to someone in the R core
team to apply it. I would
    >> > hazard
    >> > > >> >> to say that even if correct (I
haven't checked), it will not be
    >> > > >> >> applied because the change might break
existing code. For
    > example
    >> > it
    >> > > >> >> seems like reasonable code might easily
assume that a column
    > with
    >> > the
    >> > > >> >> same name as "by.x" exists in
the output of 'merge'. That's
    > just my
    >> > > >> >> best guess... I don't participate
on here often.
    >> > > >> >>
    >> > > >> >
    >> > > >> >
    >> > > >> > I think you're right.  If I were still
a member of R Core, I
    > would
    >> > want
    >> > > >> to
    >> > > >> > test this against all packages on CRAN and
Bioconductor, and
    > since
    >> > that
    >> > > >> > test takes a couple of days to run on my
laptop, I'd probably
    > never
    >> > get
    >> > > >> > around to it.
    >> > > >> >
    >> > > >> > There are lots of cases where "I would
have done that
    > differently",
    >> > but
    >> > > >> > most of them are far too much trouble to
change now that R is
    > more
    >> > than
    >> > > >> 20
    >> > > >> > years old.  And in many cases it will turn
out that the way R
    > does
    >> > it
    >> > > >> > actually does make more sense than the way
I would have done it.
    >> > > >> >
    >> > > >> > Duncan Murdoch
    >> > > >> >
    >> > > >> >
    >> > > >> >
    >> > > >> >> Cheers,
    >> > > >> >>
    >> > > >> >> Frederick
    >> > > >> >>
    >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM
+1100, Scott Ritchie wrote:
    >> > > >> >>
    >> > > >> >>> The attached patch.diff will make
merge.data.frame() append the
    >> > > >> suffixes
    >> > > >> >>> to
    >> > > >> >>> columns with common names between
by.x and names(y).
    >> > > >> >>>
    >> > > >> >>> Best,
    >> > > >> >>>
    >> > > >> >>> Scott Ritchie
    >> > > >> >>>
    >> > > >> >>> On 17 February 2018 at 11:15, Scott
Ritchie <
    >> > s.ritchie73 at gmail.com>
    >> > > >> >>> wrote:
    >> > > >> >>>
    >> > > >> >>> Hi Frederick,
    >> > > >> >>>>
    >> > > >> >>>> I would expect that any
duplicate names in the resulting
    >> > data.frame
    >> > > >> >>>> would
    >> > > >> >>>> have the suffixes appended to
them, regardless of whether or
    > not
    >> > they
    >> > > >> >>>> are
    >> > > >> >>>> used as the join key. So in my
example I would expect
    > "names.x"
    >> > and
    >> > > >> >>>> "names.y" to indicate
their source data.frame.
    >> > > >> >>>>
    >> > > >> >>>> While careful reading of the
documentation reveals this is not
    >> > the
    >> > > >> >>>> case, I
    >> > > >> >>>> would argue the intent of the
suffixes functionality should
    >> > equally
    >> > > >> be
    >> > > >> >>>> applied to this type of case.
    >> > > >> >>>>
    >> > > >> >>>> If you agree this would be
useful, I'm happy to write a patch
    > for
    >> > > >> >>>> merge.data.frame that will add
suffixes in this case - I
    > intend
    >> > to do
    >> > > >> >>>> the
    >> > > >> >>>> same for merge.data.table in
the data.table package where I
    >> > initially
    >> > > >> >>>> encountered the edge case.
    >> > > >> >>>>
    >> > > >> >>>> Best,
    >> > > >> >>>>
    >> > > >> >>>> Scott
    >> > > >> >>>>
    >> > > >> >>>> On 17 February 2018 at 03:53,
<frederik at ofb.net> wrote:
    >> > > >> >>>>
    >> > > >> >>>> Hi Scott,
    >> > > >> >>>>>
    >> > > >> >>>>> It seems like reasonable
behavior to me. What result would
    > you
    >> > > >> expect?
    >> > > >> >>>>> That the second
"name" should be called "name.y"?
    >> > > >> >>>>>
    >> > > >> >>>>> The "merge"
documentation says:
    >> > > >> >>>>>
    >> > > >> >>>>>      If the columns in the
data frames not used in merging
    > have
    >> > any
    >> > > >> >>>>>      common names, these
have ?suffixes? (?".x"? and ?".y"?
    > by
    >> > > >> default)
    >> > > >> >>>>>      appended to try to
make the names of the result unique.
    >> > > >> >>>>>
    >> > > >> >>>>> Since the first
"name" column was used in merging, leaving
    > both
    >> > > >> >>>>> without a suffix seems
consistent with the documentation...
    >> > > >> >>>>>
    >> > > >> >>>>> Frederick
    >> > > >> >>>>>
    >> > > >> >>>>> On Fri, Feb 16, 2018 at
09:08:29AM +1100, Scott Ritchie
    > wrote:
    >> > > >> >>>>>
    >> > > >> >>>>>> Hi,
    >> > > >> >>>>>>
    >> > > >> >>>>>> I was unable to find a
bug report for this with a cursory
    >> > search,
    >> > > >> but
    >> > > >> >>>>>>
    >> > > >> >>>>> would
    >> > > >> >>>>>
    >> > > >> >>>>>> like clarification if
this is intended or unavoidable
    >> > behaviour:
    >> > > >> >>>>>>
    >> > > >> >>>>>> ```{r}
    >> > > >> >>>>>> # Create example
data.frames
    >> > > >> >>>>>> parents <-
data.frame(name=c("Sarah", "Max", "Qin",
"Lex"),
    >> > > >> >>>>>>                       
sex=c("F", "M", "F", "M"),
    >> > > >> >>>>>>                       
age=c(41, 43, 36, 51))
    >> > > >> >>>>>> children <-
data.frame(parent=c("Sarah", "Max", "Qin"),
    >> > > >> >>>>>>                        
name=c("Oliver", "Sebastian",
    >> > "Kai-lee"),
    >> > > >> >>>>>>                        
sex=c("M", "M", "F"),
    >> > > >> >>>>>>                        
age=c(5,8,7))
    >> > > >> >>>>>>
    >> > > >> >>>>>> # Merge() creates a
duplicated "name" column:
    >> > > >> >>>>>> merge(parents,
children, by.x = "name", by.y = "parent")
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>
    >> > > >> >>>>>> Output:
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>     name sex.x age.x   
name sex.y age.y
    >> > > >> >>>>>> 1   Max     M    43
Sebastian     M     8
    >> > > >> >>>>>> 2   Qin     F    36  
Kai-lee     F     7
    >> > > >> >>>>>> 3 Sarah     F    41   
Oliver     M     5
    >> > > >> >>>>>> Warning message:
    >> > > >> >>>>>> In
merge.data.frame(parents, children, by.x = "name", by.y     >>
> > >> >>>>>> "parent") :
    >> > > >> >>>>>>    column name ?name?
is duplicated in the result
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>
    >> > > >> >>>>>> Kind Regards,
    >> > > >> >>>>>>
    >> > > >> >>>>>> Scott Ritchie
    >> > > >> >>>>>>
    >> > > >> >>>>>>        [[alternative
HTML version deleted]]
    >> > > >> >>>>>>
    >> > > >> >>>>>>
______________________________________________
    >> > > >> >>>>>> R-devel at
r-project.org mailing list
    >> > > >> >>>>>>
https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >> >>>>>>
    >> > > >> >>>>>>
    >> > > >> >>>>>
    >> > > >> >>>>
    >> > > >> >>>>
    >> > > >> >> Index: src/library/base/R/merge.R
    >> > > >> >>>
===========================================================    >> >
======    >> > > >> >>> ---
src/library/base/R/merge.R  (revision 74264)
    >> > > >> >>> +++ src/library/base/R/merge.R 
(working copy)
    >> > > >> >>> @@ -157,6 +157,15 @@
    >> > > >> >>>           }
    >> > > >> >>>             if(has.common.nms)
names(y) <- nm.y
    >> > > >> >>> +        ## If by.x %in% names(y)
then duplicate column names
    >> > still
    >> > > >> >>> arise,
    >> > > >> >>> +        ## apply suffixes to these
    >> > > >> >>> +        dupe.keyx <-
intersect(nm.by, names(y))
    >> > > >> >>> +        if(length(dupe.keyx)) {
    >> > > >> >>> +          if(nzchar(suffixes[1L]))
    >> > > >> >>> +           
names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> > > >> >>> paste(dupe.keyx, suffixes[1L],
sep="")
    >> > > >> >>> +          if(nzchar(suffixes[2L]))
    >> > > >> >>> +           
names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> > > >> >>> paste(dupe.keyx, suffixes[2L],
sep="")
    >> > > >> >>> +        }
    >> > > >> >>>           nm <- c(names(x),
names(y))
    >> > > >> >>>           if(any(d <-
duplicated(nm)))
    >> > > >> >>>               if(sum(d) > 1L)
    >> > > >> >>>
    >> > > >> >>
    >> > > >> >>
______________________________________________
    >> > > >> >> R-devel at r-project.org mailing list
    >> > > >> >>
https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >> >>
    >> > > >> >>
    >> > > >> >
    >> > > >>
    >> > > >>         [[alternative HTML version deleted]]
    >> > > >>
    >> > > >> ______________________________________________
    >> > > >> R-devel at r-project.org mailing list
    >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >>
    >> > > >
    >> > > >
    >> > > >
    >> > > > --
    >> > > > Gabriel Becker, PhD
    >> > > > Scientist (Bioinformatics)
    >> > > > Genentech Research
    >> > > >
    >> >
    >> > > Index: src/library/base/R/merge.R
    >> > >
==================================================================    >>
> > --- src/library/base/R/merge.R        (revision 74264)
    >> > > +++ src/library/base/R/merge.R        (working copy)
    >> > > @@ -157,6 +157,15 @@
    >> > >          }
    >> > >
    >> > >          if(has.common.nms) names(y) <- nm.y
    >> > > +        ## If by.x %in% names(y) then duplicate column
names still
    >> > arise,
    >> > > +        ## apply suffixes to these
    >> > > +        dupe.keyx <- intersect(nm.by, names(y))
    >> > > +        if(length(dupe.keyx)) {
    >> > > +          if(nzchar(suffixes[1L]))
    >> > > +            names(x)[match(dupe.keyx, names(x), 0L)]
<-
    >> > paste(dupe.keyx, suffixes[1L], sep="")
    >> > > +          if(nzchar(suffixes[2L]))
    >> > > +            names(y)[match(dupe.keyx, names(y), 0L)]
<-
    >> > paste(dupe.keyx, suffixes[2L], sep="")
    >> > > +        }
    >> > >          nm <- c(names(x), names(y))
    >> > >          if(any(d <- duplicated(nm)))
    >> > >              if(sum(d) > 1L)
    >> >
    >> > > ______________________________________________
    >> > > R-devel at r-project.org mailing list
    >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >
    >> >

    >> Index: src/library/base/R/merge.R
    >> ================================================================== 
>> --- src/library/base/R/merge.R        (revision 74280)
    >> +++ src/library/base/R/merge.R        (working copy)
    >> @@ -157,6 +157,14 @@
    >> }
    >> 
    >> if(has.common.nms) names(y) <- nm.y
    >> +        ## If by.x %in% names(y) then duplicate column names still
arise,
    >> +        ## apply suffixes to just y - this keeps backwards
compatibility
    >> +        ## when referring to by.x in the resulting data.frame
    >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> +        if(length(dupe.keyx)) {
    >> +          if(nzchar(suffixes[2L]))
    >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
paste(dupe.keyx,
    > suffixes[2L], sep="")
    >> +        }
    >> nm <- c(names(x), names(y))
    >> if(any(d <- duplicated(nm)))
    >> if(sum(d) > 1L)

    > [[alternative HTML version deleted]]

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

Scott Ritchie

2018-Feb-23 01:32 UTC

head link

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Thanks Martin!

Can you clarify the functionality of the 'no.dups' argument so I can
change
my patch to `data.table:::merge.data.table` accordingly?

- When `no.dups=TRUE` will the suffix to the by.x column name? Or will it
take the functionality of the second functionality where only the column in
y has the suffix added?
- When `no.dups=FALSE` will the output be the same as it currently (no
suffix added to either column)? Or will add the suffix to the column in y?

Best,

Scott

On 22 February 2018 at 22:31, Martin Maechler <maechler at
stat.math.ethz.ch>
wrote:
> >>>>> Gabriel Becker <gmbecker at ucdavis.edu>
> >>>>>     on Wed, 21 Feb 2018 07:11:44 -0800 writes:
>
>     > Hi all,
>     > For the record this approach isnt 100% backwards compatible,
because
>     > names(mergeddf) will e incompatibly different. Thatx why i claimed
>     > bakcwards compatable-ish
>
> exactly.
>
>     > That said its still worth considering imho because of the reasons
> stated
>     > (and honestly one particular simple reading of the docs might
> suggest that
>     > this was thr intended behavior all along). Im not a member of
Rcore
> through
>     > so i cant do said considering myself.
>
> I agree with Scott, Frederik and you that this changes seems
> worth considering.
> As Duncan Murdoch has mentioned, this alone may not be
> sufficient.
>
> In addition to your proposed patch (which I have simplified, not
> using intersection() but working with underlying  match()
> directly), it is little work to introduce an extra argument, I'm
> calling  'no.dups = TRUE'  which when set to false would mirror
> current R's behavior... and documenting it, then also documents the
> new behavior (to some extent).
>
> My plan is to commit it soonish ;-)
> Martin
>
>     > Best,
>     > ~G
>
>     > On Feb 20, 2018 7:15 PM, <frederik at ofb.net> wrote:
>
>     > Hi Scott,
>
>     > I tried the new patch and can confirm that it has the advertised
>     > behavior on a couple of test cases. I think it makes sense to
apply
>     > it, because any existing code which refers to a second duplicate
>     > data.frame column by name is already broken, while if the
reference
> is
>     > by numerical index then changing the column name shouldn't
break it.
>
>     > I don't know if you need to update the documentation as part
of your
>     > patch, or if whoever applies it would be happy to do that.
Somebody
>     > from R core want to weigh in on this?
>
>     > I attach a file with the test example from your original email as
> well
>     > as a second test case I added with two "by" columns.
>
>     > Thanks,
>
>     > Frederick
>
>     > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
>     >> Hi Frederick,
>     >>
>     >> It looks like I didn't overwrite the patch.diff file after
the last
> edits.
>     >> Here's the correct patch (attached and copied below):
>     >>
>     >> Index: src/library/base/R/merge.R
>     >>
==================================================================>    
>> --- src/library/base/R/merge.R (revision 74280)
>     >> +++ src/library/base/R/merge.R (working copy)
>     >> @@ -157,6 +157,14 @@
>     >> }
>     >>
>     >> if(has.common.nms) names(y) <- nm.y
>     >> +        ## If by.x %in% names(y) then duplicate column names
still
> arise,
>     >> +        ## apply suffixes to just y - this keeps backwards
> compatibility
>     >> +        ## when referring to by.x in the resulting data.frame
>     >> +        dupe.keyx <- intersect(nm.by, names(y))
>     >> +        if(length(dupe.keyx)) {
>     >> +          if(nzchar(suffixes[2L]))
>     >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> paste(dupe.keyx,
>     >> suffixes[2L], sep="")
>     >> +        }
>     >> nm <- c(names(x), names(y))
>     >> if(any(d <- duplicated(nm)))
>     >> if(sum(d) > 1L)
>     >>
>     >> Best,
>     >>
>     >> Scott
>     >>
>     >> On 21 February 2018 at 08:23, <frederik at ofb.net>
wrote:
>     >>
>     >> > Hi Scott,
>     >> >
>     >> > I think that's a good idea and I tried your patch on
my copy of
> the
>     >> > repository. But it looks to me like the recent patch is
identical
> to
>     >> > the previous one, can you confirm this?
>     >> >
>     >> > Frederick
>     >> >
>     >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie
wrote:
>     >> > > Thanks Gabriel,
>     >> > >
>     >> > > I think your suggested approach is 100% backwards
compatible
>     >> > >
>     >> > > Currently in the case of duplicate column names only
the first
> can be
>     >> > > indexed by its name. This will always be the column
appearing
> in by.x,
>     >> > > meaning the column in y with the same name cannot be
accessed.
>     > Appending
>     >> > > ".y" (suffixes[2L]) to this column means
it can now be
> accessed, while
>     >> > > keeping the current behaviour of making the key
columns always
>     > accessible
>     >> > > by using the names provided to by.x.
>     >> > >
>     >> > > I've attached a new patch that has this
behaviour.
>     >> > >
>     >> > > Best,
>     >> > >
>     >> > > Scott
>     >> > >
>     >> > > On 19 February 2018 at 05:08, Gabriel Becker <
> gmbecker at ucdavis.edu>
>     >> > wrote:
>     >> > >
>     >> > > > It seems like there is a way that is backwards
compatible-ish
> in the
>     >> > sense
>     >> > > > mentioned and still has the (arguably, but a
good argument I
> think)
>     >> > better
>     >> > > > behavior:
>     >> > > >
>     >> > > > if by.x is 'name', (AND by.y is not
also 'name'), then x's
> 'name'
>     >> > column
>     >> > > > is called name and y's 'name'
column (not used int he merge)
> is
>     >> > changed to
>     >> > > > name.y.
>     >> > > >
>     >> > > > Now of course this would still change output,
but it would
> change
>     > it to
>     >> > > > something I think would be better, while
retaining the 'merge
>     > columns
>     >> > > > retain their exact names' mechanic as
documented.
>     >> > > >
>     >> > > > ~G
>     >> > > >
>     >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie
<
>     > s.ritchie73 at gmail.com>
>     >> > > > wrote:
>     >> > > >
>     >> > > >> Thanks Duncan and Frederick,
>     >> > > >>
>     >> > > >> I suspected as much - there doesn't
appear to be any reason
> why
>     >> > conflicts
>     >> > > >> between by.x and names(y) shouldn't and
cannot be checked,
> but I
>     > can
>     >> > see
>     >> > > >> how this might be more trouble than its
worth given it
> potentially
>     > may
>     >> > > >> break downstream packages (i.e. any cases
where this occurs
> but
>     > they
>     >> > > >> expect
>     >> > > >> the name of the key column(s) to remain the
same).
>     >> > > >>
>     >> > > >> Best,
>     >> > > >>
>     >> > > >> Scott
>     >> > > >>
>     >> > > >> On 18 February 2018 at 11:48, Duncan
Murdoch <
>     >> > murdoch.duncan at gmail.com>
>     >> > > >> wrote:
>     >> > > >>
>     >> > > >> > On 17/02/2018 6:36 PM, frederik at
ofb.net wrote:
>     >> > > >> >
>     >> > > >> >> Hi Scott,
>     >> > > >> >>
>     >> > > >> >> Thanks for the patch. I'm not
really involved in R
> development;
>     > it
>     >> > > >> >> will be up to someone in the R
core team to apply it. I
> would
>     >> > hazard
>     >> > > >> >> to say that even if correct (I
haven't checked), it will
> not be
>     >> > > >> >> applied because the change might
break existing code. For
>     > example
>     >> > it
>     >> > > >> >> seems like reasonable code might
easily assume that a
> column
>     > with
>     >> > the
>     >> > > >> >> same name as "by.x"
exists in the output of 'merge'.
> That's
>     > just my
>     >> > > >> >> best guess... I don't
participate on here often.
>     >> > > >> >>
>     >> > > >> >
>     >> > > >> >
>     >> > > >> > I think you're right.  If I were
still a member of R Core,
> I
>     > would
>     >> > want
>     >> > > >> to
>     >> > > >> > test this against all packages on CRAN
and Bioconductor,
> and
>     > since
>     >> > that
>     >> > > >> > test takes a couple of days to run on
my laptop, I'd
> probably
>     > never
>     >> > get
>     >> > > >> > around to it.
>     >> > > >> >
>     >> > > >> > There are lots of cases where "I
would have done that
>     > differently",
>     >> > but
>     >> > > >> > most of them are far too much trouble
to change now that R
> is
>     > more
>     >> > than
>     >> > > >> 20
>     >> > > >> > years old.  And in many cases it will
turn out that the
> way R
>     > does
>     >> > it
>     >> > > >> > actually does make more sense than the
way I would have
> done it.
>     >> > > >> >
>     >> > > >> > Duncan Murdoch
>     >> > > >> >
>     >> > > >> >
>     >> > > >> >
>     >> > > >> >> Cheers,
>     >> > > >> >>
>     >> > > >> >> Frederick
>     >> > > >> >>
>     >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM
+1100, Scott Ritchie
> wrote:
>     >> > > >> >>
>     >> > > >> >>> The attached patch.diff will
make merge.data.frame()
> append the
>     >> > > >> suffixes
>     >> > > >> >>> to
>     >> > > >> >>> columns with common names
between by.x and names(y).
>     >> > > >> >>>
>     >> > > >> >>> Best,
>     >> > > >> >>>
>     >> > > >> >>> Scott Ritchie
>     >> > > >> >>>
>     >> > > >> >>> On 17 February 2018 at 11:15,
Scott Ritchie <
>     >> > s.ritchie73 at gmail.com>
>     >> > > >> >>> wrote:
>     >> > > >> >>>
>     >> > > >> >>> Hi Frederick,
>     >> > > >> >>>>
>     >> > > >> >>>> I would expect that any
duplicate names in the resulting
>     >> > data.frame
>     >> > > >> >>>> would
>     >> > > >> >>>> have the suffixes appended
to them, regardless of
> whether or
>     > not
>     >> > they
>     >> > > >> >>>> are
>     >> > > >> >>>> used as the join key. So
in my example I would expect
>     > "names.x"
>     >> > and
>     >> > > >> >>>> "names.y" to
indicate their source data.frame.
>     >> > > >> >>>>
>     >> > > >> >>>> While careful reading of
the documentation reveals this
> is not
>     >> > the
>     >> > > >> >>>> case, I
>     >> > > >> >>>> would argue the intent of
the suffixes functionality
> should
>     >> > equally
>     >> > > >> be
>     >> > > >> >>>> applied to this type of
case.
>     >> > > >> >>>>
>     >> > > >> >>>> If you agree this would be
useful, I'm happy to write a
> patch
>     > for
>     >> > > >> >>>> merge.data.frame that will
add suffixes in this case - I
>     > intend
>     >> > to do
>     >> > > >> >>>> the
>     >> > > >> >>>> same for merge.data.table
in the data.table package
> where I
>     >> > initially
>     >> > > >> >>>> encountered the edge case.
>     >> > > >> >>>>
>     >> > > >> >>>> Best,
>     >> > > >> >>>>
>     >> > > >> >>>> Scott
>     >> > > >> >>>>
>     >> > > >> >>>> On 17 February 2018 at
03:53, <frederik at ofb.net> wrote:
>     >> > > >> >>>>
>     >> > > >> >>>> Hi Scott,
>     >> > > >> >>>>>
>     >> > > >> >>>>> It seems like
reasonable behavior to me. What result
> would
>     > you
>     >> > > >> expect?
>     >> > > >> >>>>> That the second
"name" should be called "name.y"?
>     >> > > >> >>>>>
>     >> > > >> >>>>> The "merge"
documentation says:
>     >> > > >> >>>>>
>     >> > > >> >>>>>      If the columns in
the data frames not used in
> merging
>     > have
>     >> > any
>     >> > > >> >>>>>      common names,
these have ?suffixes? (?".x"? and
> ?".y"?
>     > by
>     >> > > >> default)
>     >> > > >> >>>>>      appended to try
to make the names of the result
> unique.
>     >> > > >> >>>>>
>     >> > > >> >>>>> Since the first
"name" column was used in merging,
> leaving
>     > both
>     >> > > >> >>>>> without a suffix seems
consistent with the
> documentation...
>     >> > > >> >>>>>
>     >> > > >> >>>>> Frederick
>     >> > > >> >>>>>
>     >> > > >> >>>>> On Fri, Feb 16, 2018
at 09:08:29AM +1100, Scott Ritchie
>     > wrote:
>     >> > > >> >>>>>
>     >> > > >> >>>>>> Hi,
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> I was unable to
find a bug report for this with a
> cursory
>     >> > search,
>     >> > > >> but
>     >> > > >> >>>>>>
>     >> > > >> >>>>> would
>     >> > > >> >>>>>
>     >> > > >> >>>>>> like clarification
if this is intended or unavoidable
>     >> > behaviour:
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> ```{r}
>     >> > > >> >>>>>> # Create example
data.frames
>     >> > > >> >>>>>> parents <-
data.frame(name=c("Sarah", "Max", "Qin",
> "Lex"),
>     >> > > >> >>>>>>                   
sex=c("F", "M", "F", "M"),
>     >> > > >> >>>>>>                   
age=c(41, 43, 36, 51))
>     >> > > >> >>>>>> children <-
data.frame(parent=c("Sarah", "Max",
> "Qin"),
>     >> > > >> >>>>>>                   
name=c("Oliver", "Sebastian",
>     >> > "Kai-lee"),
>     >> > > >> >>>>>>                   
sex=c("M", "M", "F"),
>     >> > > >> >>>>>>                   
age=c(5,8,7))
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> # Merge() creates
a duplicated "name" column:
>     >> > > >> >>>>>> merge(parents,
children, by.x = "name", by.y > "parent")
>     >> > > >> >>>>>> ```
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> Output:
>     >> > > >> >>>>>> ```
>     >> > > >> >>>>>>     name sex.x
age.x      name sex.y age.y
>     >> > > >> >>>>>> 1   Max     M   
43 Sebastian     M     8
>     >> > > >> >>>>>> 2   Qin     F   
36   Kai-lee     F     7
>     >> > > >> >>>>>> 3 Sarah     F   
41    Oliver     M     5
>     >> > > >> >>>>>> Warning message:
>     >> > > >> >>>>>> In
merge.data.frame(parents, children, by.x = "name",
> by.y >     >> > > >> >>>>>>
"parent") :
>     >> > > >> >>>>>>    column name
?name? is duplicated in the result
>     >> > > >> >>>>>> ```
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> Kind Regards,
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> Scott Ritchie
>     >> > > >> >>>>>>
>     >> > > >> >>>>>>       
[[alternative HTML version deleted]]
>     >> > > >> >>>>>>
>     >> > > >> >>>>>>
______________________________________________
>     >> > > >> >>>>>> R-devel at
r-project.org mailing list
>     >> > > >> >>>>>>
https://stat.ethz.ch/mailman/listinfo/r-devel
>     >> > > >> >>>>>>
>     >> > > >> >>>>>>
>     >> > > >> >>>>>
>     >> > > >> >>>>
>     >> > > >> >>>>
>     >> > > >> >> Index: src/library/base/R/merge.R
>     >> > > >> >>>
=============================> =============================>     >>
> ======>     >> > > >> >>> ---
src/library/base/R/merge.R  (revision 74264)
>     >> > > >> >>> +++ src/library/base/R/merge.R
(working copy)
>     >> > > >> >>> @@ -157,6 +157,15 @@
>     >> > > >> >>>           }
>     >> > > >> >>>             if(has.common.nms)
names(y) <- nm.y
>     >> > > >> >>> +        ## If by.x %in%
names(y) then duplicate column
> names
>     >> > still
>     >> > > >> >>> arise,
>     >> > > >> >>> +        ## apply suffixes to
these
>     >> > > >> >>> +        dupe.keyx <-
intersect(nm.by, names(y))
>     >> > > >> >>> +        if(length(dupe.keyx))
{
>     >> > > >> >>> +         
if(nzchar(suffixes[1L]))
>     >> > > >> >>> +           
names(x)[match(dupe.keyx, names(x), 0L)] <-
>     >> > > >> >>> paste(dupe.keyx, suffixes[1L],
sep="")
>     >> > > >> >>> +         
if(nzchar(suffixes[2L]))
>     >> > > >> >>> +           
names(y)[match(dupe.keyx, names(y), 0L)] <-
>     >> > > >> >>> paste(dupe.keyx, suffixes[2L],
sep="")
>     >> > > >> >>> +        }
>     >> > > >> >>>           nm <- c(names(x),
names(y))
>     >> > > >> >>>           if(any(d <-
duplicated(nm)))
>     >> > > >> >>>               if(sum(d) >
1L)
>     >> > > >> >>>
>     >> > > >> >>
>     >> > > >> >>
______________________________________________
>     >> > > >> >> R-devel at r-project.org mailing
list
>     >> > > >> >>
https://stat.ethz.ch/mailman/listinfo/r-devel
>     >> > > >> >>
>     >> > > >> >>
>     >> > > >> >
>     >> > > >>
>     >> > > >>         [[alternative HTML version
deleted]]
>     >> > > >>
>     >> > > >>
______________________________________________
>     >> > > >> R-devel at r-project.org mailing list
>     >> > > >>
https://stat.ethz.ch/mailman/listinfo/r-devel
>     >> > > >>
>     >> > > >
>     >> > > >
>     >> > > >
>     >> > > > --
>     >> > > > Gabriel Becker, PhD
>     >> > > > Scientist (Bioinformatics)
>     >> > > > Genentech Research
>     >> > > >
>     >> >
>     >> > > Index: src/library/base/R/merge.R
>     >> > >
===========================================================> ======>    
>> > > --- src/library/base/R/merge.R        (revision 74264)
>     >> > > +++ src/library/base/R/merge.R        (working copy)
>     >> > > @@ -157,6 +157,15 @@
>     >> > >          }
>     >> > >
>     >> > >          if(has.common.nms) names(y) <- nm.y
>     >> > > +        ## If by.x %in% names(y) then duplicate
column names
> still
>     >> > arise,
>     >> > > +        ## apply suffixes to these
>     >> > > +        dupe.keyx <- intersect(nm.by, names(y))
>     >> > > +        if(length(dupe.keyx)) {
>     >> > > +          if(nzchar(suffixes[1L]))
>     >> > > +            names(x)[match(dupe.keyx, names(x),
0L)] <-
>     >> > paste(dupe.keyx, suffixes[1L], sep="")
>     >> > > +          if(nzchar(suffixes[2L]))
>     >> > > +            names(y)[match(dupe.keyx, names(y),
0L)] <-
>     >> > paste(dupe.keyx, suffixes[2L], sep="")
>     >> > > +        }
>     >> > >          nm <- c(names(x), names(y))
>     >> > >          if(any(d <- duplicated(nm)))
>     >> > >              if(sum(d) > 1L)
>     >> >
>     >> > > ______________________________________________
>     >> > > R-devel at r-project.org mailing list
>     >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
>     >> >
>     >> >
>
>     >> Index: src/library/base/R/merge.R
>     >>
==================================================================>    
>> --- src/library/base/R/merge.R        (revision 74280)
>     >> +++ src/library/base/R/merge.R        (working copy)
>     >> @@ -157,6 +157,14 @@
>     >> }
>     >>
>     >> if(has.common.nms) names(y) <- nm.y
>     >> +        ## If by.x %in% names(y) then duplicate column names
still
> arise,
>     >> +        ## apply suffixes to just y - this keeps backwards
> compatibility
>     >> +        ## when referring to by.x in the resulting data.frame
>     >> +        dupe.keyx <- intersect(nm.by, names(y))
>     >> +        if(length(dupe.keyx)) {
>     >> +          if(nzchar(suffixes[2L]))
>     >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> paste(dupe.keyx,
>     > suffixes[2L], sep="")
>     >> +        }
>     >> nm <- c(names(x), names(y))
>     >> if(any(d <- duplicated(nm)))
>     >> if(sum(d) > 1L)
>
>     > [[alternative HTML version deleted]]
>
>     > ______________________________________________
>     > R-devel at r-project.org mailing list
>     > https://stat.ethz.ch/mailman/listinfo/r-devel
>
	[[alternative HTML version deleted]]

Martin Maechler

2018-Feb-23 14:03 UTC

head link

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

>>>>> Scott Ritchie <s.ritchie73 at gmail.com>
>>>>>     on Fri, 23 Feb 2018 12:32:41 +1100 writes:
    > Thanks Martin!
    > Can you clarify the functionality of the 'no.dups' argument so
I can change
    > my patch to `data.table:::merge.data.table` accordingly?

    > - When `no.dups=TRUE` will the suffix to the by.x column name? Or will
it
    > take the functionality of the second functionality where only the
column in
    > y has the suffix added?
    > - When `no.dups=FALSE` will the output be the same as it currently (no
    > suffix added to either column)? Or will add the suffix to the column in
y?

I had started from your patch... and worked from there.
So, there's no need (and use) to provide another one.

I also needed to update the man page, add a regression test, add
an entry to NEWS.Rd ...

Just wait until I commit..
Martin




    > Best,

    > Scott

    > On 22 February 2018 at 22:31, Martin Maechler <maechler at
stat.math.ethz.ch>
    > wrote:

    >> >>>>> Gabriel Becker <gmbecker at ucdavis.edu>
    >> >>>>>     on Wed, 21 Feb 2018 07:11:44 -0800 writes:
    >> 
    >> > Hi all,
    >> > For the record this approach isnt 100% backwards compatible,
because
    >> > names(mergeddf) will e incompatibly different. Thatx why i
claimed
    >> > bakcwards compatable-ish
    >> 
    >> exactly.
    >> 
    >> > That said its still worth considering imho because of the
reasons
    >> stated
    >> > (and honestly one particular simple reading of the docs might
    >> suggest that
    >> > this was thr intended behavior all along). Im not a member of
Rcore
    >> through
    >> > so i cant do said considering myself.
    >> 
    >> I agree with Scott, Frederik and you that this changes seems
    >> worth considering.
    >> As Duncan Murdoch has mentioned, this alone may not be
    >> sufficient.
    >> 
    >> In addition to your proposed patch (which I have simplified, not
    >> using intersection() but working with underlying  match()
    >> directly), it is little work to introduce an extra argument,
I'm
    >> calling  'no.dups = TRUE'  which when set to false would
mirror
    >> current R's behavior... and documenting it, then also documents
the
    >> new behavior (to some extent).
    >> 
    >> My plan is to commit it soonish ;-)
    >> Martin
    >> 
    >> > Best,
    >> > ~G
    >> 
    >> > On Feb 20, 2018 7:15 PM, <frederik at ofb.net> wrote:
    >> 
    >> > Hi Scott,
    >> 
    >> > I tried the new patch and can confirm that it has the
advertised
    >> > behavior on a couple of test cases. I think it makes sense to
apply
    >> > it, because any existing code which refers to a second
duplicate
    >> > data.frame column by name is already broken, while if the
reference
    >> is
    >> > by numerical index then changing the column name shouldn't
break it.
    >> 
    >> > I don't know if you need to update the documentation as
part of your
    >> > patch, or if whoever applies it would be happy to do that.
Somebody
    >> > from R core want to weigh in on this?
    >> 
    >> > I attach a file with the test example from your original email
as
    >> well
    >> > as a second test case I added with two "by" columns.
    >> 
    >> > Thanks,
    >> 
    >> > Frederick
    >> 
    >> > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
    >> >> Hi Frederick,
    >> >>
    >> >> It looks like I didn't overwrite the patch.diff file
after the last
    >> edits.
    >> >> Here's the correct patch (attached and copied below):
    >> >>
    >> >> Index: src/library/base/R/merge.R
    >> >>
==================================================================    >>
>> --- src/library/base/R/merge.R (revision 74280)
    >> >> +++ src/library/base/R/merge.R (working copy)
    >> >> @@ -157,6 +157,14 @@
    >> >> }
    >> >>
    >> >> if(has.common.nms) names(y) <- nm.y
    >> >> +        ## If by.x %in% names(y) then duplicate column
names still
    >> arise,
    >> >> +        ## apply suffixes to just y - this keeps
backwards
    >> compatibility
    >> >> +        ## when referring to by.x in the resulting
data.frame
    >> >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> +        if(length(dupe.keyx)) {
    >> >> +          if(nzchar(suffixes[2L]))
    >> >> +            names(y)[match(dupe.keyx, names(y), 0L)]
<-
    >> paste(dupe.keyx,
    >> >> suffixes[2L], sep="")
    >> >> +        }
    >> >> nm <- c(names(x), names(y))
    >> >> if(any(d <- duplicated(nm)))
    >> >> if(sum(d) > 1L)
    >> >>
    >> >> Best,
    >> >>
    >> >> Scott
    >> >>
    >> >> On 21 February 2018 at 08:23, <frederik at ofb.net>
wrote:
    >> >>
    >> >> > Hi Scott,
    >> >> >
    >> >> > I think that's a good idea and I tried your patch
on my copy of
    >> the
    >> >> > repository. But it looks to me like the recent patch
is identical
    >> to
    >> >> > the previous one, can you confirm this?
    >> >> >
    >> >> > Frederick
    >> >> >
    >> >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott
Ritchie wrote:
    >> >> > > Thanks Gabriel,
    >> >> > >
    >> >> > > I think your suggested approach is 100%
backwards compatible
    >> >> > >
    >> >> > > Currently in the case of duplicate column names
only the first
    >> can be
    >> >> > > indexed by its name. This will always be the
column appearing
    >> in by.x,
    >> >> > > meaning the column in y with the same name
cannot be accessed.
    >> > Appending
    >> >> > > ".y" (suffixes[2L]) to this column
means it can now be
    >> accessed, while
    >> >> > > keeping the current behaviour of making the key
columns always
    >> > accessible
    >> >> > > by using the names provided to by.x.
    >> >> > >
    >> >> > > I've attached a new patch that has this
behaviour.
    >> >> > >
    >> >> > > Best,
    >> >> > >
    >> >> > > Scott
    >> >> > >
    >> >> > > On 19 February 2018 at 05:08, Gabriel Becker
<
    >> gmbecker at ucdavis.edu>
    >> >> > wrote:
    >> >> > >
    >> >> > > > It seems like there is a way that is
backwards compatible-ish
    >> in the
    >> >> > sense
    >> >> > > > mentioned and still has the (arguably, but
a good argument I
    >> think)
    >> >> > better
    >> >> > > > behavior:
    >> >> > > >
    >> >> > > > if by.x is 'name', (AND by.y is not
also 'name'), then x's
    >> 'name'
    >> >> > column
    >> >> > > > is called name and y's 'name'
column (not used int he merge)
    >> is
    >> >> > changed to
    >> >> > > > name.y.
    >> >> > > >
    >> >> > > > Now of course this would still change
output, but it would
    >> change
    >> > it to
    >> >> > > > something I think would be better, while
retaining the 'merge
    >> > columns
    >> >> > > > retain their exact names' mechanic as
documented.
    >> >> > > >
    >> >> > > > ~G
    >> >> > > >
    >> >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott
Ritchie <
    >> > s.ritchie73 at gmail.com>
    >> >> > > > wrote:
    >> >> > > >
    >> >> > > >> Thanks Duncan and Frederick,
    >> >> > > >>
    >> >> > > >> I suspected as much - there doesn't
appear to be any reason
    >> why
    >> >> > conflicts
    >> >> > > >> between by.x and names(y) shouldn't
and cannot be checked,
    >> but I
    >> > can
    >> >> > see
    >> >> > > >> how this might be more trouble than its
worth given it
    >> potentially
    >> > may
    >> >> > > >> break downstream packages (i.e. any
cases where this occurs
    >> but
    >> > they
    >> >> > > >> expect
    >> >> > > >> the name of the key column(s) to remain
the same).
    >> >> > > >>
    >> >> > > >> Best,
    >> >> > > >>
    >> >> > > >> Scott
    >> >> > > >>
    >> >> > > >> On 18 February 2018 at 11:48, Duncan
Murdoch <
    >> >> > murdoch.duncan at gmail.com>
    >> >> > > >> wrote:
    >> >> > > >>
    >> >> > > >> > On 17/02/2018 6:36 PM, frederik at
ofb.net wrote:
    >> >> > > >> >
    >> >> > > >> >> Hi Scott,
    >> >> > > >> >>
    >> >> > > >> >> Thanks for the patch. I'm
not really involved in R
    >> development;
    >> > it
    >> >> > > >> >> will be up to someone in the R
core team to apply it. I
    >> would
    >> >> > hazard
    >> >> > > >> >> to say that even if correct (I
haven't checked), it will
    >> not be
    >> >> > > >> >> applied because the change
might break existing code. For
    >> > example
    >> >> > it
    >> >> > > >> >> seems like reasonable code
might easily assume that a
    >> column
    >> > with
    >> >> > the
    >> >> > > >> >> same name as "by.x"
exists in the output of 'merge'.
    >> That's
    >> > just my
    >> >> > > >> >> best guess... I don't
participate on here often.
    >> >> > > >> >>
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> > I think you're right.  If I
were still a member of R Core,
    >> I
    >> > would
    >> >> > want
    >> >> > > >> to
    >> >> > > >> > test this against all packages on
CRAN and Bioconductor,
    >> and
    >> > since
    >> >> > that
    >> >> > > >> > test takes a couple of days to run
on my laptop, I'd
    >> probably
    >> > never
    >> >> > get
    >> >> > > >> > around to it.
    >> >> > > >> >
    >> >> > > >> > There are lots of cases where
"I would have done that
    >> > differently",
    >> >> > but
    >> >> > > >> > most of them are far too much
trouble to change now that R
    >> is
    >> > more
    >> >> > than
    >> >> > > >> 20
    >> >> > > >> > years old.  And in many cases it
will turn out that the
    >> way R
    >> > does
    >> >> > it
    >> >> > > >> > actually does make more sense than
the way I would have
    >> done it.
    >> >> > > >> >
    >> >> > > >> > Duncan Murdoch
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> >> Cheers,
    >> >> > > >> >>
    >> >> > > >> >> Frederick
    >> >> > > >> >>
    >> >> > > >> >> On Sat, Feb 17, 2018 at
04:42:21PM +1100, Scott Ritchie
    >> wrote:
    >> >> > > >> >>
    >> >> > > >> >>> The attached patch.diff
will make merge.data.frame()
    >> append the
    >> >> > > >> suffixes
    >> >> > > >> >>> to
    >> >> > > >> >>> columns with common names
between by.x and names(y).
    >> >> > > >> >>>
    >> >> > > >> >>> Best,
    >> >> > > >> >>>
    >> >> > > >> >>> Scott Ritchie
    >> >> > > >> >>>
    >> >> > > >> >>> On 17 February 2018 at
11:15, Scott Ritchie <
    >> >> > s.ritchie73 at gmail.com>
    >> >> > > >> >>> wrote:
    >> >> > > >> >>>
    >> >> > > >> >>> Hi Frederick,
    >> >> > > >> >>>>
    >> >> > > >> >>>> I would expect that
any duplicate names in the resulting
    >> >> > data.frame
    >> >> > > >> >>>> would
    >> >> > > >> >>>> have the suffixes
appended to them, regardless of
    >> whether or
    >> > not
    >> >> > they
    >> >> > > >> >>>> are
    >> >> > > >> >>>> used as the join key.
So in my example I would expect
    >> > "names.x"
    >> >> > and
    >> >> > > >> >>>> "names.y" to
indicate their source data.frame.
    >> >> > > >> >>>>
    >> >> > > >> >>>> While careful reading
of the documentation reveals this
    >> is not
    >> >> > the
    >> >> > > >> >>>> case, I
    >> >> > > >> >>>> would argue the intent
of the suffixes functionality
    >> should
    >> >> > equally
    >> >> > > >> be
    >> >> > > >> >>>> applied to this type
of case.
    >> >> > > >> >>>>
    >> >> > > >> >>>> If you agree this
would be useful, I'm happy to write a
    >> patch
    >> > for
    >> >> > > >> >>>> merge.data.frame that
will add suffixes in this case - I
    >> > intend
    >> >> > to do
    >> >> > > >> >>>> the
    >> >> > > >> >>>> same for
merge.data.table in the data.table package
    >> where I
    >> >> > initially
    >> >> > > >> >>>> encountered the edge
case.
    >> >> > > >> >>>>
    >> >> > > >> >>>> Best,
    >> >> > > >> >>>>
    >> >> > > >> >>>> Scott
    >> >> > > >> >>>>
    >> >> > > >> >>>> On 17 February 2018 at
03:53, <frederik at ofb.net> wrote:
    >> >> > > >> >>>>
    >> >> > > >> >>>> Hi Scott,
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> It seems like
reasonable behavior to me. What result
    >> would
    >> > you
    >> >> > > >> expect?
    >> >> > > >> >>>>> That the second
"name" should be called "name.y"?
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> The
"merge" documentation says:
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>      If the
columns in the data frames not used in
    >> merging
    >> > have
    >> >> > any
    >> >> > > >> >>>>>      common names,
these have ?suffixes? (?".x"? and
    >> ?".y"?
    >> > by
    >> >> > > >> default)
    >> >> > > >> >>>>>      appended to
try to make the names of the result
    >> unique.
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> Since the first
"name" column was used in merging,
    >> leaving
    >> > both
    >> >> > > >> >>>>> without a suffix
seems consistent with the
    >> documentation...
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> Frederick
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> On Fri, Feb 16,
2018 at 09:08:29AM +1100, Scott Ritchie
    >> > wrote:
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>> Hi,
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> I was unable
to find a bug report for this with a
    >> cursory
    >> >> > search,
    >> >> > > >> but
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>> would
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>> like
clarification if this is intended or unavoidable
    >> >> > behaviour:
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> ```{r}
    >> >> > > >> >>>>>> # Create
example data.frames
    >> >> > > >> >>>>>> parents <-
data.frame(name=c("Sarah", "Max", "Qin",
    >> "Lex"),
    >> >> > > >> >>>>>>               
sex=c("F", "M", "F", "M"),
    >> >> > > >> >>>>>>               
age=c(41, 43, 36, 51))
    >> >> > > >> >>>>>> children <-
data.frame(parent=c("Sarah", "Max",
    >> "Qin"),
    >> >> > > >> >>>>>>               
name=c("Oliver", "Sebastian",
    >> >> > "Kai-lee"),
    >> >> > > >> >>>>>>               
sex=c("M", "M", "F"),
    >> >> > > >> >>>>>>               
age=c(5,8,7))
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> # Merge()
creates a duplicated "name" column:
    >> >> > > >> >>>>>> merge(parents,
children, by.x = "name", by.y     >> "parent")
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Output:
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>     name sex.x
age.x      name sex.y age.y
    >> >> > > >> >>>>>> 1   Max     M 
43 Sebastian     M     8
    >> >> > > >> >>>>>> 2   Qin     F 
36   Kai-lee     F     7
    >> >> > > >> >>>>>> 3 Sarah     F 
41    Oliver     M     5
    >> >> > > >> >>>>>> Warning
message:
    >> >> > > >> >>>>>> In
merge.data.frame(parents, children, by.x = "name",
    >> by.y     >> >> > > >>
>>>>>> "parent") :
    >> >> > > >> >>>>>>    column name
?name? is duplicated in the result
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Kind Regards,
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Scott Ritchie
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>>       
[[alternative HTML version deleted]]
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>>
______________________________________________
    >> >> > > >> >>>>>> R-devel at
r-project.org mailing list
    >> >> > > >> >>>>>>
https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>
    >> >> > > >> >>>>
    >> >> > > >> >>>>
    >> >> > > >> >> Index:
src/library/base/R/merge.R
    >> >> > > >> >>>
=============================    >> =============================   
>> >> > ======    >> >> > > >>
>>> --- src/library/base/R/merge.R  (revision 74264)
    >> >> > > >> >>> +++
src/library/base/R/merge.R  (working copy)
    >> >> > > >> >>> @@ -157,6 +157,15 @@
    >> >> > > >> >>>           }
    >> >> > > >> >>>            
if(has.common.nms) names(y) <- nm.y
    >> >> > > >> >>> +        ## If by.x %in%
names(y) then duplicate column
    >> names
    >> >> > still
    >> >> > > >> >>> arise,
    >> >> > > >> >>> +        ## apply suffixes
to these
    >> >> > > >> >>> +        dupe.keyx <-
intersect(nm.by, names(y))
    >> >> > > >> >>> +       
if(length(dupe.keyx)) {
    >> >> > > >> >>> +         
if(nzchar(suffixes[1L]))
    >> >> > > >> >>> +           
names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> >> > > >> >>> paste(dupe.keyx,
suffixes[1L], sep="")
    >> >> > > >> >>> +         
if(nzchar(suffixes[2L]))
    >> >> > > >> >>> +           
names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> >> > > >> >>> paste(dupe.keyx,
suffixes[2L], sep="")
    >> >> > > >> >>> +        }
    >> >> > > >> >>>           nm <-
c(names(x), names(y))
    >> >> > > >> >>>           if(any(d <-
duplicated(nm)))
    >> >> > > >> >>>               if(sum(d)
> 1L)
    >> >> > > >> >>>
    >> >> > > >> >>
    >> >> > > >> >>
______________________________________________
    >> >> > > >> >> R-devel at r-project.org
mailing list
    >> >> > > >> >>
https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >> >>
    >> >> > > >> >>
    >> >> > > >> >
    >> >> > > >>
    >> >> > > >>         [[alternative HTML version
deleted]]
    >> >> > > >>
    >> >> > > >>
______________________________________________
    >> >> > > >> R-devel at r-project.org mailing list
    >> >> > > >>
https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >>
    >> >> > > >
    >> >> > > >
    >> >> > > >
    >> >> > > > --
    >> >> > > > Gabriel Becker, PhD
    >> >> > > > Scientist (Bioinformatics)
    >> >> > > > Genentech Research
    >> >> > > >
    >> >> >
    >> >> > > Index: src/library/base/R/merge.R
    >> >> > >
===========================================================    >> ======  
>> >> > > --- src/library/base/R/merge.R        (revision
74264)
    >> >> > > +++ src/library/base/R/merge.R        (working
copy)
    >> >> > > @@ -157,6 +157,15 @@
    >> >> > >          }
    >> >> > >
    >> >> > >          if(has.common.nms) names(y) <- nm.y
    >> >> > > +        ## If by.x %in% names(y) then duplicate
column names
    >> still
    >> >> > arise,
    >> >> > > +        ## apply suffixes to these
    >> >> > > +        dupe.keyx <- intersect(nm.by,
names(y))
    >> >> > > +        if(length(dupe.keyx)) {
    >> >> > > +          if(nzchar(suffixes[1L]))
    >> >> > > +            names(x)[match(dupe.keyx, names(x),
0L)] <-
    >> >> > paste(dupe.keyx, suffixes[1L], sep="")
    >> >> > > +          if(nzchar(suffixes[2L]))
    >> >> > > +            names(y)[match(dupe.keyx, names(y),
0L)] <-
    >> >> > paste(dupe.keyx, suffixes[2L], sep="")
    >> >> > > +        }
    >> >> > >          nm <- c(names(x), names(y))
    >> >> > >          if(any(d <- duplicated(nm)))
    >> >> > >              if(sum(d) > 1L)
    >> >> >
    >> >> > > ______________________________________________
    >> >> > > R-devel at r-project.org mailing list
    >> >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> >
    >> >> >
    >> 
    >> >> Index: src/library/base/R/merge.R
    >> >>
==================================================================    >>
>> --- src/library/base/R/merge.R        (revision 74280)
    >> >> +++ src/library/base/R/merge.R        (working copy)
    >> >> @@ -157,6 +157,14 @@
    >> >> }
    >> >>
    >> >> if(has.common.nms) names(y) <- nm.y
    >> >> +        ## If by.x %in% names(y) then duplicate column
names still
    >> arise,
    >> >> +        ## apply suffixes to just y - this keeps
backwards
    >> compatibility
    >> >> +        ## when referring to by.x in the resulting
data.frame
    >> >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> +        if(length(dupe.keyx)) {
    >> >> +          if(nzchar(suffixes[2L]))
    >> >> +            names(y)[match(dupe.keyx, names(y), 0L)]
<-
    >> paste(dupe.keyx,
    >> > suffixes[2L], sep="")
    >> >> +        }
    >> >> nm <- c(names(x), names(y))
    >> >> if(any(d <- duplicated(nm)))
    >> >> if(sum(d) > 1L)
    >> 
    >> > [[alternative HTML version deleted]]
    >> 
    >> > ______________________________________________
    >> > R-devel at r-project.org mailing list
    >> > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> 

    > [[alternative HTML version deleted]]

Seemingly Similar Threads

Search for more apparently analagous threads

R devel - Feb 2018 - Duplicate column names created by base::merge() when by.x has the same name as a column in y

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Seemingly Similar Threads