thr3ads.net - R devel - [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y [Feb 2018]

frederik at ofb.net

2018-Feb-17 23:36 UTC

[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Hi Scott,

Thanks for the patch. I'm not really involved in R development; it
will be up to someone in the R core team to apply it. I would hazard
to say that even if correct (I haven't checked), it will not be
applied because the change might break existing code. For example it
seems like reasonable code might easily assume that a column with the
same name as "by.x" exists in the output of 'merge'. That's just my
best guess... I don't participate on here often.

Cheers,

Frederick
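
To make the compatibility concern above concrete, here is a minimal sketch using the example data from the original report. It is only an illustration under stated assumptions: the renaming step simulates by hand what the proposed patch would produce, `simulated` is a name introduced here, and `stringsAsFactors = FALSE` is set so the result does not depend on the R version. As far as I know, R 3.5.0 and later have a `no.dups` argument to `merge()`; it is passed as `FALSE` so the sketch reproduces the behaviour discussed in this thread (older versions should simply ignore it).

```r
parents <- data.frame(name = c("Sarah", "Max", "Qin", "Lex"),
                      sex = c("F", "M", "F", "M"),
                      age = c(41, 43, 36, 51),
                      stringsAsFactors = FALSE)
children <- data.frame(parent = c("Sarah", "Max", "Qin"),
                       name = c("Oliver", "Sebastian", "Kai-lee"),
                       sex = c("M", "M", "F"),
                       age = c(5, 8, 7),
                       stringsAsFactors = FALSE)

# Behaviour discussed in the thread: the key column keeps the name given
# in by.x, and merge() warns that "name" is duplicated (silenced here).
merged <- suppressWarnings(
  merge(parents, children, by.x = "name", by.y = "parent", no.dups = FALSE)
)

# Downstream code that looks the key up by its original name works,
# because [[ returns the first column called "name".
key_now <- merged[["name"]]

# Simulate the patched naming by hand (assumption: default suffixes).
simulated <- merged
names(simulated)[names(simulated) == "name"] <- c("name.x", "name.y")

# The same lookup now finds nothing, which is the kind of breakage
# existing code could run into.
key_patched <- simulated[["name"]]
```

Here `key_now` holds the parents' names, while `key_patched` is `NULL` because no column is called exactly "name" any more.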

On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> The attached patch.diff will make merge.data.frame() append the suffixes to
> columns with common names between by.x and names(y).
> 
> Best,
> 
> Scott Ritchie
> 
> On 17 February 2018 at 11:15, Scott Ritchie <s.ritchie73 at gmail.com> wrote:
> 
> > Hi Frederick,
> >
> > I would expect that any duplicate names in the resulting data.frame would
> > have the suffixes appended to them, regardless of whether or not they are
> > used as the join key. So in my example I would expect "name.x" and
> > "name.y" to indicate their source data.frame.
> >
> > While careful reading of the documentation reveals this is not the case, I
> > would argue the intent of the suffixes functionality should equally be
> > applied to this type of case.
> >
> > If you agree this would be useful, I'm happy to write a patch for
> > merge.data.frame that will add suffixes in this case - I intend to do the
> > same for merge.data.table in the data.table package where I initially
> > encountered the edge case.
> >
> > Best,
> >
> > Scott
> >
> > On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
> >
> >> Hi Scott,
> >>
> >> It seems like reasonable behavior to me. What result would you expect?
> >> That the second "name" should be called "name.y"?
> >>
> >> The "merge" documentation says:
> >>
> >>     If the columns in the data frames not used in merging have any
> >>     common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default)
> >>     appended to try to make the names of the result unique.
> >>
> >> Since the first "name" column was used in merging, leaving both
> >> without a suffix seems consistent with the documentation...
> >>
> >> Frederick
> >>
> >> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> >> > Hi,
> >> >
> >> > I was unable to find a bug report for this with a cursory search, but
> >> > would like clarification if this is intended or unavoidable behaviour:
> >> >
> >> > ```{r}
> >> > # Create example data.frames
> >> > parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> >> >                       sex=c("F", "M", "F", "M"),
> >> >                       age=c(41, 43, 36, 51))
> >> > children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> >> >                        name=c("Oliver", "Sebastian", "Kai-lee"),
> >> >                        sex=c("M", "M", "F"),
> >> >                        age=c(5,8,7))
> >> >
> >> > # Merge() creates a duplicated "name" column:
> >> > merge(parents, children, by.x = "name", by.y = "parent")
> >> > ```
> >> >
> >> > Output:
> >> > ```
> >> >    name sex.x age.x      name sex.y age.y
> >> > 1   Max     M    43 Sebastian     M     8
> >> > 2   Qin     F    36   Kai-lee     F     7
> >> > 3 Sarah     F    41    Oliver     M     5
> >> > Warning message:
> >> > In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
> >> >   column name ‘name’ is duplicated in the result
> >> > ```
> >> >
> >> > Kind Regards,
> >> >
> >> > Scott Ritchie
> >> >
> >> >       [[alternative HTML version deleted]]
> >> >
> >> > ______________________________________________
> >> > R-devel at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >
> >>
> >
> >
> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R	(revision 74264)
> +++ src/library/base/R/merge.R	(working copy)
> @@ -157,6 +157,15 @@
>          }
>  
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to these
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[1L]))
> +            names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], sep="")
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)
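
For anyone who hits the same duplicate-name warning and cannot rely on a patched merge(), one possible workaround is to rename the clashing column in y before merging. The sketch below is an assumption-laden illustration, not the patch itself: `merge_no_clash` is a hypothetical helper (not part of base R), it assumes `by.x` and `by.y` are given as character column names, and unlike the patch it leaves the key column's name untouched and only suffixes the column coming from y.

```r
# Hypothetical helper (not part of base R): suffix any non-key column of y
# whose name collides with a key column of x, then merge as usual.
# Assumes by.x and by.y are character vectors of column names.
merge_no_clash <- function(x, y, by.x, by.y, suffix = ".y", ...) {
  non_key_y <- setdiff(names(y), by.y)   # columns of y that survive the merge
  clash <- intersect(by.x, non_key_y)    # key names that would be duplicated
  idx <- match(clash, names(y))
  names(y)[idx] <- paste0(clash, suffix) # no-op when there is no clash
  merge(x, y, by.x = by.x, by.y = by.y, ...)
}

parents <- data.frame(name = c("Sarah", "Max", "Qin", "Lex"),
                      sex = c("F", "M", "F", "M"),
                      age = c(41, 43, 36, 51),
                      stringsAsFactors = FALSE)
children <- data.frame(parent = c("Sarah", "Max", "Qin"),
                       name = c("Oliver", "Sebastian", "Kai-lee"),
                       sex = c("M", "M", "F"),
                       age = c(5, 8, 7),
                       stringsAsFactors = FALSE)

res <- merge_no_clash(parents, children, by.x = "name", by.y = "parent")
names(res)
```

Because the key column keeps its name, code that expects a column called "name" in the result continues to work, while the child's name ends up in "name.y".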

Duncan Murdoch

2018-Feb-18 00:48 UTC

On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
> Hi Scott,
> 
> Thanks for the patch. I'm not really involved in R development; it
> will be up to someone in the R core team to apply it. I would hazard
> to say that even if correct (I haven't checked), it will not be
> applied because the change might break existing code. For example it
> seems like reasonable code might easily assume that a column with the
> same name as "by.x" exists in the output of 'merge'. That's just my
> best guess... I don't participate on here often.

I think you're right.  If I were still a member of R Core, I would want 
to test this against all packages on CRAN and Bioconductor, and since 
that test takes a couple of days to run on my laptop, I'd probably never 
get around to it.

There are lots of cases where "I would have done that differently", but
most of them are far too much trouble to change now that R is more than 
20 years old.  And in many cases it will turn out that the way R does it 
actually does make more sense than the way I would have done it.

Duncan Murdoch

Scott Ritchie

2018-Feb-18 02:50 UTC

Thanks Duncan and Frederick,

I suspected as much - there doesn't appear to be any reason why conflicts
between by.x and names(y) shouldn't or couldn't be checked, but I can see
how this might be more trouble than it's worth given it may potentially
break downstream packages (i.e. any cases where this occurs but they expect
the name of the key column(s) to remain the same).

Best,

Scott
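
The kind of check mentioned above can also be done in user code before calling merge(). This is a rough sketch only: `key_name_clashes` is a hypothetical helper, it assumes `by.x` and `by.y` are character vectors, and the small data frames are made up for illustration.

```r
# Hypothetical helper: report key names from x that would collide with
# non-key columns of y in the merged result.
# Assumes by.x and by.y are character vectors of column names.
key_name_clashes <- function(x, y, by.x, by.y) {
  stopifnot(all(by.x %in% names(x)), all(by.y %in% names(y)))
  intersect(by.x, setdiff(names(y), by.y))
}

parents <- data.frame(name = c("Sarah", "Max"), age = c(41, 43))
children <- data.frame(parent = c("Sarah", "Max"),
                       name = c("Oliver", "Sebastian"))

# "name" is a key in parents and an ordinary column in children.
clashes <- key_name_clashes(parents, children,
                            by.x = "name", by.y = "parent")

# With only the key column left in y there is nothing to collide with.
none <- key_name_clashes(parents, children[, "parent", drop = FALSE],
                         by.x = "name", by.y = "parent")
```

A caller could warn or rename when `clashes` is non-empty, which leaves merge() itself and any code depending on its current naming unchanged.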

