Scott Ritchie
2018-Feb-17 00:15 UTC
[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y
Hi Frederick,

I would expect that any duplicate names in the resulting data.frame would have the suffixes appended to them, regardless of whether or not they are used as the join key. So in my example I would expect "name.x" and "name.y" to indicate their source data.frame.

While careful reading of the documentation reveals this is not the case, I would argue the intent of the suffixes functionality should equally be applied to this type of case.

If you agree this would be useful, I'm happy to write a patch for merge.data.frame that will add suffixes in this case - I intend to do the same for merge.data.table in the data.table package where I initially encountered the edge case.

Best,

Scott

On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:

> Hi Scott,
>
> It seems like reasonable behavior to me. What result would you expect?
> That the second "name" should be called "name.y"?
>
> The "merge" documentation says:
>
>     If the columns in the data frames not used in merging have any
>     common names, these have 'suffixes' ('".x"' and '".y"' by default)
>     appended to try to make the names of the result unique.
>
> Since the first "name" column was used in merging, leaving both
> without a suffix seems consistent with the documentation...
>
> Frederick
>
> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> > Hi,
> >
> > I was unable to find a bug report for this with a cursory search, but would
> > like clarification if this is intended or unavoidable behaviour:
> >
> > ```{r}
> > # Create example data.frames
> > parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> >                       sex=c("F", "M", "F", "M"),
> >                       age=c(41, 43, 36, 51))
> > children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> >                        name=c("Oliver", "Sebastian", "Kai-lee"),
> >                        sex=c("M", "M", "F"),
> >                        age=c(5,8,7))
> >
> > # Merge() creates a duplicated "name" column:
> > merge(parents, children, by.x = "name", by.y = "parent")
> > ```
> >
> > Output:
> > ```
> >    name sex.x age.x      name sex.y age.y
> > 1   Max     M    43 Sebastian     M     8
> > 2   Qin     F    36   Kai-lee     F     7
> > 3 Sarah     F    41    Oliver     M     5
> > Warning message:
> > In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
> >   column name 'name' is duplicated in the result
> > ```
> >
> > Kind Regards,
> >
> > Scott Ritchie
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
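A workaround that does not depend on how merge() treats the key column is to rename the clashing columns before merging. The following is a minimal sketch using the data from the example above; the object names `parents2` and `children2`, and the choice of ".x" / ".y" as hand-applied suffixes, are illustrative assumptions rather than anything merge() does itself.

```r
# Example data from the original post (stringsAsFactors set explicitly so the
# result is the same across R versions)
parents <- data.frame(name = c("Sarah", "Max", "Qin", "Lex"),
                      sex = c("F", "M", "F", "M"),
                      age = c(41, 43, 36, 51),
                      stringsAsFactors = FALSE)
children <- data.frame(parent = c("Sarah", "Max", "Qin"),
                       name = c("Oliver", "Sebastian", "Kai-lee"),
                       sex = c("M", "M", "F"),
                       age = c(5, 8, 7),
                       stringsAsFactors = FALSE)

# Rename the key column in x and the clashing non-key column in y by hand
parents2 <- parents
names(parents2)[names(parents2) == "name"] <- "name.x"
children2 <- children
names(children2)[names(children2) == "name"] <- "name.y"

# The key column of the result takes its name from x; "sex" and "age" are
# common non-key columns, so merge() adds the default suffixes to them
res <- merge(parents2, children2, by.x = "name.x", by.y = "parent")
print(names(res))
```

With this renaming no two columns of the result share a name, so the duplicated-name warning should not be triggered.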
Scott Ritchie
2018-Feb-17 05:42 UTC
[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y
The attached patch.diff will make merge.data.frame() append the suffixes to columns with common names between by.x and names(y).

Best,

Scott Ritchie

Attachment: patch.diff
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20180217/efabe387/attachment.ksh>
frederik at ofb.net
2018-Feb-17 23:36 UTC
[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y
Hi Scott,

Thanks for the patch. I'm not really involved in R development; it will be up to someone in the R core team to apply it. I would hazard to say that even if correct (I haven't checked), it will not be applied because the change might break existing code. For example it seems like reasonable code might easily assume that a column with the same name as "by.x" exists in the output of 'merge'. That's just my best guess... I don't participate on here often.

Cheers,

Frederick

On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> The attached patch.diff will make merge.data.frame() append the suffixes to
> columns with common names between by.x and names(y).
>
> Best,
>
> Scott Ritchie

> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R (revision 74264)
> +++ src/library/base/R/merge.R (working copy)
> @@ -157,6 +157,15 @@
>          }
>
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to these
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[1L]))
> +            names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], sep="")
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)
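The renaming step in the patch can also be sketched as a standalone wrapper around merge(), without modifying base R. This is only an illustration of the idea, not the patch itself: `merge_with_key_suffixes` is a hypothetical helper, it assumes by.x and by.y are given as column names, and unlike the patch (which runs inside merge.data.frame after the join) it also has to update by.x so that the renamed key can still be found.

```r
# Hypothetical helper: suffix any by.x column whose name also appears among
# the non-key columns of y, then delegate to merge().
merge_with_key_suffixes <- function(x, y, by.x, by.y,
                                    suffixes = c(".x", ".y"), ...) {
  stopifnot(is.character(by.x), is.character(by.y))
  # Columns of y that survive the merge (the by.y columns are dropped)
  y.rest <- setdiff(names(y), by.y)
  dupe.keyx <- intersect(by.x, y.rest)
  if (length(dupe.keyx)) {
    if (nzchar(suffixes[1L])) {
      new.x <- paste0(dupe.keyx, suffixes[1L])
      names(x)[match(dupe.keyx, names(x))] <- new.x
      by.x[match(dupe.keyx, by.x)] <- new.x   # keep the key lookup valid
    }
    if (nzchar(suffixes[2L])) {
      names(y)[match(dupe.keyx, names(y))] <- paste0(dupe.keyx, suffixes[2L])
    }
  }
  merge(x, y, by.x = by.x, by.y = by.y, suffixes = suffixes, ...)
}

# Example data from the original post
parents <- data.frame(name = c("Sarah", "Max", "Qin", "Lex"),
                      sex = c("F", "M", "F", "M"),
                      age = c(41, 43, 36, 51),
                      stringsAsFactors = FALSE)
children <- data.frame(parent = c("Sarah", "Max", "Qin"),
                       name = c("Oliver", "Sebastian", "Kai-lee"),
                       sex = c("M", "M", "F"),
                       age = c(5, 8, 7),
                       stringsAsFactors = FALSE)

res <- merge_with_key_suffixes(parents, children,
                               by.x = "name", by.y = "parent")
print(names(res))
```

In this sketch `res[["name"]]` is NULL, because the key column is now called "name.x". That is the kind of change Frederick suggests could break existing code that looks up the key column under its original name.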