Antoine Fabri
2021-Jan-02 19:15 UTC
[Rd] `merge()` not consistent in how it treats list columns
Dear R-devel, When trying to merge 2 data frames by an "id" column, with this column a character in one of them, and a list of character in the other, merge behaves differently depending which is given first. Example : ``` df1 <- data.frame(a=1) df2 <- data.frame(b=2) df1$id <- "ID" df2$id <- list("ID") # these print in a similar way, so the upcoming error will be hard to diagnose df1 #> a id #> 1 1 ID df2 #> b id #> 1 2 ID # especially as this works well, df2$id is treated as an atomic vector merge(df1, df2) #> id a b #> 1 ID 1 2 # But this fails with a cryptic error message merge(df2, df1) #> Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method "shell" and "quick" #> Have you called 'sort' on a list? ``` I believe that if we let it work one way it should work the other, and that if it works neither an explicit error mentioning how we can't join by list column would be helpful. Many thanks and happy new year to all the R community, Antoine [[alternative HTML version deleted]]
Avi Gross
2021-Jan-02 20:56 UTC
[Rd] `merge()` not consistent in how it treats list columns
Antoine, Have you considered converting the non-list to a list explicitly so this does not matter? For a long time, few people used lists in this context, albeit in the tidyverse it is now better supported and probably more common. This is an area many have found annoying when you have implicit conversions. What if one ID field was character and the other was numeric? In some languages the conversion always goes to character (as in R) but in some it might go numeric in one direction and in some it may refuse and demand you convert it yourself. Do you suggest that a unique solution exists for complex cases so that the software should know you want to convert a vector to list? What if one side is a list containing a list containing a list, many levels deep and the other has no or fewer or more levels. Is it obvious to take the deepest case and change all others to match? Do you lose things in the process? When things may not work, sure you can suggest someone change, but you can consider it as a case where YOU should make sure the types are compatible before a merge. -----Original Message----- From: R-devel <r-devel-bounces at r-project.org> On Behalf Of Antoine Fabri Sent: Saturday, January 2, 2021 2:16 PM To: R-devel <r-devel at r-project.org> Subject: [Rd] `merge()` not consistent in how it treats list columns Dear R-devel, When trying to merge 2 data frames by an "id" column, with this column a character in one of them, and a list of character in the other, merge behaves differently depending which is given first. Example : ``` df1 <- data.frame(a=1) df2 <- data.frame(b=2) df1$id <- "ID" df2$id <- list("ID") # these print in a similar way, so the upcoming error will be hard to diagnose df1 #> a id #> 1 1 ID df2 #> b id #> 1 2 ID # especially as this works well, df2$id is treated as an atomic vector merge(df1, df2) #> id a b #> 1 ID 1 2 # But this fails with a cryptic error message merge(df2, df1) #> Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method "shell" and "quick" #> Have you called 'sort' on a list? ``` I believe that if we let it work one way it should work the other, and that if it works neither an explicit error mentioning how we can't join by list column would be helpful. Many thanks and happy new year to all the R community, Antoine [[alternative HTML version deleted]] ______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Gabriel Becker
2021-Jan-03 02:58 UTC
[Rd] `merge()` not consistent in how it treats list columns
Hi Antoine, On Sat, Jan 2, 2021 at 11:16 AM Antoine Fabri <antoine.fabri at gmail.com> wrote:> Dear R-devel, > > When trying to merge 2 data frames by an "id" column, with this column a > character in one of them, and a list of character in the other, merge > behaves differently depending which is given first. > > Example : > > ``` > df1 <- data.frame(a=1) > df2 <- data.frame(b=2) > df1$id <- "ID" > df2$id <- list("ID") > > # these print in a similar way, so the upcoming error will be hard to > diagnose > df1 > #> a id > #> 1 1 ID > df2 > #> b id > #> 1 2 ID > > # especially as this works well, df2$id is treated as an atomic vector > merge(df1, df2) > #> id a b > #> 1 ID 1 2 >Well, sure but that is because it happens to be a list with each element having length one. In which case, it really should not have been a list at all, and the fact that it was seems a deeper problem that should likely be resolved instead of treating the symptom, in my opinion. > df1 <- data.frame(a=1)> df2 <- data.frame(b=2)> df1$id <- "ID"> df2$id <- list(c("ID", "ID2"))> merge(df1, df2)[1] id a b <0 rows> (or 0-length row.names) Thats probably not what you wanted it to do, right? Or maybe it is, it depends, right?. And therein lies the rub. I have to be honest, as a developer, I really wish this, even in your example case, threw an error. Anything else just looks to me like a debugging nightmare looming in the wings waiting to strike.> # But this fails with a cryptic error message > merge(df2, df1) > #> Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method > "shell" and "quick" > #> Have you called 'sort' on a list? > ``` > > I believe that if we let it work one way it should work the other, and that > if it works neither an explicit error mentioning how we can't join by list > column would be helpful. >There's no reason (in principle) you wouldn't be able to join by a list column, they should just both have to be list columns, in my ideal (but admittedly unlikely) world. Id rather the atomic-vector/list mismatch case throw an error, myself. Now I kind of doubt we can change the behavior that works now, but as Avi points out, I think this is something that is complicated and case specific enough that it really ought to be your job as the coder to take care of what should happen when you try to merge on columns that are fundamentally different types. Plus, having an id column as a list, unless it was really explicitly intentional, seems very likely to be a bug to me. (I mean id column in the you want to use it to merge things, not the fact that itw as called "id", though admittedly those are likely to go together... Best, ~G> > Many thanks and happy new year to all the R community, > > Antoine > > [[alternative HTML version deleted]] > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >[[alternative HTML version deleted]]