Simon Anders
2008-Mar-14 17:16 UTC
[Rd] 'merge' function: behavior w.r.t. NAs in the key column
Hi,

I recently ran into a problem with 'merge' that stems from the way missing values in the key column (i.e., the column specified in the "by" argument) are handled. I wonder whether the current behavior is fully consistent. Please have a look at this example:

> x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 )
> y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 )
> x
  key val
1   1  11
2   2  12
3   3  13
4   3  14
5  NA  15
6  NA  16
> y
  key val
1  NA  21
2   2  22
3   3  23
4   4  24
5   5  25
6   3  26
7  NA  27
> merge( x, y, by="key" )
  key val.x val.y
1   2    12    22
2   3    13    23
3   3    13    26
4   3    14    23
5   3    14    26
6  NA    15    21
7  NA    15    27
8  NA    16    21
9  NA    16    27

As one should expect, there are now four lines with key value '3', because the key '3' appears twice both in x and in y. According to the logic of merge, a row should be produced in the output for each pairing of a row from x and a row from y where the values of 'key' are equal.

However, the 'NA' values are treated exactly the same way. It seems that 'merge' considers the pairing of lines with 'NA' in both 'key' columns an allowed match. IMHO, this runs against the convention that two NAs are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.)

It might be more consistent if merge did not include in the output any rows with an "NA" in the key column.

Maybe one could add a flag argument to 'merge' to switch between this behaviour and the current one? A note in the help page might be nice, too.

Best regards
  Simon

+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| office phone +44-1223-494478, mobile phone +44-7505-841692
| preferred (permanent) e-mail: sanders at fs.tum.de
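The inconsistency described in this message can be reproduced in a few lines. The sketch below reuses the x and y from the post; the names na_eq, na_pos and m are introduced here only for illustration. It relies on the documented behaviour of match(), which treats NA as matching NA, in contrast to the '==' operator.

```r
# Comparing NA with '==' yields NA, not TRUE.
na_eq <- NA == NA

# match(), by contrast, treats NA as matching NA:
# the NA is found at position 2 of the table.
na_pos <- match(NA, c(1, NA))

# The data frames from the post.
x <- data.frame(key = c(1:3, 3, NA, NA), val = 10 + 1:6)
y <- data.frame(key = c(NA, 2:5, 3, NA), val = 20 + 1:7)

m <- merge(x, y, by = "key")

# Two NA keys in x paired with two NA keys in y give 2 * 2 = 4 rows,
# in the same way that the duplicated key 3 gives four rows.
n_na_rows <- sum(is.na(m$key))
n_key3_rows <- sum(m$key == 3, na.rm = TRUE)
```

Here n_na_rows and n_key3_rows both come out as 4, which is the symmetry between the '3' rows and the 'NA' rows that the post points out.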
Bill Dunlap
2008-Mar-14 23:57 UTC
[Rd] 'merge' function: behavior w.r.t. NAs in the key column
On Fri, 14 Mar 2008, Simon Anders wrote:

> I recently ran into a problem with 'merge' that stems from the way
> missing values in the key column (i.e., the column specified
> in the "by" argument) are handled. I wonder whether the current behavior
> is fully consistent.
> ...
> > x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 )
> > y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 )
> ...
> > merge( x, y, by="key" )
>   key val.x val.y
> 1   2    12    22
> 2   3    13    23
> 3   3    13    26
> 4   3    14    23
> 5   3    14    26
> 6  NA    15    21
> 7  NA    15    27
> 8  NA    16    21
> 9  NA    16    27
>
> As one should expect, there are now four lines with key value '3',
> because the key '3' appears twice both in x and in y. According to the
> logic of merge, a row should be produced in the output for each pairing
> of a row from x and a row from y where the values of 'key' are equal.
>
> However, the 'NA' values are treated exactly the same way. It seems that
> 'merge' considers the pairing of lines with 'NA' in both 'key' columns
> an allowed match. IMHO, this runs against the convention that two NAs
> are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.)
>
> It might be more consistent if merge did not include in the output
> any rows with an "NA" in the key column.
>
> Maybe one could add a flag argument to 'merge' to switch between this
> behaviour and the current one? A note in the help page might be nice, too.

Splus (versions 8.0, 7.0, and 6.2) gives:

> merge( x, y, by="key" )
  key val.x val.y
1   2    12    22
2   3    13    23
3   3    14    23
4   3    13    26
5   3    14    26

Is that what you expect? There is no argument to Splus's merge to make it include the NAs in the way R's merge does. Should there be such an argument?

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com

"All statements in this message represent the opinions of the author and do not necessarily reflect Insightful Corporation policy or position."
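The Splus-style result shown above (no rows with an NA key) can be obtained in R in at least two ways; the following is a sketch, not something proposed in the thread itself. The first variant assumes an R version whose merge() documents an 'incomparables' argument for single-column keys; the thread does not say whether the R version under discussion had it. The second variant simply drops NA keys before merging and should work in any R version. The names m_no_na and m_filtered are made up for this example.

```r
# The data frames from the original post.
x <- data.frame(key = c(1:3, 3, NA, NA), val = 10 + 1:6)
y <- data.frame(key = c(NA, 2:5, 3, NA), val = 20 + 1:7)

# Variant 1 (assumes merge() supports 'incomparables'):
# declare NA as a key value that must never be matched.
m_no_na <- merge(x, y, by = "key", incomparables = NA)

# Variant 2 (version-independent): remove rows with a missing key
# from both inputs, then merge as usual.
m_filtered <- merge(x[!is.na(x$key), ],
                    y[!is.na(y$key), ],
                    by = "key")

print(m_no_na)
print(m_filtered)
```

Both variants are inner joins, so rows with an NA key disappear entirely. With all = TRUE the two approaches would differ, because filtering removes the NA rows outright, while 'incomparables' only prevents them from being paired with each other.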