Eric DeWitt
2013-May-15 23:58 UTC
[R] Unexpected behavior using `merge' by multiple columns
I recently started using `merge()' to combine data frames that contained different conditioned subsets of the same master data frame. In some conditions the `by' columns had `NA' values and I was using `incomparables=c(NA, NaN)' to avoid including these rows in the resulting merge. However, the behavior never appeared to match the documentation. I continued to received partial matching on columns that included `NA's. I attempted to understand the expected behavior by looking at the example from the documentation and it appears to me to have the same problem. Am I misinterpreting the documentation or is the behavior inconsistent? Below is a reproducible example from the `merge' documentation. The modified version run at the end uses the fix suggested in the example. Best, Eric ### base::merge potentially unexpected behavior # The `merge' argument `incomparables' does not appear to behave in the manner # described in the documentation. Including `NA' in comparables does not prevent # NA matching. # example from the documentation: ## example of using `incomparables' x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5) y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5) merge(x, y, by = c("k1","k2")) # NA's match k1 k2 data.x data.y 1 4 4 4 4 2 5 5 5 5 3 NA NA 2 1 merge(x, y, by = c("k1","k2"), incomparables = NA) k1 k2 data.x data.y 1 4 4 4 4 2 5 5 5 5 3 NA NA 2 1 merge(x, y, by = "k1") # NA's match, so 6 rows k1 k2.x data.x k2.y data.y 1 4 4 4 4 4 2 5 5 5 5 5 3 NA 1 1 NA 1 4 NA 1 1 3 3 5 NA NA 2 NA 1 6 NA NA 2 3 3 merge(x, y, by = "k2", incomparables = NA) # 2 rows k2 k1.x data.x k1.y data.y 1 4 4 4 4 4 2 5 5 5 5 5 # Observing that when merging on two columns the `incomparables' appears to have # no effect, it appears that the problem is located in the creation of the # common index into the two data frames. debugging merge reveals that the # problem is in the construction via paste which converts `NA' to "NA": # # debug: bz <- do.call("paste", c(rbind(bx, by), sep = "\r")) # Browse[2]bz # [1] "NA\r1" "NA\rNA" "3\rNA" "4\r4" "5\r5" "NA\rNA" "2\rNA" "NA\r3" # [9] "4\r4" "5\r5" # debug: bx <- bz[seq_len(nx)] # debug: by <- bz[nx + seq_len(ny)] # Browse[2]bx # [1] "NA\r1" "NA\rNA" "3\rNA" "4\r4" "5\r5" # Browse[2]by # [1] "NA\rNA" "2\rNA" "NA\r3" "4\r4" "5\r5" # # which produces "NA\rNA" matches # # debug: comm <- match(bx, by, 0L) # Browse[2]comm # [1] 0 1 0 4 5 # # Given that this appears to the be intended behavior of `paste', the solution # appears to require that the elements in `incomparables' be removed from the # rows after to the paste operation. The following is an example (perhaps # inefficient) that would solve the problem: # # bx <- x[, by.x, drop = FALSE] # by <- y[, by.y, drop = FALSE] # names(bx) <- names(by) <- paste0("V", seq_len(ncol(bx))) # bz <- do.call("paste", c(rbind(bx, by), sep = "\r")) # bx <- bz[seq_len(nx)] # by <- bz[nx + seq_len(ny)] # bx[apply(is.na(x),1,any)] <- NA # by[apply(is.na(y),1,any)] <- NA # comm <- match(bx, by, 0L, incomparables) # # The resulting patched merge produces: source("merge.fixed.R") merge.fixed(x, y, by = c("k1","k2")) # NA's match k1 k2 data.x data.y 1 4 4 4 4 2 5 5 5 5 3 NA 1 1 1 4 NA 1 1 2 5 NA 1 1 3 6 NA NA 2 1 7 NA NA 2 2 8 NA NA 2 3 9 3 NA 3 1 10 3 NA 3 2 11 3 NA 3 3 merge.fixed(x, y, by = c("k1","k2"), incomparables = NA) k1 k2 data.x data.y 1 4 4 4 4 2 5 5 5 5 merge.fixed(x, y, by = "k1") # NA's match, so 6 rows k1 k2.x data.x k2.y data.y 1 4 4 4 4 4 2 5 5 5 5 5 3 NA 1 1 NA 1 4 NA 1 1 3 3 5 NA NA 2 NA 1 6 NA NA 2 3 3 merge.fixed(x, y, by = "k2", incomparables = NA) # 2 rows k2 k1.x data.x k1.y data.y 1 4 4 4 4 4 2 5 5 5 5 5 # This appears to match the documented behavior. # R version version _ platform x86_64-apple-darwin10.8.0 arch x86_64 os darwin10.8.0 system x86_64, darwin10.8.0 status major 3 minor 0.0 year 2013 month 04 day 03 svn rev 62481 language R version.string R version 3.0.0 (2013-04-03) nickname Masked Marvel sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base