Greetings. I was reading through the "tidy-data" vignette (from the
"tidyr" package) and came across something that puzzled me.
One of the examples in the vignette uses a data set related to tuberculosis,
originally from the World Health Organization, but also available at:
https://github.com/hadley/tidy-data/blob/master/data/tb.csv
Here's the code:
+++++
> library(dplyr) #### for tbl_df
> library(tidyr) #### for gather
> tb <- tbl_df(read.csv("tb.csv", stringsAsFactors=FALSE))
> tb2 <- tb %>%
+ gather(demo, n, -iso2, -year, na.rm=TRUE)
> str(tb2)
Classes 'tbl_df', 'tbl' and 'data.frame': 35750 obs. of 4 variables:
$ iso2: chr "AD" "AD" "AD" "AE" ...
$ year: int 2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
$ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
$ n : int 0 0 0 0 0 0 0 0 1 0 ...
-----
I thought it might be interesting to see how to do this using the
"reshape2" package. Here's the code for that:
+++++
library(reshape2)
tb2a <- tb %>%
melt(
id.vars=c("iso2", "year"),
variable.name="demo",
value.name="n",
na.rm=TRUE)
tb2a <- tbl_df(tb2a)
> str(tb2a)
Classes 'tbl_df', 'tbl' and 'data.frame': 35750 obs. of 4 variables:
$ iso2: chr "AD" "AD" "AD" "AE" ...
$ year: int 2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
$ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
$ n : int 0 0 0 0 0 0 0 0 1 0 ...
-----
The "str" results make it appear that I'm on the right track, but it's
always good to double check:
+++++
> all.equal(tb2, tb2a)
[1] "Rows in x but not y: 34659, 34658, 34656, 34655, 34651, 34650, 34649,
34648, 34647, 34646, 32264[...]Rows in y but not x: 35663, 34658, 34657,
34656, 34655, 34652, 34651, 34650, 34649, 32265,
32264[...]"
-----
Hmm. Not what I'd hoped for, but none of the simple, visual checks I ran
showed any differences. After a little trial and error, I found the place
where the results differ:
+++++
> ROWS <- 2356
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] TRUE
> ROWS <- 2357
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] "Rows in x but not y: 2357Rows in y but not x: 2357"
-----
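The trial-and-error search could also be automated. Here is a minimal sketch, assuming both objects have the same number of rows and that once a prefix of rows compares unequal, every longer prefix does too. The helper name first_unequal_prefix is made up for illustration; it is not part of dplyr or base R. It calls whatever all.equal method applies to its arguments, so with dplyr loaded and tbl_df inputs it would presumably use the same comparison as above.

```r
# Hypothetical helper: smallest n for which the first n rows of x and y
# are no longer reported as equal. Assumes that once a prefix fails,
# every longer prefix fails too, so bisection is valid.
first_unequal_prefix <- function(x, y,
                                 equal = function(a, b) isTRUE(all.equal(a, b))) {
  stopifnot(nrow(x) == nrow(y))
  n <- nrow(x)
  if (n == 0 || equal(x, y)) return(NA_integer_)  # no difference found
  lo <- 0L  # longest prefix known to compare equal (0 rows: trivially)
  hi <- n   # shortest prefix known to compare unequal
  while (hi - lo > 1L) {
    mid <- (lo + hi) %/% 2L
    if (equal(x[seq_len(mid), , drop = FALSE], y[seq_len(mid), , drop = FALSE])) {
      lo <- mid
    } else {
      hi <- mid
    }
  }
  hi
}

# Toy example with plain data frames (made-up values, not the tb data)
a <- data.frame(id = 1:6, v = c(1, 2, 3, 4, 5, 6))
b <- data.frame(id = 1:6, v = c(1, 2, 3, 40, 5, 6))
first_unequal_prefix(a, b)  # row 4 is the first one that differs
```

Bisection needs only about log2(nrow) comparisons, which matters when each all.equal call on a large table is slow.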
OK, let's have a look at the spot where things go off the rails:
+++++
> tb2[2357, ]
Source: local data frame [1 x 4]
iso2 year demo n
1 NA 1995 m014 0
> tb2a[2357, ]
Source: local data frame [1 x 4]
iso2 year demo n
1 NA 1995 m014 0
-----
Things certainly *look* the same, but:
+++++
> all.equal(tb2[2357, ], tb2a[2357, ])
[1] "Rows in x but not y: 1Rows in y but not x: 1"
-----
If you guessed that it's the NA that's the source of the problem, you're
evidently correct:
+++++
> head(which(is.na(tb2[ , "iso2"])))
[1] 2357 2358 2359 2360 2361 2362
-----
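One guess about where these missing values come from, offered as an assumption and not checked against tb.csv itself: "NA" is the ISO 3166-1 alpha-2 code for Namibia, and read.csv turns the literal string "NA" into a missing value by default (na.strings = "NA"). A minimal sketch using made-up inline data in place of the real file:

```r
# Made-up three-row CSV; the middle row uses the country code "NA"
csv_text <- "iso2,year,m014\nMZ,1995,3\nNA,1995,0\nNE,1995,1"

# Default behaviour: the string "NA" is read as a missing value
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.na(default_read$iso2)

# Treating only empty fields as missing keeps "NA" as an ordinary string
kept_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                      na.strings = "")
kept_read$iso2
```

If that guess is right, reading the file with a different na.strings setting would avoid the missing values in iso2 altogether, though it would not explain why the two comparison methods disagree.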
But I don't understand what the problem is. The "all.equal" function does
appear to deal appropriately with NA's. Here's a trivial example:
+++++
> library(pryr)
Attaching package: 'pryr'
The following object is masked from 'package:dplyr':
%.%
> foo <- c(3, NA, 7)
> bar <- c(3, NA, 7)
> address(foo) #### note that foo and bar are distinct objects
[1] "0x422c278"
> address(bar)
[1] "0x4953188"
> all.equal(foo, bar) #### but they're still equal, even with NA
[1] TRUE
-----
And just to be sure, I checked that these really are NA's in foo and bar:
+++++
> any(is.na(foo))
[1] TRUE
> any(is.na(bar))
[1] TRUE
-----
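For contrast, here is a small base-R sketch of how three common comparisons treat NA in plain numeric vectors. It only covers the base methods, not whatever method is used for tbl_df objects.

```r
foo <- c(3, NA, 7)
bar <- c(3, NA, 7)

elementwise <- foo == bar              # NA in the second position, not TRUE
same_object <- identical(foo, bar)     # TRUE: NAs in matching positions count as equal
near_equal  <- isTRUE(all.equal(foo, bar))  # TRUE, as in the transcript above
```

So "equal" already means different things within base R, depending on which function does the comparing.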
It finally occurred to me to strip off the extra class attributes and do the
comparison:
+++++
> all.equal(data.frame(tb2), data.frame(tb2a))
[1] TRUE
-----
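A possible reason that the conversion changes the answer: all.equal is an S3 generic, so the method that runs depends on the class of its first argument. With the tbl_df class present, a method supplied by dplyr is presumably used, while the output of data.frame() falls through to the base methods. The sketch below illustrates only the dispatch mechanism. The class strict_df and its method are hypothetical and deliberately artificial; they do not reproduce dplyr's logic.

```r
# Hypothetical class whose all.equal method reports any NA as a mismatch
all.equal.strict_df <- function(target, current, ...) {
  if (any(is.na(target)) || any(is.na(current))) {
    return("NA present: rows not comparable")
  }
  NextMethod()
}

x <- data.frame(code = c("AD", NA), n = c(0L, 1L), stringsAsFactors = FALSE)
y <- x
class(x) <- c("strict_df", class(x))
class(y) <- c("strict_df", class(y))

# Dispatches to all.equal.strict_df because of the extra class
res_classed <- all.equal(x, y)

# data.frame() drops the extra class, so the base methods are used
res_plain <- all.equal(data.frame(x), data.frame(y))
```

Here res_classed is a message string and res_plain is TRUE, even though x and y hold the same values in both calls.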
So this is evidently a "solution" to the problem, but I don't know what the
moral of the story is. If you have any insights, please pass 'em along.
Thanks.
-- Mike