
Michael Hannon
2015-Mar-21 23:31 UTC
[R] Dealing with NA in "tbl_df"?

Greetings.  I was reading through the vignette for "tidy-data" (from the
"tidyr" package) and came across something that puzzled me.

One of the examples in the vignette uses a data set related to tuberculosis,
originally from the World Health Organization, but also available at:

  https://github.com/hadley/tidy-data/blob/master/data/tb.csv

Here's the code:

+++++
> library(dplyr)  #### for tbl_df
> library(tidyr)  #### for gather
> tb <- tbl_df(read.csv("tb.csv", stringsAsFactors=FALSE))
> tb2 <- tb %>%
+     gather(demo, n, -iso2, -year, na.rm=TRUE)
> str(tb2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 35750 obs. of  4 variables:
 $ iso2: chr  "AD" "AD" "AD" "AE" ...
 $ year: int  2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
 $ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ n   : int  0 0 0 0 0 0 0 0 1 0 ...
-----

I thought it might be interesting to see how to do this using the "reshape2"
package.  Here's the code for that:

+++++

library(reshape2)

tb2a <- tb %>%
    melt(
        id.vars=c("iso2", "year"),
        variable.name="demo",
        value.name="n",
        na.rm=TRUE)
tb2a <- tbl_df(tb2a)
> str(tb2a)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 35750 obs. of  4 variables:
 $ iso2: chr  "AD" "AD" "AD" "AE" ...
 $ year: int  2005 2006 2008 2006 2007 2008 2007 2005 2006 2007 ...
 $ demo: Factor w/ 20 levels "m04","m514","m014",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ n   : int  0 0 0 0 0 0 0 0 1 0 ...
-----


The "str" results make it appear that I'm on the right track, but it's always
good to double check:

+++++
> all.equal(tb2, tb2a)
[1] "Rows in x but not y: 34659, 34658, 34656, 34655, 34651, 34650, 34649,
34648, 34647, 34646, 32264[...]Rows in y but not x: 35663, 34658, 34657,
34656, 34655, 34652, 34651, 34650, 34649, 32265, 32264[...]"
-----

Hmm.  Not what I'd hoped for, but none of the simple visual checks I tried
showed any differences.  After a little trial and error, I found the place where
the results differ:

+++++
> ROWS <- 2356
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] TRUE
> ROWS <- 2357
> all.equal(tb2[1:ROWS, ], tb2a[1:ROWS, ])
[1] "Rows in x but not y: 2357Rows in y but not x: 2357"

-----
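
The trial-and-error search above can be replaced by a bisection over row
prefixes.  The following is only a sketch in base R: the names
"first_bad_prefix" and "eq" are hypothetical, and it assumes both data frames
have the same number of rows and that once a prefix compares unequal, every
longer prefix does too.

```r
# Find the smallest n such that the first n rows of a and b compare unequal.
# 'eq' is a hypothetical comparison hook; by default it wraps base all.equal().
first_bad_prefix <- function(a, b,
                             eq = function(x, y) isTRUE(all.equal(x, y))) {
  stopifnot(nrow(a) == nrow(b))
  if (eq(a, b)) return(NA_integer_)     # nothing to find
  lo <- 0L                              # longest prefix known to compare equal
  hi <- nrow(a)                         # shortest prefix known to compare unequal
  while (hi - lo > 1L) {
    mid <- (lo + hi) %/% 2L
    rows <- seq_len(mid)
    if (eq(a[rows, , drop = FALSE], b[rows, , drop = FALSE])) {
      lo <- mid
    } else {
      hi <- mid
    }
  }
  hi
}

# Toy data (not the WHO data set): the two frames differ only in row 7.
toy_a <- data.frame(v = 1:10)
toy_b <- toy_a
toy_b$v[7] <- 99L
bad_row <- first_bad_prefix(toy_a, toy_b)
```

With the objects from this post one would call first_bad_prefix(tb2, tb2a);
because "eq" calls the all.equal generic, it uses whichever method applies to
the class of the inputs.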

OK, let's have a look at the spot where things go off the rails:

+++++
> tb2[2357, ]
Source: local data frame [1 x 4]

  iso2 year demo n
1   NA 1995 m014 0
> tb2a[2357, ]
Source: local data frame [1 x 4]

  iso2 year demo n
1   NA 1995 m014 0
-----

Things certainly *look* the same, but:

+++++
> all.equal(tb2[2357, ], tb2a[2357, ])
[1] "Rows in x but not y: 1Rows in y but not x: 1"
-----

If you guessed that it's the NA that's the source of the problem, you're
evidently correct:

+++++
> head(which(is.na(tb2[ , "iso2"])))
[1] 2357 2358 2359 2360 2361 2362
-----
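
A side note on where these NA values may come from (a guess about the data,
not something confirmed in this post): iso2 holds two-letter country codes,
"NA" is the ISO 3166-1 code for Namibia, and read.csv by default treats the
string NA as a missing value.  The sketch below uses a tiny made-up CSV, not
the real tb.csv, to show the default behaviour and one way to keep the literal
string.

```r
# Hypothetical stand-in for tb.csv: two made-up rows, one with country code "NA".
csv_text <- "iso2,year,m04\nAD,2005,0\nNA,1995,0\n"

# Default: na.strings = "NA", so the country code is read as a missing value.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings = "" the literal string "NA" is kept in character columns.
literal_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                         na.strings = "")

is.na(default_read$iso2)   # second element is a true missing value
literal_read$iso2          # second element is the string "NA"
```

Whether this is appropriate for the real file depends on how genuinely missing
values are encoded there, so treat it as something to check rather than a fix.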

But I don't understand what the problem is.  The "all.equal" function does
appear to deal appropriately with NA's.  Here's a trivial example:

+++++
> library(pryr)
Attaching package: ‘pryr’

The following object is masked from ‘package:dplyr’:

    %.%
> foo <- c(3, NA, 7)
> bar <- c(3, NA, 7)
> address(foo)  #### note that foo and bar are distinct objects
[1] "0x422c278"
> address(bar)
[1] "0x4953188"
> all.equal(foo, bar)  #### but they're still equal, even with NA
[1] TRUE
-----

And just to be sure, I checked that these really are NA's in foo and bar:

+++++
> any(is.na(foo))
[1] TRUE
> any(is.na(bar))
[1] TRUE
-----

It finally occurred to me to strip off the extra class attributes and do the
comparison:

+++++
> all.equal(data.frame(tb2), data.frame(tb2a))
[1] TRUE
-----
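
One way to read this result: all.equal is an S3 generic, so the class of its
first argument decides which method runs.  The "Rows in x but not y" messages
suggest that a tbl_df-specific method was used for tb2 and tb2a, while
data.frame() drops the extra classes so that base R's data frame method is
used instead.  The sketch below shows the mechanism in base R only; the class
"strict_df" and its method are hypothetical and are not dplyr's actual
implementation.

```r
# Two identical data frames that contain an NA, tagged with a hypothetical class.
x <- data.frame(id = c("a", NA), n = c(1L, 2L), stringsAsFactors = FALSE)
class(x) <- c("strict_df", class(x))
y <- x

# Hypothetical method: unlike base R, it refuses to match rows that contain NA.
all.equal.strict_df <- function(target, current, ...) {
  if (any(is.na(target)) || any(is.na(current))) {
    "rows containing NA are treated as unmatched"
  } else {
    TRUE
  }
}

with_class    <- all.equal(x, y)                          # uses all.equal.strict_df
without_class <- all.equal(data.frame(x), data.frame(y))  # uses all.equal.data.frame
```

Here with_class is a character message and without_class is TRUE, even though
the data are the same in both calls; only the method chosen by dispatch
differs.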

So this is evidently a "solution" to the problem, but I don't know what the
moral of the story is.  If you have any insights, please pass 'em along.

Thanks.

-- Mike