Note: This is a minor comment on a recent thread, and neither a query
nor an answer to a query.
----
In a recent discussion thread here, it was asked how to compute the
counts for the number of complete cases that are used for computing
the entries in a correlation matrix obtained from
cor(X, use = "pairwise.complete.obs")
when there are missing values (i.e. NA's) in X.
(Whether it is wise to do this is another issue; but here it just
motivates this post).
As part of his solution, John Fox provided the following idiom for
computing the number of complete cases == rows without NA's from pairs
of columns of a matrix Z when Z has NA's. For columns i and j, the
number of rows without NA's is nrow(na.omit(Z[, c(i, j)])). This
clearly works, because na.omit() is a generic (S3) function designed
to omit rows with NA's in matrix-like objects, and nrow() then just
counts the rows remaining, which is exactly what is needed.
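For concreteness, here is one way this idiom might be wrapped up to
answer the original question, i.e. to produce the full matrix of
pairwise complete-case counts. The function name and the loop are my
own sketch, not John's code:

pairCounts <- function(Z) {
    p <- ncol(Z)
    n <- matrix(NA_integer_, p, p,
                dimnames = list(colnames(Z), colnames(Z)))
    for (i in seq_len(p)) for (j in i:p)
        n[i, j] <- n[j, i] <- nrow(na.omit(Z[, c(i, j)]))
    n
}

(The i == j case works too: Z[, c(i, i)] is a two-column matrix, so the
diagonal simply counts the non-NA entries of each column.)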
I would call this "an intended intended consequence", because John
used na.omit() exactly as it's intended to be used.
However, sometimes one can do "better" -- in this case in a
speed-of-execution sense -- by "misusing" functionality in a way that
is not intended. Instead of John's nrow(na.omit(...)), the idiom:
sum(!is.na(rowSums(Z[, c(i, j)])))
turns out to be considerably faster. Here's a little example that
illustrates the point:
> library(microbenchmark)
> Z <- matrix(0, ncol = 2, nrow = 10000)  ## 2 columns only for illustration
> is.na(Z) <- sample(seq_len(20000), 2000)  ## 10% NA's
> ## check that both methods give the same answer
> nrow(na.omit(Z))
[1] 8112
> sum(!is.na(rowSums(Z)))
[1] 8112
> ## timings ##
> print(microbenchmark(nrow(na.omit(Z)), times = 50), signif = 3)
Unit: microseconds
             expr min  lq mean median  uq max neval
 nrow(na.omit(Z)) 116 122  128    128 132 160    50
> # vs
> print(microbenchmark(sum(!is.na(rowSums(Z))), times = 50), signif = 3)
Unit: microseconds
                     expr min   lq mean median   uq  max neval
 sum(!is.na(rowSums(Z)))  28 28.9 32.1   32.4 33.5 41.3    50
So a median time of 128 microseconds for nrow(na.omit(...)) vs. 32 for
sum(!is.na(rowSums(...))), i.e. four times as fast. Why? -- the
na.omit approach does its looping at the interpreted R level, while
the sum(!is.na(...)) version does most of its work at the compiled C
level. There is a cost to this efficiency improvement, however: the
fast code is more opaque and thus harder to understand and maintain,
because it uses R's functionality in unintended ways, i.e. for
intended unintended consequences. As usual, the programmer must decide
whether the tradeoff is worthwhile; but it's nice to know when a
tradeoff exists.
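To make the tradeoff concrete: if the speed matters, only the inner
line of the pairCounts() sketch above needs to change, e.g.

    n[i, j] <- n[j, i] <- sum(!is.na(rowSums(Z[, c(i, j)])))

(again, the wrapper is just my sketch, not something from the thread).
The rest of the loop is unchanged, but the counting line is now harder
to read at a glance.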
=========================================================

For those who may be interested, here is a brief explanation of the
tricks used in the faster solution.
rowSums(Z) gives the sums by row in Z, and will give NA if a row
contains any NA's. Note that this yields just a single vector of NA's
and numeric values.
!is.na(rowSums(...)) then converts this to a logical vector: the NA's
become FALSE and the numeric values become TRUE.
But (TRUE, FALSE) is treated as (1, 0) by numeric operations, so
sum(...) just adds up the 1's, which is the same as counting the
TRUEs, i.e. the complete-case rows.
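A tiny worked example (the values are just made up for illustration)
shows each step of the chain:

> Z <- rbind(c(1, 2), c(NA, 4), c(5, 6))
> rowSums(Z)
[1]  3 NA 11
> !is.na(rowSums(Z))
[1]  TRUE FALSE  TRUE
> sum(!is.na(rowSums(Z)))
[1] 2

Two of the three rows are complete, which is what the final sum
reports.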
Cheers,
Bert