hugh.genin at thomsonreuters.com
2010-Jun-09 17:36 UTC
[R] bug? in stats::cor for use=complete.obs with NAs
Arrrrr, I think I've found a bug in the behavior of the stats::cor function when NAs are present, but in case I'm missing something, could you look over this example and let me know what you think:> a = c(1,3,NA,1,2) > b = c(1,2,1,1,4) > cor(a,b,method="spearman", use="complete.obs")[1] 0.8164966> cor(a,b,method="spearman", use="pairwise.complete.obs")[1] 0.7777778 My understanding is that, when the inputs are vectors (but not necessarily when they're matrices), the "complete.obs" and "pairwise.complete.obs" arguments should give identical spearman correlations. The above example clearly shows they do not in my version of R (2.11.1). However, in cor.test, they do:> cor.test(a,b,method="spearman", use="complete.obs")Spearman's rank correlation rho data: a and b S = 2.2222, p-value = 0.2222 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.7777778 So cor and cor.test do not agree, which seems very likely to be a bug. When calculating by hand, I also get 0.7777778. Additionally, when using an old version of R (2.5.0), both the complete.obs and pairwise.complete.obs versions give 0.7777778. Which strongly suggests either 2.5.0 or 2.11.1 has a bug in it. Is this a bug? If so, has it already been reported? (I found a related but confusing email thread from 2004 in the R archives, but I did not find the resolution to that bug report). Additional info: Platform = Windows XP> sessionInfo()R version 2.11.1 (2010-05-31) i386-pc-mingw32 locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.11.1> Sys.getlocale()[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" Thanks, --Hugh
I don't think that this would be considered a bug. The
reason for the discrepancy between use="complete.obs"
and use="pairwise.complete.obs" for the case of the
Spearman correlation of two vectors x, y is this:
"pairwise" does complete.cases(x,y) and then ranks;
this is also what's done in cor.test().
"complete" ranks first (keeping NAs via the
na.last="keep" argument to rank()) and then does
complete.cases(ranked.x,ranked.y) on the ranked data.
This can obviously lead to a different set of
ranks being correlated than those for "pairwise".
I must admit that I wasn't aware that R does this
and I don't know the rationale for it. The help page
says:
If use is "complete.obs" then missing values are
handled by casewise deletion ...
which is not clear on the order of ranking and
deletion, but further down the page:
Note that "spearman" basically computes cor(R(x), R(y))
(or cov(.,.)) where R(u) := rank(u, na.last="keep").
In the case of missing values, the ranks are calculated
depending on the value of use, either based on complete
observations, or based on pairwise completeness with
reranking for each pair.
I guess that this implies that, for "complete", the ranking
occurs before the casewise deletion (else why the
na.last="keep"?).
If anyone knows the rationale and/or can give a reference,
I'd be glad to receive such.
-Peter Ehlers
On 2010-06-09 11:36, hugh.genin at thomsonreuters.com
wrote:> Arrrrr,
>
> I think I've found a bug in the behavior of the stats::cor function
when
> NAs are present, but in case I'm missing something, could you look over
> this example and let me know what you think:
>
>
>> a = c(1,3,NA,1,2)
>> b = c(1,2,1,1,4)
>> cor(a,b,method="spearman", use="complete.obs")
> [1] 0.8164966
>> cor(a,b,method="spearman",
use="pairwise.complete.obs")
> [1] 0.7777778
>
> My understanding is that, when the inputs are vectors (but not
> necessarily when they're matrices), the "complete.obs" and
> "pairwise.complete.obs" arguments should give identical spearman
> correlations. The above example clearly shows they do not in my version
> of R (2.11.1). However, in cor.test, they do:
>
>
>> cor.test(a,b,method="spearman", use="complete.obs")
>
> Spearman's rank correlation rho
>
> data: a and b
> S = 2.2222, p-value = 0.2222
> alternative hypothesis: true rho is not equal to 0
> sample estimates:
> rho
> 0.7777778
>
>
> So cor and cor.test do not agree, which seems very likely to be a bug.
> When calculating by hand, I also get 0.7777778. Additionally, when
> using an old version of R (2.5.0), both the complete.obs and
> pairwise.complete.obs versions give 0.7777778. Which strongly suggests
> either 2.5.0 or 2.11.1 has a bug in it. Is this a bug? If so, has it
> already been reported? (I found a related but confusing email thread
> from 2004 in the R archives, but I did not find the resolution to that
> bug report).
>
>
> Additional info:
> Platform = Windows XP
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C LC_TIME=English_United
> States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.1
>> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> Thanks,
>
> --Hugh
>
Maybe Matching Threads
- Incorrect handling of NA's in cor() (PR#6750)
- Wrong result with cor(x, y, method="spearman", use="complete.obs") with NA's???
- strange behavior of cor() with pairwise.complete.obs
- how to see inbuilt function(cor.test) & how to get p-value from t-value(test of significance) ?
- how to determine/assign a numeric vector to "Y" in the cor.test function for spearman's correlations?