hugh.genin at thomsonreuters.com
2010-Jun-09 17:36 UTC
[R] bug? in stats::cor for use=complete.obs with NAs
Arrrrr, I think I've found a bug in the behavior of the stats::cor function when NAs are present, but in case I'm missing something, could you look over this example and let me know what you think:> a = c(1,3,NA,1,2) > b = c(1,2,1,1,4) > cor(a,b,method="spearman", use="complete.obs")[1] 0.8164966> cor(a,b,method="spearman", use="pairwise.complete.obs")[1] 0.7777778 My understanding is that, when the inputs are vectors (but not necessarily when they're matrices), the "complete.obs" and "pairwise.complete.obs" arguments should give identical spearman correlations. The above example clearly shows they do not in my version of R (2.11.1). However, in cor.test, they do:> cor.test(a,b,method="spearman", use="complete.obs")Spearman's rank correlation rho data: a and b S = 2.2222, p-value = 0.2222 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.7777778 So cor and cor.test do not agree, which seems very likely to be a bug. When calculating by hand, I also get 0.7777778. Additionally, when using an old version of R (2.5.0), both the complete.obs and pairwise.complete.obs versions give 0.7777778. Which strongly suggests either 2.5.0 or 2.11.1 has a bug in it. Is this a bug? If so, has it already been reported? (I found a related but confusing email thread from 2004 in the R archives, but I did not find the resolution to that bug report). Additional info: Platform = Windows XP> sessionInfo()R version 2.11.1 (2010-05-31) i386-pc-mingw32 locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.11.1> Sys.getlocale()[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" Thanks, --Hugh
I don't think that this would be considered a bug. The reason for the discrepancy between use="complete.obs" and use="pairwise.complete.obs" for the case of the Spearman correlation of two vectors x, y is this: "pairwise" does complete.cases(x,y) and then ranks; this is also what's done in cor.test(). "complete" ranks first (keeping NAs via the na.last="keep" argument to rank()) and then does complete.cases(ranked.x,ranked.y) on the ranked data. This can obviously lead to a different set of ranks being correlated than those for "pairwise". I must admit that I wasn't aware that R does this and I don't know the rationale for it. The help page says: If use is "complete.obs" then missing values are handled by casewise deletion ... which is not clear on the order of ranking and deletion, but further down the page: Note that "spearman" basically computes cor(R(x), R(y)) (or cov(.,.)) where R(u) := rank(u, na.last="keep"). In the case of missing values, the ranks are calculated depending on the value of use, either based on complete observations, or based on pairwise completeness with reranking for each pair. I guess that this implies that, for "complete", the ranking occurs before the casewise deletion (else why the na.last="keep"?). If anyone knows the rationale and/or can give a reference, I'd be glad to receive such. -Peter Ehlers On 2010-06-09 11:36, hugh.genin at thomsonreuters.com wrote:> Arrrrr, > > I think I've found a bug in the behavior of the stats::cor function when > NAs are present, but in case I'm missing something, could you look over > this example and let me know what you think: > > >> a = c(1,3,NA,1,2) >> b = c(1,2,1,1,4) >> cor(a,b,method="spearman", use="complete.obs") > [1] 0.8164966 >> cor(a,b,method="spearman", use="pairwise.complete.obs") > [1] 0.7777778 > > My understanding is that, when the inputs are vectors (but not > necessarily when they're matrices), the "complete.obs" and > "pairwise.complete.obs" arguments should give identical spearman > correlations. The above example clearly shows they do not in my version > of R (2.11.1). However, in cor.test, they do: > > >> cor.test(a,b,method="spearman", use="complete.obs") > > Spearman's rank correlation rho > > data: a and b > S = 2.2222, p-value = 0.2222 > alternative hypothesis: true rho is not equal to 0 > sample estimates: > rho > 0.7777778 > > > So cor and cor.test do not agree, which seems very likely to be a bug. > When calculating by hand, I also get 0.7777778. Additionally, when > using an old version of R (2.5.0), both the complete.obs and > pairwise.complete.obs versions give 0.7777778. Which strongly suggests > either 2.5.0 or 2.11.1 has a bug in it. Is this a bug? If so, has it > already been reported? (I found a related but confusing email thread > from 2004 in the R archives, but I did not find the resolution to that > bug report). > > > Additional info: > Platform = Windows XP >> sessionInfo() > R version 2.11.1 (2010-05-31) > i386-pc-mingw32 > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > > loaded via a namespace (and not attached): > [1] tools_2.11.1 >> Sys.getlocale() > [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United > States.1252;LC_MONETARY=English_United > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" > > Thanks, > > --Hugh >
Apparently Analagous Threads
- Incorrect handling of NA's in cor() (PR#6750)
- Wrong result with cor(x, y, method="spearman", use="complete.obs") with NA's???
- strange behavior of cor() with pairwise.complete.obs
- how to see inbuilt function(cor.test) & how to get p-value from t-value(test of significance) ?
- how to determine/assign a numeric vector to "Y" in the cor.test function for spearman's correlations?