msa@biostat.mgh.harvard.edu
2004-Apr-09 18:46 UTC
[Rd] Incorrect handling of NA's in cor() (PR#6750)
Full_Name: Marek Ancukiewicz
Version: 1.8.1
OS: Linux
Submission from: (NULL) (132.183.12.87)

Function cor() incorrectly handles missing observation with method="spearman":

> x <- c(1,2,3,NA,5,6)
> y <- c(4,NA,2,5,1,3)
> cor(x,y,use="complete.obs",method="s")
[1] -0.1428571
> cor(x[!is.na(x)&!is.na(y)],y[!is.na(x)&!is.na(y)],method="s")
[1] -0.4

These two results should be the same.
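[Editorial illustration, not part of the original report.] The value the report treats as correct can be checked by hand. The sketch below drops incomplete pairs first, ranks what is left, and applies the textbook no-ties Spearman formula. The names `ok`, `rx`, `ry`, `d`, `n` and `rho` are introduced here for illustration only.

```r
x <- c(1, 2, 3, NA, 5, 6)
y <- c(4, NA, 2, 5, 1, 3)

# Keep only the pairs in which both values are observed.
ok <- complete.cases(x, y)

# Rank the surviving values within each variable.
rx <- rank(x[ok])
ry <- rank(y[ok])

# Spearman's rho without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
d <- rx - ry
n <- sum(ok)
rho <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
print(rho)
```

Four complete pairs remain and the squared rank differences sum to 14, so rho = 1 - 84/60 = -0.4, which agrees with the second call in the report.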
ligges@statistik.uni-dortmund.de
2004-Apr-09 19:07 UTC
[Rd] Incorrect handling of NA's in cor() (PR#6750)
msa@biostat.mgh.harvard.edu wrote:
> Full_Name: Marek Ancukiewicz
> Version: 1.8.1
> OS: Linux
> Submission from: (NULL) (132.183.12.87)
>
> Function cor() incorrectly handles missing observation with method="spearman":
>
> > x <- c(1,2,3,NA,5,6)
> > y <- c(4,NA,2,5,1,3)
> > cor(x,y,use="complete.obs",method="s")
> [1] -0.1428571
> > cor(x[!is.na(x)&!is.na(y)],y[!is.na(x)&!is.na(y)],method="s")
> [1] -0.4
>
> These two results should be the same.

No! Please read at least the help file, ?cor, before submitting a bug report:

"If use is "complete.obs" then missing values are handled by casewise deletion. Finally, if use has the value "pairwise.complete.obs" then the correlation between each pair of variables is computed using all complete pairs of observations on those variables."

Hence
  cor(x, y, use="pairwise.complete.obs", method="s")
is what you expect ...

Uwe Ligges
msa@biostat.mgh.harvard.edu
2004-Apr-09 19:22 UTC
[Rd] Incorrect handling of NA's in cor() (PR#6750)
Dear Uwe,

You are wrong. First, I've read the help file before submitting the report. For two variables, use="pairwise.complete.obs" and use="complete.obs" should be equivalent, shouldn't it? Of course, the results will be different when we have more than 2 variables. Second, with the call you proposed I am also getting an incorrect result:

> cor(x, y, use="pairwise.complete.obs", method="s")
[1] -0.1428571

The correct result is -0.4, as correctly calculated by cor.test()

Regards

Marek Ancukiewicz

> X-Original-To: msa@biostat.mgh.harvard.edu
> Date: Fri, 09 Apr 2004 19:06:47 +0200
> From: Uwe Ligges <ligges@statistik.uni-dortmund.de>
> Organization: Fachbereich Statistik, Universitaet Dortmund
> X-Accept-Language: en-us, en, de-de, de
> Cc: R-bugs@biostat.ku.dk
>
> msa@biostat.mgh.harvard.edu wrote:
> > Full_Name: Marek Ancukiewicz
> > Version: 1.8.1
> > OS: Linux
> > Submission from: (NULL) (132.183.12.87)
> >
> > Function cor() incorrectly handles missing observation with method="spearman":
> >
> > > x <- c(1,2,3,NA,5,6)
> > > y <- c(4,NA,2,5,1,3)
> > > cor(x,y,use="complete.obs",method="s")
> > [1] -0.1428571
> > > cor(x[!is.na(x)&!is.na(y)],y[!is.na(x)&!is.na(y)],method="s")
> > [1] -0.4
> >
> > These two results should be the same.
>
> No! Please read at least the help file, ?cor, before submitting a bug
> report:
>
> "If use is "complete.obs" then missing values are handled by casewise
> deletion. Finally, if use has the value "pairwise.complete.obs" then the
> correlation between each pair of variables is computed using all
> complete pairs of observations on those variables."
>
> Hence
>   cor(x, y, use="pairwise.complete.obs", method="s")
> is what you expect ...
>
> Uwe Ligges
ligges@statistik.uni-dortmund.de
2004-Apr-09 19:35 UTC
[Rd] Incorrect handling of NA's in cor() (PR#6750)
Marek Ancukiewicz wrote:
> Dear Uwe,
>
> You are wrong.

Whoops. My apologies!!!
In R-1.9.0 beta I get:

  cor(x[!is.na(x)&!is.na(y)],y[!is.na(x)&!is.na(y)],method="s")
  # [1] -0.4
  cor(x,y,use="complete.obs", method="s")
  # [1] -0.5291503

I'll take a look!

Uwe

> First, I've read the help file before
> submitting the report. For two variables,
> use="pairwise.complete.obs" and use="complete.obs" should be
> equivalent, shouldn't it? Of course, the results will be
> different when we have more than 2 variables. Second, with the
> call you proposed I am also getting an incorrect result:
>
> > cor(x, y, use="pairwise.complete.obs", method="s")
> [1] -0.1428571
>
> The correct result is -0.4, as correctly calculated by
> cor.test()
>
> Regards
>
> Marek Ancukiewicz
>
> > X-Original-To: msa@biostat.mgh.harvard.edu
> > Date: Fri, 09 Apr 2004 19:06:47 +0200
> > From: Uwe Ligges <ligges@statistik.uni-dortmund.de>
> > Organization: Fachbereich Statistik, Universitaet Dortmund
> > X-Accept-Language: en-us, en, de-de, de
> > Cc: R-bugs@biostat.ku.dk
> >
> > msa@biostat.mgh.harvard.edu wrote:
> > > Full_Name: Marek Ancukiewicz
> > > Version: 1.8.1
> > > OS: Linux
> > > Submission from: (NULL) (132.183.12.87)
> > >
> > > Function cor() incorrectly handles missing observation with method="spearman":
> > >
> > > > x <- c(1,2,3,NA,5,6)
> > > > y <- c(4,NA,2,5,1,3)
> > > > cor(x,y,use="complete.obs",method="s")
> > > [1] -0.1428571
> > > > cor(x[!is.na(x)&!is.na(y)],y[!is.na(x)&!is.na(y)],method="s")
> > > [1] -0.4
> > >
> > > These two results should be the same.
> >
> > No! Please read at least the help file, ?cor, before submitting a bug
> > report:
> >
> > "If use is "complete.obs" then missing values are handled by casewise
> > deletion. Finally, if use has the value "pairwise.complete.obs" then the
> > correlation between each pair of variables is computed using all
> > complete pairs of observations on those variables."
> >
> > Hence
> >   cor(x, y, use="pairwise.complete.obs", method="s")
> > is what you expect ...
> >
> > Uwe Ligges
tlumley@u.washington.edu
2004-Apr-09 20:22 UTC
[Rd] Incorrect handling of NA's in cor() (PR#6750)
On Fri, 9 Apr 2004, Marek Ancukiewicz wrote:
> Dear Thomas,
>
> The question becomes: how do we rank missing values?

That's one of the questions. It's not the only question. Suppose x has no missing values but y has a missing value. Should the ranks for x be based on the whole vector or just on the values where y isn't missing?

	-thomas
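[Editorial illustration, not part of the original message.] Thomas's question can be made concrete with a small made-up data set in which only y has a missing value. The data and the names `rank_whole` and `rank_subset` are hypothetical; the sketch only shows that the two choices give x different ranks.

```r
# Hypothetical data: x is fully observed, y has one missing value.
x <- c(10, 20, 30, 40, 50)
y <- c(3, NA, 1, 4, 2)
ok <- !is.na(y)

# Option A: rank x over the whole vector, then keep rows where y is observed.
rank_whole <- rank(x)[ok]

# Option B: drop rows where y is missing, then rank what is left of x.
rank_subset <- rank(x[ok])

print(rank_whole)   # 1 3 4 5
print(rank_subset)  # 1 2 3 4
```

Option A leaves a gap where the dropped observation sat, so the two rank vectors are not linear transforms of each other and a Pearson correlation computed on them can differ.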
p.dalgaard@biostat.ku.dk
2004-Apr-09 20:40 UTC
[Rd] Incorrect handling of NA's in cor() (PR#6750)
Marek Ancukiewicz <msa@biostat.mgh.harvard.edu> writes:

> Dear Thomas,
>
> The question becomes: how do we rank missing values? In
> version 1.8.1 at least, cor() uses default handling of
> missing values by rank() [by na.last parameter], that is
> missing values are assigned the highest rank. However, if
> nothing is known about the meaning of NA what would be the
> basis of such an assumption? Assigning the NAs highest,
> lowest values, or any other values requires some additional
> information.
>
> It seems that the default handling of missing values should be
> to assign them missing ranks: within cor(), rank() should be
> called with na.last="keep".

Yes, and that is what 1.9.0beta is doing (it's not like this issue hasn't been brought up before, just that the fix didn't quite fix it). I think what we have now is still buggy, but at least it isn't biasing rho towards +1 whenever x and y tend to be both missing at the same time.

It's fairly easy to do something more sensible in the complete.cases case, but getting pairwise.complete.cases right is tricky.

1.9.0 is in deep code freeze, so I don't think we should change things at this point, except perhaps add a note to the help page.

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
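[Editorial illustration, not part of the original message.] The effect of the na.last argument of rank() discussed above can be seen on the data from the report. The sketch assumes that 1.8.1 ranked with the default na.last = TRUE and then correlated all six pairs; that reading is inferred from Marek's description and from the numbers, not from the R sources. The variable names are introduced here for illustration.

```r
x <- c(1, 2, 3, NA, 5, 6)
y <- c(4, NA, 2, 5, 1, 3)

# Default na.last = TRUE: each NA receives the highest rank.
rx_last <- rank(x)
ry_last <- rank(y)
print(rx_last)  # 1 2 3 6 4 5
print(ry_last)  # 4 6 2 5 1 3

# No NA is left after ranking, so all six pairs enter the correlation.
rho_last <- cor(rx_last, ry_last)
print(rho_last)

# na.last = "keep": the NA stays NA and the other values are ranked 1..5.
rx_keep <- rank(x, na.last = "keep")
print(rx_keep)  # 1 2 3 NA 4 5
```

Under that assumption rho_last comes out as -1/7, about -0.1428571, the same figure as in the original report. When both variables are missing in the same rows, ranking NAs last gives those rows the top rank in both variables, which is the bias towards +1 that Peter mentions.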
msa@biostat.mgh.harvard.edu
2004-Apr-09 21:50 UTC
[Rd] Incorrect handling of NA's in cor() (PR#6750)
> X-Original-To: msa@biostat.mgh.harvard.edu
> Date: Fri, 9 Apr 2004 11:21:47 -0700 (PDT)
> From: Thomas Lumley <tlumley@u.washington.edu>
> Cc: R-bugs@biostat.ku.dk
>
> On Fri, 9 Apr 2004, Marek Ancukiewicz wrote:
> > Dear Thomas,
> >
> > The question becomes: how do we rank missing values?
>
> That's one of the questions. It's not the only question. Suppose x has
> no missing values but y has a missing value. Should the ranks for x be
> based on the whole vector or just on the values where y isn't missing?
>
> 	-thomas

I see what you mean. One could give an argument in favour of each of these approaches. If we treat data primarily as pairs of values (or more generally, cases) then we should discard incomplete pairs (records) first and rank afterwards. If we consider x and y primarily as separate from each other (especially with regard to how the missing values arise) then a more natural approach would be to do ranking before dropping incomplete pairs.

In the latter approach we use more information in the data; in the former approach we ignore the information which might be spurious, especially when missing y values tend to coincide with high (low) x values. Dropping NAs first and ranking later seems to be a conservative approach; with the other approach one should probably always check if NAs in one variable are correlated with other variables.

My understanding is that cor() in 1.9.0 will do ranking independently, before dropping missing pairs/cases. It would be good to have this documented in help(); it might also be good to add a warning on the perils of analysis with missing values when occurrences of NAs in one variable are correlated with other variables.

Marek
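[Editorial illustration, not part of the original message.] The two orders of operation described above can be compared directly on the data from the report. This is a sketch using only rank(), complete.cases() and a Pearson cor() on the ranks; the names `rho_rank_first` and `rho_drop_first` are introduced here, and the sketch does not claim to reproduce the internals of any R release.

```r
x <- c(1, 2, 3, NA, 5, 6)
y <- c(4, NA, 2, 5, 1, 3)

# Rank each variable separately (NAs stay NA), then drop incomplete pairs.
rx <- rank(x, na.last = "keep")
ry <- rank(y, na.last = "keep")
ok_ranks <- complete.cases(rx, ry)
rho_rank_first <- cor(rx[ok_ranks], ry[ok_ranks])

# Drop incomplete pairs first, then rank the remaining values.
ok_data <- complete.cases(x, y)
rho_drop_first <- cor(rank(x[ok_data]), rank(y[ok_data]))

print(c(rank_first = rho_rank_first, drop_first = rho_drop_first))
```

Ranking first gives about -0.5291503, the figure Uwe reported from the 1.9.0 beta, and dropping first gives -0.4. The match is consistent with Marek's understanding of what 1.9.0 does, but it is not by itself proof of it.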