thr3ads.net - R devel - [Rd] Incorrect handling of NA's in cor() (PR#6750) [Apr 2004]

msa@biostat.mgh.harvard.edu

2004-Apr-09 18:46 UTC

[Rd] Incorrect handling of NA's in cor() (PR#6750)

Full_Name: Marek Ancukiewicz
Version: 1.8.1
OS: Linux
Submission from: (NULL) (132.183.12.87)


Function cor() incorrectly handles missing observations with
method="spearman":

> x <- c(1,2,3,NA,5,6)
> y <- c(4,NA,2,5,1,3)
> cor(x,y,use="complete.obs",method="s")
[1] -0.1428571
> cor(x[!is.na(x)&!is.na(y)],y[!is.na(x)&!is.na(y)],method="s")
[1] -0.4

These two results should be the same.
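
For reference, the value -0.4 can be checked by hand. The following is a minimal sketch, assuming incomplete pairs are dropped first and the remaining values are then ranked; it uses the textbook no-ties formula 1 - 6*sum(d^2)/(n*(n^2-1)). The names ok, rx, ry, d, n and rho are illustrative and do not come from the report.

```r
x <- c(1, 2, 3, NA, 5, 6)
y <- c(4, NA, 2, 5, 1, 3)

# Keep only the pairs in which both values are observed.
ok <- !is.na(x) & !is.na(y)

# Rank the surviving values within each variable.
rx <- rank(x[ok])   # 1 2 3 4
ry <- rank(y[ok])   # 4 2 1 3

# Spearman's rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
d <- rx - ry
n <- sum(ok)
rho <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho   # -0.4
```

Because there are no ties among the four complete pairs, the same number results from cor(rx, ry), the Pearson correlation of the ranks.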

ligges@statistik.uni-dortmund.de

2004-Apr-09 19:07 UTC

msa@biostat.mgh.harvard.edu wrote:
> Full_Name: Marek Ancukiewicz
> Version: 1.8.1
> OS: Linux
> Submission from: (NULL) (132.183.12.87)
>
> Function cor() incorrectly handles missing observations with
> method="spearman":
>
> > x <- c(1,2,3,NA,5,6)
> > y <- c(4,NA,2,5,1,3)
> > cor(x,y,use="complete.obs",method="s")
> [1] -0.1428571
> > cor(x[!is.na(x)&!is.na(y)],y[!is.na(x)&!is.na(y)],method="s")
> [1] -0.4
>
> These two results should be the same.

No! Please read at least the help file, ?cor, before submitting a bug
report:

"If use is "complete.obs" then missing values are handled by casewise
deletion. Finally, if use has the value "pairwise.complete.obs" then the
correlation between each pair of variables is computed using all
complete pairs of observations on those variables."

Hence
   cor(x, y, use="pairwise.complete.obs", method="s")
is what you expect ...

Uwe Ligges

msa@biostat.mgh.harvard.edu

2004-Apr-09 19:22 UTC

Dear Uwe,

You are wrong. First, I've read the help file before
submitting the report. For two variables,
use="pairwise.complete.obs" and use="complete.obs" should be
equivalent, shouldn't they? Of course, the results will be
different when we have more than 2 variables. Second, with the
call you proposed I am also getting an incorrect result:

> cor(x, y, use="pairwise.complete.obs", method="s")
[1] -0.1428571

The correct result is -0.4, as correctly calculated by
cor.test().
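
The comparison made here can be written out as a short check. This is a sketch only: it assumes a current R release, not 1.8.1, and the names rho_test, rho_complete and rho_pairwise are illustrative. In current releases the three values are expected to agree at -0.4, whereas the 1.8.1 output quoted in this thread does not.

```r
x <- c(1, 2, 3, NA, 5, 6)
y <- c(4, NA, 2, 5, 1, 3)

# cor.test() removes incomplete pairs before ranking.
rho_test <- unname(cor.test(x, y, method = "spearman")$estimate)

# With only two variables, casewise and pairwise deletion keep the same rows.
rho_complete <- cor(x, y, use = "complete.obs", method = "spearman")
rho_pairwise <- cor(x, y, use = "pairwise.complete.obs", method = "spearman")

c(rho_test, rho_complete, rho_pairwise)
```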

Regards

Marek Ancukiewicz



ligges@statistik.uni-dortmund.de

2004-Apr-09 19:35 UTC

Marek Ancukiewicz wrote:
> Dear Uwe,
> 
> You are wrong. 

Whoops. My apologies!!!


In R-1.9.0 beta I get:

cor(x[!is.na(x)&!is.na(y)],y[!is.na(x)&!is.na(y)],method="s")
# [1] -0.4
cor(x,y,use="complete.obs", method="s")
# [1] -0.5291503

I'll take a look!

Uwe
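
Both anomalous figures reported so far can be reproduced by ranking the full vectors before any deletion. This is an inference from the numbers, not a reading of the cor() source: ranking NAs last yields the 1.8.1 value, and keeping NA ranks and then dropping incomplete pairs yields the 1.9.0 beta value. The names ok, rx, ry, rho_na_last and rho_keep are illustrative.

```r
x <- c(1, 2, 3, NA, 5, 6)
y <- c(4, NA, 2, 5, 1, 3)
ok <- !is.na(x) & !is.na(y)

# NAs ranked last: no rank is missing, so all six pairs are used.
rho_na_last <- cor(rank(x, na.last = TRUE), rank(y, na.last = TRUE))

# NAs keep NA ranks: rank the full vectors, then drop incomplete pairs.
rx <- rank(x, na.last = "keep")
ry <- rank(y, na.last = "keep")
rho_keep <- cor(rx[ok], ry[ok])

round(c(rho_na_last, rho_keep), 7)   # -0.1428571 -0.5291503
```

Neither value equals -0.4, because in both cases the ranks of the four complete pairs still reflect the positions of the discarded observations.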



tlumley@u.washington.edu

2004-Apr-09 20:22 UTC

On Fri, 9 Apr 2004, Marek Ancukiewicz wrote:
>
> Dear Thomas,
>
> The question becomes: how do we rank missing values?

That's one of the questions.  It's not the only question.  Suppose x has
no missing values but y has a missing value.  Should the ranks for x be
based on the whole vector or just on the values where y isn't missing?

	-thomas
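
A small made-up example shows that the two choices in this question can give different answers even when x itself is complete. The data and the names rx_whole, rx_rerank, rho_whole and rho_rerank are hypothetical and serve only as illustration.

```r
x <- c(1, 2, 3, 4, 5)     # no missing values
y <- c(2, NA, 1, 4, 3)    # one missing value
ok <- !is.na(y)

# Option 1: rank x over the whole vector, then drop the pair where y is NA.
rx_whole <- rank(x)[ok]     # 1 3 4 5

# Option 2: drop the incomplete pair first, then rank what remains.
rx_rerank <- rank(x[ok])    # 1 2 3 4

ry <- rank(y[ok])           # 2 1 4 3

rho_whole <- cor(rx_whole, ry)
rho_rerank <- cor(rx_rerank, ry)
round(c(rho_whole, rho_rerank), 7)   # 0.5291503 0.6000000
```

Under option 1 the gap left by the discarded observation stays in the ranks of x, so the Pearson correlation of the ranks is no longer the usual Spearman coefficient of the retained pairs.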

p.dalgaard@biostat.ku.dk

2004-Apr-09 20:40 UTC

Marek Ancukiewicz <msa@biostat.mgh.harvard.edu> writes:
> Dear Thomas,
> 
> The question becomes: how do we rank missing values?  In
> version 1.8.1 at least, cor() uses the default handling of
> missing values by rank() [via the na.last parameter], that is
> missing values are assigned the highest rank. However, if
> nothing is known about the meaning of NA what would be the
> basis of such an assumption?  Assigning the NAs highest,
> lowest values, or any other values requires some additional
> information.
> 
> It seems that the default handling of missing values should be
> to assign them missing ranks: within cor(), rank() should be
> called with na.last="keep". 

Yes, and that is what 1.9.0beta is doing (it's not like this issue
hasn't been brought up before, just that the fix didn't quite fix it).
I think what we have now is still buggy, but at least it isn't biasing
rho towards +1 whenever x and y tend to be both missing at the same
time.
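
The bias toward +1 can be seen in a small made-up example in which x and y are missing in the same positions. This is a sketch with hypothetical data; the explicit na.last = TRUE stands in for the behaviour described for 1.8.1, and the names rho_observed and rho_na_last are illustrative.

```r
x <- c(3, 1, 2, NA, NA)
y <- c(1, 3, 2, NA, NA)
ok <- !is.na(x) & !is.na(y)

# The three observed pairs are perfectly discordant.
rho_observed <- cor(rank(x[ok]), rank(y[ok]))                          # -1

# Ranking NAs last gives the shared missing positions matching top ranks
# (4 and 5 in both variables), which pulls the coefficient upward.
rho_na_last <- cor(rank(x, na.last = TRUE), rank(y, na.last = TRUE))   # 0.6

c(rho_observed, rho_na_last)
```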

It's fairly easy to do something more sensible in the "complete.obs"
case, but getting "pairwise.complete.obs" right is tricky. 1.9.0
is in deep code freeze, so I don't think we should change things at
this point, except perhaps add a note to the help page.

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907

msa@biostat.mgh.harvard.edu

2004-Apr-09 21:50 UTC

> X-Original-To: msa@biostat.mgh.harvard.edu
> Date: Fri, 9 Apr 2004 11:21:47 -0700 (PDT)
> From: Thomas Lumley <tlumley@u.washington.edu>
> Cc: R-bugs@biostat.ku.dk
> 
> On Fri, 9 Apr 2004, Marek Ancukiewicz wrote:
> 
> >
> > Dear Thomas,
> >
> > The question becomes: how do we rank missing values?
> 
> That's one of the questions.  It's not the only question.  Suppose x has
> no missing values but y has a missing value.  Should the ranks for x be
> based on the whole vector or just on the values where y isn't missing?
> 
> 	-thomas

I see what you mean. 

One could give an argument in favour of each of these
approaches. If we treat the data primarily as pairs of values (or
more generally, cases) then we should discard incomplete pairs
(records) first and rank afterwards. If we consider x and y
primarily as separate from each other (especially with regard
to how the missing values arise) then a more natural approach
would be to do the ranking before dropping incomplete pairs. In
the latter approach we use more of the information in the data; in
the former approach we ignore information which might be
spurious, especially when missing y values tend to coincide
with high (low) x values. Dropping NAs first and ranking later
seems to be the conservative approach; with the other approach
one should probably always check whether NAs in one variable are
correlated with other variables.
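
One possible form of such a check, sketched with hypothetical data in which y is missing for the largest x values, is to compare x between rows where y is observed and rows where it is missing. The names miss_y, mean_by_missing and assoc are illustrative, and this is only one of several reasonable diagnostics.

```r
x <- 1:10
y <- c(3, 1, 2, 5, 4, 7, 6, NA, NA, NA)   # y is missing for the largest x
miss_y <- is.na(y)

# Mean of x among rows where y is observed (FALSE) and missing (TRUE).
mean_by_missing <- tapply(x, miss_y, mean)
mean_by_missing

# Rank association between x and the missingness indicator of y.
assoc <- cor(x, as.numeric(miss_y), method = "spearman")
assoc
```

A clearly nonzero association, as in this constructed case, signals that ranking before deletion would carry information about where y is missing into the ranks of x.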

My understanding is that cor() in 1.9.0 will do the ranking
independently, before dropping missing pairs/cases. It would
be good to have this documented in the help page; it might also be
good to add a warning about the perils of analysis with missing
values when occurrences of NAs in one variable are correlated
with other variables.

Marek
