thr3ads.net - R help - [R] Correlate [Aug 2022]


John Fox

2022-Aug-22 18:00 UTC

[R] Correlate

Dear Val,

On 2022-08-22 1:33 p.m., Val wrote:
> For the time being I am assuming the relationship across variables
> is linear. I want to get the values first; detailed examination of
> the relationship will follow later.

This seems backwards to me, but I'll refrain from commenting further on 
whether what you want to do makes sense and instead address how to do it 
(not, BTW, because I disagree with Bert's and Tim's remarks).

Please see below:
> 
> On Mon, Aug 22, 2022 at 12:23 PM Ebert,Timothy Aaron <tebert at ufl.edu> wrote:
>>
>> I (maybe) agree, but I would go further than that. There are
>> assumptions associated with the test that are missing. It is not clear
>> that the relationships are all linear. Regardless of a "significant
>> outcome", all of the relationships need to be explored in more detail
>> than what is provided in the correlation test.
>>
>> Multiplicity adjustment as in
>> https://www.sciencedirect.com/science/article/pii/S0197245600001069
>> is not an issue that I can see in these data from the information
>> provided. At least not in the same sense as used in the link.
>>
>> My first guess at the meaning of "multiplicity adjustment" was closer
>> to the experimentwise error rate in a multiple comparison procedure:
>> https://dictionary.apa.org/experiment-wise-error-rate
>> Essentially, the type 1 error rate is inflated the more tests you do,
>> and if you perform enough tests you find significant outcomes by chance
>> alone. There is great significance in the Redskins rule:
>> https://en.wikipedia.org/wiki/Redskins_Rule
>>
>> A simple solution is to apply a Bonferroni correction where alpha is
>> divided by the number of comparisons. If there are 250, then
>> 0.05/250 = 0.0002. Another approach is to try to discuss the outcomes
>> in a way that makes sense. What is the connection between a football
>> team's last home game and the election result that would enable me to
>> take another team and apply their last home game result to the outcome
>> of a different election?
>>
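A minimal sketch of the Bonferroni arithmetic described above, using base R's p.adjust(). The four p-values are hypothetical and chosen only for illustration; they are not results from the sample data in this thread.

```r
# Hypothetical p-values for illustration only (not from the thread's data).
p <- c(0.0001, 0.004, 0.03, 0.2)
m <- 250        # number of comparisons assumed in the discussion
alpha <- 0.05

# Per-test threshold under Bonferroni: 0.05 / 250 = 0.0002
threshold <- alpha / m

# Equivalent view: inflate each p-value and compare it with alpha.
# n = m tells p.adjust() the total number of tests, even though
# only four of the p-values are listed here.
p_bonf <- p.adjust(p, method = "bonferroni", n = m)

data.frame(p, p_bonf, reject = p_bonf < alpha)
```

Comparing the raw p-values with alpha/m and comparing the adjusted p-values with alpha lead to the same decisions; p.adjust() also offers less conservative methods such as "holm".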
>> Another complication is if variables x2 through x250 are themselves
>> correlated. Not enough information was provided in the problem to know
>> if this is an issue, but 250 orthogonal variables in a real dataset
>> would be a bit unusual considering the experimentwise error rate
>> previously mentioned.
>>
>> Large datasets can be very messy.
>>
>>
>> Tim
>>
>> -----Original Message-----
>> From: Bert Gunter <bgunter.4567 at gmail.com>
>> Sent: Monday, August 22, 2022 12:07 PM
>> To: Ebert,Timothy Aaron <tebert at ufl.edu>
>> Cc: Val <valkremk at gmail.com>; r-help at R-project.org (r-help at r-project.org) <r-help at r-project.org>
>> Subject: Re: [R] Correlate
>>
>> [External Email]
>>
>> ... But of course the p-values are essentially meaningless without some
>> sort of multiplicity adjustment.
>> (search on "multiplicity adjustment" for details). :-(
>>
>> -- Bert
>>
>>
>> On Mon, Aug 22, 2022 at 8:59 AM Ebert,Timothy Aaron <tebert at ufl.edu> wrote:
>>>
>>> A somewhat clunky solution:
>>> for(i in colnames(dat)){
>>>    print(cor.test(dat[,i], dat$x1, method = "pearson", use = "complete.obs")$estimate)
>>>    print(cor.test(dat[,i], dat$x1, method = "pearson", use = "complete.obs")$p.value)
>>> }

Because of missing data, this computes the correlations on different 
subsets of the data. A simple solution is to filter the data for NAs:

D <- na.omit(dat)
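
As a small illustration of what this filtering does to the sample data from the original post (a sketch only; reading "." as NA via na.strings is a convenience assumed here so that the block is self-contained):

```r
# Sample data from the original post; "." marks a missing value.
dat <- read.table(text = "x1 x2 x3 x4
 1.68 -0.96 -1.25  0.61
-0.06  0.41  0.06 -0.96
    .  0.08  1.14  1.42
 0.80 -0.67  0.53 -0.68
 0.23 -0.97 -1.18 -0.78
-1.03  1.11 -0.61     .
 2.15     .  0.02  0.66
 0.35 -0.37 -0.26  0.39
-0.66  0.89     . -1.49
 0.11  1.52  0.73 -1.03", header = TRUE, na.strings = ".")

D <- na.omit(dat)   # keep only rows with no NA in any column
nrow(dat)           # rows in the sample data
nrow(D)             # rows remaining after listwise deletion

# On D, every correlation with x1 is computed from the same rows.
cor(D[, colnames(D) != "x1"], D$x1)
```

Four of the ten rows contain a missing value in some column, so every correlation computed on D rests on the same six rows.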

More comments below:
>>>
>>> Rather than printing you could set up an array or list to save the
>>> results.
>>>
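One possible way to save the results instead of printing them is sketched below. The names `others` and `res` are illustrative, not from the thread. Note that cor.test() drops incomplete pairs itself, so each row of the result is based on the pairs available for that particular variable.

```r
# Sample data from the original post; "." marks a missing value.
dat <- read.table(text = "x1 x2 x3 x4
 1.68 -0.96 -1.25  0.61
-0.06  0.41  0.06 -0.96
    .  0.08  1.14  1.42
 0.80 -0.67  0.53 -0.68
 0.23 -0.97 -1.18 -0.78
-1.03  1.11 -0.61     .
 2.15     .  0.02  0.66
 0.35 -0.37 -0.26  0.39
-0.66  0.89     . -1.49
 0.11  1.52  0.73 -1.03", header = TRUE, na.strings = ".")

# Every column except x1 (illustrative helper name).
others <- setdiff(colnames(dat), "x1")

# One cor.test() per variable, collected as one data-frame row each.
res <- do.call(rbind, lapply(others, function(v) {
  ct <- cor.test(dat[[v]], dat$x1, method = "pearson")
  data.frame(variable = v,
             r        = unname(ct$estimate),
             p.value  = ct$p.value)
}))
res
```

With about 250 columns the same code applies unchanged; only the first column name ("x1") would need to match the variable of interest.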
>>>
>>> Tim
>>>
>>> -----Original Message-----
>>> From: R-help <r-help-bounces at r-project.org> On Behalf Of Val
>>> Sent: Monday, August 22, 2022 11:09 AM
>>> To: r-help at R-project.org (r-help at r-project.org) <r-help at r-project.org>
>>> Subject: [R] Correlate
>>>
>>> [External Email]
>>>
>>> Hi all,
>>>
>>> I have a data set with ~250 variables (columns). I want to calculate
>>> the correlation of one variable with the rest of the other variables
>>> and also want the p-values for each correlation. Please see the
>>> sample data and my attempt. I have got the correlation but am unable
>>> to get the p-values.
>>>
>>> dat <- read.table(text="x1 x2 x3 x4
>>>             1.68 -0.96 -1.25  0.61
>>>            -0.06  0.41  0.06 -0.96
>>>                .    0.08  1.14  1.42
>>>             0.80 -0.67  0.53 -0.68
>>>             0.23 -0.97 -1.18 -0.78
>>>            -1.03  1.11 -0.61    .
>>>             2.15     .    0.02  0.66
>>>             0.35 -0.37 -0.26  0.39
>>>            -0.66  0.89   .    -1.49
>>>             0.11  1.52  0.73  -1.03",header=TRUE)
>>>
>>> #change all to numeric
>>>      dat[] <- lapply(dat, function(x) as.numeric(as.character(x)))
This data manipulation is unnecessary. Just specify the argument 
na.strings="." to read.table().
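
A short sketch of that suggestion applied to the sample data, so that the columns arrive as numeric and no conversion step is needed:

```r
# na.strings = "." makes read.table() treat "." as NA while parsing,
# so every column is read as numeric directly.
dat <- read.table(text = "x1 x2 x3 x4
 1.68 -0.96 -1.25  0.61
-0.06  0.41  0.06 -0.96
    .  0.08  1.14  1.42
 0.80 -0.67  0.53 -0.68
 0.23 -0.97 -1.18 -0.78
-1.03  1.11 -0.61     .
 2.15     .  0.02  0.66
 0.35 -0.37 -0.26  0.39
-0.66  0.89     . -1.49
 0.11  1.52  0.73 -1.03", header = TRUE, na.strings = ".")

str(dat)          # column types
colSums(is.na(dat))   # number of missing values per column
```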
>>>
>>>      data_cor <- cor(dat[ , colnames(dat) != "x1"], dat$x1, method = "pearson", use = "complete.obs")
>>>
>>> Result
>>>                [,1]
>>> x2 -0.5845835
>>> x3 -0.4664220
>>> x4  0.7202837
>>>
>>> How do I get the p-values ?
Taking a somewhat different approach from cor.test(), you can apply 
Fisher's z-transformation (recall that D is the data filtered for NAs):

 > 2*pnorm(abs(atanh(data_cor)), sd=1/sqrt(nrow(D) - 3), lower.tail=FALSE)
         [,1]
x2 0.2462807
x3 0.3812854
x4 0.1156939
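
For comparison, the sketch below puts the Fisher z p-values next to the t-based p-values that cor.test() reports for Pearson correlations, both computed on the NA-filtered data. The names `r`, `n`, `p_z`, `t_stat` and `p_t` are illustrative only.

```r
# Sample data from the original post; "." marks a missing value.
dat <- read.table(text = "x1 x2 x3 x4
 1.68 -0.96 -1.25  0.61
-0.06  0.41  0.06 -0.96
    .  0.08  1.14  1.42
 0.80 -0.67  0.53 -0.68
 0.23 -0.97 -1.18 -0.78
-1.03  1.11 -0.61     .
 2.15     .  0.02  0.66
 0.35 -0.37 -0.26  0.39
-0.66  0.89     . -1.49
 0.11  1.52  0.73 -1.03", header = TRUE, na.strings = ".")

D <- na.omit(dat)                        # listwise deletion
r <- cor(D[, colnames(D) != "x1"], D$x1) # 3 x 1 matrix of correlations
n <- nrow(D)

# Fisher z: atanh(r) is approximately normal with sd 1/sqrt(n - 3).
p_z <- 2 * pnorm(abs(atanh(r)), sd = 1 / sqrt(n - 3), lower.tail = FALSE)

# t-based test: t = r * sqrt((n - 2) / (1 - r^2)) on n - 2 df.
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_t <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

cbind(r = r[, 1], p_z = p_z[, 1], p_t = p_t[, 1])
```

The two sets of p-values need not agree closely when n is this small, since the normal approximation behind the z-transformation is a large-sample result.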

I hope this helps,
  John
>>>
>>> Thank you,
>>>
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
-- 
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
web: https://socialsciences.mcmaster.ca/jfox/

Val

2022-Aug-26 14:41 UTC


[R] Correlate

Hi John and Timothy

Thank you for your suggestion and help. Using the sample data, I did
carry out a test run and found a difference in the correlation result.

Option 1.
data_cor <- cor(dat[ , colnames(dat) != "x1"],  # Calculate correlations
                dat$x1, method = "pearson", use = "complete.obs")
resulted
                 [,1]
    x2 -0.5845835
    x3 -0.4664220
    x4  0.7202837

Option 2.
 for(i in colnames(dat)){
      print(cor.test(dat[,i], dat$x1, method = "pearson", use = "complete.obs")$estimate)
    }
           [,1]
x2  -0.7362030
x3  -0.04935132
x4   0.85766290

This was cross-checked using Excel and other software, and all match
Option 2.
One of the factors that contributed to this difference is the loss of
information when we use na.omit(). If x2 has a missing value but x3
and x4 don't, then na.omit() removes the entire row, including the
x3 and x4 values.

My question: is there a way to extract the number of rows (N) used in
the correlation analysis?
Thank you,
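
One possible way to get N, sketched under the assumption that the data are read with "." as NA: count the rows where both x1 and the other variable are observed, or recover the same number from the degrees of freedom that cor.test() returns (for a Pearson test, df = N - 2). The names `others`, `n_pair`, `n_ct` and `n_listwise` are illustrative.

```r
# Sample data from the original post; "." marks a missing value.
dat <- read.table(text = "x1 x2 x3 x4
 1.68 -0.96 -1.25  0.61
-0.06  0.41  0.06 -0.96
    .  0.08  1.14  1.42
 0.80 -0.67  0.53 -0.68
 0.23 -0.97 -1.18 -0.78
-1.03  1.11 -0.61     .
 2.15     .  0.02  0.66
 0.35 -0.37 -0.26  0.39
-0.66  0.89     . -1.49
 0.11  1.52  0.73 -1.03", header = TRUE, na.strings = ".")

others <- setdiff(colnames(dat), "x1")

# Pairwise N (Option 2): rows where both x1 and the other variable
# are non-missing.
n_pair <- colSums(!is.na(dat[others]) & !is.na(dat$x1))

# The same N recovered from cor.test(): Pearson df = N - 2.
n_ct <- sapply(others, function(v)
  unname(cor.test(dat[[v]], dat$x1)$parameter) + 2)

# Listwise N (Option 1, use = "complete.obs"): rows complete in all columns.
n_listwise <- sum(complete.cases(dat))

n_pair
n_ct
n_listwise
```

The pairwise counts differ from the listwise count whenever a row is missing a value only in some third variable, which is the source of the discrepancy between the two options.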
