Dear Tom,
I'm afraid that you're confusing the point-biserial correlation with the
biserial correlation. It's the latter that polyserial() in the polycor
package computes; the biserial correlation is a special case when the
(ordered) categorical variable has just two categories.
Generally, biserial correlations will be larger than point-serial
correlations, since the former estimates the correlation in an underlying
bivariate normal distribution that has been binned. I haven't looked closely
at your example, but suspect that the difference in sign is just due to the
arbitrary specification of which category is higher than the other. As for a
reference, ?polyserial gives one.
I hope this helps,
John
--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Tom Willems
> Sent: Monday, October 29, 2007 11:01 AM
> To: r-help at r-project.org
> Subject: [R] biserial correlation with pkg polycor
>
> Dear R-ussers,
>
> While looking for a way to calculate the association between
> a countinuous and a binary variable, i found a procedure
> called point biserial corralation.
> Me, not being a mathematicion, i did my very best to
> understand what it was all about, and then i found a easily
> understandable paper (by steve
> simon) on ow to calculate this. ref ##
> http://www.childrens-mercy.org/stats/definitions/biserial.htm
> (this page has the same example) Further i discovered the
> polycor package in R.
> Now i'm having troubles with the fact that the polycor pkg
> never gives me the same output as the manuals aplication of
> the formula.
>
> In the example below found, manualy r(biserial) = 0.49
> between fb an age, and ussing function polyserial (polycor
> pkg) r(biserial) =-0.8591.
> This is a rather big difference, no due to abriviation or
> flootingpoints.
>
> Is there someone whom is familiar with biserial correlation,
> and the appropriate way to calculate it?
>
> Kind regards,
> Tom.
>
> here is the example, at the end is the R file.
>
> 1e I create the input
>
> > library(abind)
> > library (polycor)
> > ### data input
> > no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8)
> fb <- c(19,
> > 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 14, 14, 22,
> 17)
> > ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 12, 14,
> > 12,
> 18)
> > age <-c('elderly', 'elderly', 'elderly',
'elderly', 'elderly',
> 'elderly', 'elderly', 'elderly', 'elderly',
'young', 'young',
> 'young', 'young', 'young', 'young',
'young', 'young')
> >
> > dataset <- data.frame(no,fb,ss,age)
> > dataset <- subset(dataset,select=c(fb:age))
> > nrow(dataset)
> [1] 17
> > data_eld <- subset(dataset,age=='elderly',select=fb)
> > data_young <- subset(dataset,age=='young',select=fb)
> >
>
> here i calculate the R_bis (biserial corelation) manualy
>
> > ### point biserial correlation
> >
> > fb <- subset(dataset,select=fb)
> > fb0 <- subset(dataset,age!='elderly',select=fb)
> > fb1 <- subset(dataset,age=='elderly',select=fb)
> > meanfb0 <- mean(fb0,na.rm=T)
> > meanfb1 <- mean(fb1,na.rm=T)
> > sdfb<- sd(dataset$fb,na.rm=T)
> >
> > ss <- subset(dataset, select=ss)
> > ss0 <- subset(dataset,age!='elderly',select=ss)
> > ss1 <- subset(dataset,age=='elderly',select=ss)
> > meanss0 <- mean(ss0,na.rm=T)
> > meanss1 <- mean(ss1,na.rm=T)
> > sdss<- sd(dataset$ss,na.rm=T)
> > age <- subset(dataset,select=age)
> > n <- nrow(dataset)
> >
> this is the formula from ref ##
> http://www.childrens-mercy.org/stats/definitions/biserial.htm
>
> > > R_bis <- function(x,x1,x0,n){ p <- (nrow(x1)/n)
> >+ (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))
> *sqrt(p*(1-p))) }
>
> this is the corrected formula from ref ##
> http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient
> >> R_bis2 <- function(x,x1,x0,n){
> + ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))
> * (sqrt(
> (nrow(x1)*nrow(x0))/(n*(n-1))))}
> >
> >> R_bis(fb,fb1,fb0,n)
> > fb
> >0.4798873
>
> result in paper was 0.49
>
> >> R_bis2(fb,fb1,fb0,n)
> > fb
> >0.4946565
>
> equals result in paper 0.49
>
> Then i use the polycor package,
> function hetcor will give all the different correlation ressults
>
>
> >> hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE)
> >
> >Maximum-Likelihood Estimates
> >
> >Correlations/Type of Correlation:
> > dataset$fb dataset.ss dataset.age
> >dataset$fb 1 Pearson Polyserial
> >dataset.ss 0.703 1 Polyserial
> >dataset.age -0.8591 -0.6685 1
> >
> >Standard Errors:
> > dataset$fb dataset.ss
> >dataset$fb
> >dataset.ss 0.1215
> >dataset.age 0.1106 0.2497
> >
> >n = 17
> >
> >P-values for Tests of Bivariate Normality:
> > dataset$fb dataset.ss
> >dataset$fb
> >dataset.ss 0.1782
> >dataset.age 0.4269 0.4034
> > hetcor(dataset,ML=TRUE)
> >
> >Maximum-Likelihood Estimates
> >
> >Correlations/Type of Correlation:
> > fb ss age
> >fb 1 Pearson Polyserial
> >ss 0.703 1 Polyserial
> >age -0.8591 -0.6685 1
> >
> >Standard Errors:
> > fb ss
> >fb
> >ss 0.1215
> >age 0.1106 0.2497
> >
> >n = 17
> >
> >P-values for Tests of Bivariate Normality:
> > fb ss
> >fb
> >ss 0.1782
> >age 0.4269 0.4034
>
> here a quick two step method is ussed to calculate the polyserial
> correlation
>
> > >polyserial(dataset$fb,dataset$age)
> >[1] -0.6205737
> >> polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE)
>
> same method as in hetcor, only for indecated variables
>
> >Polyserial Correlation, ML est. = -0.8591 (0.1106)
> >Test of bivariate normality: Chisquare = 4.91, df = 5, p = 0.4269
> >
> > 1
> >Threshold 0.1811
> >Std.Err. 0.1849
> >
>
>
> > ### for side to side (ss) incase no 9 is an outlier in
> fb, this will
> not be the case in ss
> >
> >> R_bis(ss,ss1,ss0,n)
> > ss
> >0.4153681
>
> result in paper was 0.43
>
> >> R_bis2(ss,ss1,ss0,n)
> > ss
> >0.4281516
>
> equals result in paper 0.43
>
> >> polyserial(dataset$ss,dataset$age)
> >[1] -0.5371397
>
> >> polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE)
> >
> >Polyserial Correlation, ML est. = -0.6685 (0.2497)
> >Test of bivariate normality: Chisquare = 5.103, df = 5, p = 0.4034
> >
> > 1
> >Threshold 0.1504
> >Std.Err. 0.2583
> ##############################################################
> ###################
>
> ### R-file
> ##############################################################
> ###################
>
> library(abind)
> library (polycor)
> ### data input
> no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8)
> fb <- c(19, 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15,
> 14, 14, 22,
> 17)
> ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22,
> 12, 14, 12,
> 18)
> age <-c('elderly', 'elderly', 'elderly',
'elderly',
> 'elderly', 'elderly',
> 'elderly', 'elderly', 'elderly', 'young',
'young', 'young', 'young',
> 'young', 'young', 'young', 'young')
>
> dataset <- data.frame(no,fb,ss,age)
> dataset <- subset(dataset,select=c(fb:age))
> nrow(dataset)
> data_eld <- subset(dataset,age=='elderly',select=fb)
> data_young <- subset(dataset,age=='young',select=fb)
>
> ### point biserial correlation
>
> fb <- subset(dataset,select=fb)
> fb0 <- subset(dataset,age!='elderly',select=fb)
> fb1 <- subset(dataset,age=='elderly',select=fb)
> meanfb0 <- mean(fb0,na.rm=T)
> meanfb1 <- mean(fb1,na.rm=T)
> sdfb<- sd(dataset$fb,na.rm=T)
>
> ss <- subset(dataset, select=ss)
> ss0 <- subset(dataset,age!='elderly',select=ss)
> ss1 <- subset(dataset,age=='elderly',select=ss)
> meanss0 <- mean(ss0,na.rm=T)
> meanss1 <- mean(ss1,na.rm=T)
> sdss<- sd(dataset$ss,na.rm=T)
> age <- subset(dataset,select=age)
> n <- nrow(dataset)
>
> R_bis <- function(x,x1,x0,n){ p <- (nrow(x1)/n)
> (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))
> *sqrt(p*(1-p))) }
> R_bis2 <- function(x,x1,x0,n){
> ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) * (sqrt(
> (nrow(x1)*nrow(x0))/(n*(n-1))))}
>
> R_bis(fb,fb1,fb0,n)
> R_bis2(fb,fb1,fb0,n)
>
>
> hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE)
> hetcor(dataset,ML=TRUE)
>
> polyserial(dataset$fb,dataset$age)
> polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE)
>
> ### for side to side (ss) incase no 9 is an outlier in fb,
> this will not
> be the case in ss
>
> R_bis(ss,ss1,ss0,n)
> R_bis2(ss,ss1,ss0,n)
>
> polyserial(dataset$ss,dataset$age)
> polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE)
>
> ## END
> ##############################################################
> ############
>
> Willems Tom
>
> E-mail: towil at var.fgov.be
>
>
>
>
> Disclaimer: click here
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>