thr3ads.net - R help - [R] cross validation function translated from stata [Jan 2010]

If this information is useful, please help other people find it:
Share via:

zhu yao

2010-Jan-21 13:55 UTC

[R] cross validation function translated from stata

Hi, everyone:

I ask for help about translating a stata program into R.

The program perform cross validation as it stated.

#1. Randomly divide the data set into 10 sets of equal size, ensuring equal
numbers of events in each set
#2. Fit the model leaving out the 1st set
#3. Apply the fitted model in (2) to the 1st set to obtain the predicted
probability of a prostate cancer diagnosis.
#4. Repeat steps (2) to (3) leaving out and then applying the fitted model
to the ith group, i = 2, 3... 10. Every subject now has a predicted
probability of a prostate cancer diagnosis.
#5. Using the predicted probabilities, compute the net benefit at various
threshold probabilities.
#6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
threshold probability is the mean across the 200 replications.

================================================================First is stata
code.

forvalues i=1(1)200 {
local event="cancer"
local predictors1 = "total_psa"
local predictors2 = "total_psa free_psa"
local prediction1 = "base"
local prediction2 = "full"

g `prediction1'=.
g `prediction2'=.

quietly g u = uniform()
sort `event' u
g set = mod(_n, 10) + 1

forvalues j=1(1)10{
quietly logit `event' `predictors1' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction1' = ptemp if set==`j'
drop ptemp


quietly logit `event' `predictors2' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction2' = ptemp if set==`j'
drop ptemp
}
tempfile dca`i'
quietly dca `event' `prediction1' `prediction2', graphoff
saving("`dca`i''")

drop u set `prediction1' `prediction2'
}


use "`dca1'", clear
forvalues i=2(1)200 {
append using "`dca`i''"
}

collapse all none modelp1 modelp2, by(threshold)

save "cross validation dca output.dta", replace

twoway(line none all modelp1 modelp2 threshold, sort)

================================================================Here is my draft
of R code. cMain is my dataset.

predca<-rep(0,40000)
dim(predca)<-c(200,200)

for (i in 1:200) {
 cvgroup<-rep(1:10,length=110)
cvgroup<-sample(cvgroup)
cvpre<-rep(0,length=110)
cvMain<-cbind(cMain,cvgroup,cvpre)

for (j in 1:10) {
cvdev<-cvMain[cvMain$cvgroup!=j,]
cvval<-cvMain[cvMain$cvgroup==j,]
 cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
cvprej<-predict(cvfit,cvval,type="fitted")

#put the fitted value in dataset
cvMain[cvgroup==j,]$cvpre<-prej
}

cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
cvnr<-cvnb/cvtpthres
predca[cvn,1:99]<-cvnb
predca[cvn,101:199]<-cvnr
}

================================================================
My questions are
1. How to ensure equal numbers of events in each set in R?
2. A part of stata code is
                        forvalues j=1(1)10{
quietly logit `event' `predictors1' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction1' = ptemp if set==`j'
drop ptemp

quietly logit `event' `predictors2' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction2' = ptemp if set==`j'
drop ptemp
}
  I don't understand what's the difference between prediction1 and
prediction2
3. Is my code right?

Thanks !

Yao Zhu
Department of Urology
Fudan University Shanghai Cancer Center
Shanghai, China

	[[alternative HTML version deleted]]

Frank E Harrell Jr

2010-Jan-21 14:33 UTC

head link

[R] cross validation function translated from stata

Take a look at the validate.lrm function in the rms package.

Note that the use of threshold probabilities results in an improper 
scoring rule which will mislead you.  Also note that you need to repeat 
10-fold CV 50-100 times for precision, and that at each repeat you have 
to start from zero in analyzing associations.

Frank

zhu yao wrote:> Hi, everyone:
> 
> I ask for help about translating a stata program into R.
> 
> The program perform cross validation as it stated.
> 
> #1. Randomly divide the data set into 10 sets of equal size, ensuring equal
> numbers of events in each set
> #2. Fit the model leaving out the 1st set
> #3. Apply the fitted model in (2) to the 1st set to obtain the predicted
> probability of a prostate cancer diagnosis.
> #4. Repeat steps (2) to (3) leaving out and then applying the fitted model
> to the ith group, i = 2, 3... 10. Every subject now has a predicted
> probability of a prostate cancer diagnosis.
> #5. Using the predicted probabilities, compute the net benefit at various
> threshold probabilities.
> #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
> threshold probability is the mean across the 200 replications.
> 
> ================================================================> First
is stata code.
> 
> forvalues i=1(1)200 {
> local event="cancer"
> local predictors1 = "total_psa"
> local predictors2 = "total_psa free_psa"
> local prediction1 = "base"
> local prediction2 = "full"
> 
> g `prediction1'=.
> g `prediction2'=.
> 
> quietly g u = uniform()
> sort `event' u
> g set = mod(_n, 10) + 1
> 
> forvalues j=1(1)10{
> quietly logit `event' `predictors1' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction1' = ptemp if set==`j'
> drop ptemp
> 
> 
> quietly logit `event' `predictors2' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction2' = ptemp if set==`j'
> drop ptemp
> }
> tempfile dca`i'
> quietly dca `event' `prediction1' `prediction2', graphoff
saving("`dca`i''")
> 
> drop u set `prediction1' `prediction2'
> }
> 
> 
> use "`dca1'", clear
> forvalues i=2(1)200 {
> append using "`dca`i''"
> }
> 
> collapse all none modelp1 modelp2, by(threshold)
> 
> save "cross validation dca output.dta", replace
> 
> twoway(line none all modelp1 modelp2 threshold, sort)
> 
> ================================================================> Here
is my draft of R code. cMain is my dataset.
> 
> predca<-rep(0,40000)
> dim(predca)<-c(200,200)
> 
> for (i in 1:200) {
>  cvgroup<-rep(1:10,length=110)
> cvgroup<-sample(cvgroup)
> cvpre<-rep(0,length=110)
> cvMain<-cbind(cMain,cvgroup,cvpre)
> 
> for (j in 1:10) {
> cvdev<-cvMain[cvMain$cvgroup!=j,]
> cvval<-cvMain[cvMain$cvgroup==j,]
>  cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
> cvprej<-predict(cvfit,cvval,type="fitted")
> 
> #put the fitted value in dataset
> cvMain[cvgroup==j,]$cvpre<-prej
> }
> 
> cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
> cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
> cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
> cvnr<-cvnb/cvtpthres
> predca[cvn,1:99]<-cvnb
> predca[cvn,101:199]<-cvnr
> }
> 
> ================================================================> 
> My questions are
> 1. How to ensure equal numbers of events in each set in R?
> 2. A part of stata code is
>                         forvalues j=1(1)10{
> quietly logit `event' `predictors1' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction1' = ptemp if set==`j'
> drop ptemp
> 
> quietly logit `event' `predictors2' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction2' = ptemp if set==`j'
> drop ptemp
> }
>   I don't understand what's the difference between prediction1 and
> prediction2
> 3. Is my code right?
> 
> Thanks !
> 
> Yao Zhu
> Department of Urology
> Fudan University Shanghai Cancer Center
> Shanghai, China
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 

-- 
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University

Steve Lianoglou

2010-Jan-21 15:05 UTC

head link

[R] cross validation function translated from stata

Hi,

On Thu, Jan 21, 2010 at 8:55 AM, zhu yao <mailzhuyao at gmail.com>
wrote:> Hi, everyone:
>
> I ask for help about translating a stata program into R.
>
> The program perform cross validation as it stated.
>
> #1. Randomly divide the data set into 10 sets of equal size, ensuring equal
> numbers of events in each set
> #2. Fit the model leaving out the 1st set
> #3. Apply the fitted model in (2) to the 1st set to obtain the predicted
> probability of a prostate cancer diagnosis.
> #4. Repeat steps (2) to (3) leaving out and then applying the fitted model
> to the ith group, i = 2, 3... 10. Every subject now has a predicted
> probability of a prostate cancer diagnosis.
> #5. Using the predicted probabilities, compute the net benefit at various
> threshold probabilities.
> #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
> threshold probability is the mean across the 200 replications.
>
> ================================================================> First
is stata code.
>
> forvalues i=1(1)200 {
> local event="cancer"
> local predictors1 = "total_psa"
> local predictors2 = "total_psa free_psa"
> local prediction1 = "base"
> local prediction2 = "full"
>
> g `prediction1'=.
> g `prediction2'=.
>
> quietly g u = uniform()
> sort `event' u
> g set = mod(_n, 10) + 1
>
> forvalues j=1(1)10{
> quietly logit `event' `predictors1' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction1' = ptemp if set==`j'
> drop ptemp
>
>
> quietly logit `event' `predictors2' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction2' = ptemp if set==`j'
> drop ptemp
> }
> tempfile dca`i'
> quietly dca `event' `prediction1' `prediction2', graphoff
saving("`dca`i''")
>
> drop u set `prediction1' `prediction2'
> }
>
>
> use "`dca1'", clear
> forvalues i=2(1)200 {
> append using "`dca`i''"
> }
>
> collapse all none modelp1 modelp2, by(threshold)
>
> save "cross validation dca output.dta", replace
>
> twoway(line none all modelp1 modelp2 threshold, sort)
>
> ================================================================> Here
is my draft of R code. cMain is my dataset.
>
> predca<-rep(0,40000)
> dim(predca)<-c(200,200)
>
> for (i in 1:200) {
> ?cvgroup<-rep(1:10,length=110)
> cvgroup<-sample(cvgroup)
> cvpre<-rep(0,length=110)
> cvMain<-cbind(cMain,cvgroup,cvpre)
>
> for (j in 1:10) {
> cvdev<-cvMain[cvMain$cvgroup!=j,]
> cvval<-cvMain[cvMain$cvgroup==j,]
> ?cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
> cvprej<-predict(cvfit,cvval,type="fitted")
>
> #put the fitted value in dataset
> cvMain[cvgroup==j,]$cvpre<-prej
> }
>
> cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
> cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
> cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
> cvnr<-cvnb/cvtpthres
> predca[cvn,1:99]<-cvnb
> predca[cvn,101:199]<-cvnr
> }
>
> ================================================================>
> My questions are
> 1. How to ensure equal numbers of events in each set in R?
I just wanted to point you to the "createFolds" and
"createDataPartition" functions in the caret package ... they try to
do something similar, so perhaps you can see how others have tried to
solve this problem:

http://cran.r-project.org/web/packages/caret/index.html

For example, from their help page:

"""For other data splitting, the random sampling is done within
the
levels of y when y is a factor in an attempt to balance the class
distributions within the splits. For numeric y, the sample is split
into groups sections based on quantiles and sampling is done within
these subgroups. Also, for very small class sizes (<= 3) the classes
may not show up in both the training and test data"""

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

zhu yao

2010-Jan-22 02:03 UTC

head link

[R] cross validation function translated from stata

Thanks Frank and Steve.
I rewrite the R code as follows.

# m is the number of fold to split sample, n is the loop number of cross
validation

library(caret)

calcvnb<-function(formula,dat,m,n)
{

cvnb<-rep(0,20000)
dim(cvnb)<-c(200,100)
 for (i in 1:n)
{
 group<-rep(0,length=110)
sg<-createFolds(dat$LN,k=m)
for (k in 1:m)
{
group[sg[[k]]]<-k
}
 pre<-rep(0,length=110)
data1<-cbind(dat,group,pre)
 for (j in 1:m)
{
dev<-data1[data1$group!=j,]
val<-data1[data1$group==j,]
 fit<-lrm(formula,data=dev,x=T,y=T)
pre1<-predict(fit,val,type="fitted")

data1[group==j,]$pre<-pre1
}

dcaop<-dca(data1$LN,data1$pre,prob=("Y"))
nb<-100*(dcaop[,1]-dcaop[,2])
cvnb[i,1:99]<-nb
}
mcvnb<-colMeans(cvnb)
return(mcvnb)
}

# apply the function in main code

optnb1<-calcvnb(formula=LN~factor(MTSTAGE)+factor(GRADE)+LVINVAS+P53,dat=cMain,m=10,n=200)


However, applied to my data,  a error occurred after several loops "Error
in
contrasts <- '('*tmp*',value="contr.treatment"):
contrasts can be applied only to factors with 2 or more levels".

Whats wrong with my code and how to handle it ?

Yao Zhu
Department of Urology
Fudan University Shanghai Cancer Center
Shanghai, China
Yao Zhu
Department of Urology
Fudan University Shanghai Cancer Center
Shanghai, China


2010/1/21 zhu yao <mailzhuyao@gmail.com>
> Hi, everyone:
>
> I ask for help about translating a stata program into R.
>
> The program perform cross validation as it stated.
>
> #1. Randomly divide the data set into 10 sets of equal size, ensuring equal
> numbers of events in each set
> #2. Fit the model leaving out the 1st set
> #3. Apply the fitted model in (2) to the 1st set to obtain the predicted
> probability of a prostate cancer diagnosis.
> #4. Repeat steps (2) to (3) leaving out and then applying the fitted model
> to the ith group, i = 2, 3... 10. Every subject now has a predicted
> probability of a prostate cancer diagnosis.
> #5. Using the predicted probabilities, compute the net benefit at various
> threshold probabilities.
> #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
> threshold probability is the mean across the 200 replications.
>
> ================================================================> First
is stata code.
>
> forvalues i=1(1)200 {
> local event="cancer"
> local predictors1 = "total_psa"
> local predictors2 = "total_psa free_psa"
> local prediction1 = "base"
> local prediction2 = "full"
>
> g `prediction1'=.
> g `prediction2'=.
>
> quietly g u = uniform()
> sort `event' u
> g set = mod(_n, 10) + 1
>
> forvalues j=1(1)10{
>  quietly logit `event' `predictors1' if set~=`j'
> quietly predict ptemp if set==`j'
>  quietly replace `prediction1' = ptemp if set==`j'
> drop ptemp
>
>
> quietly logit `event' `predictors2' if set~=`j'
> quietly predict ptemp if set==`j'
>  quietly replace `prediction2' = ptemp if set==`j'
> drop ptemp
> }
> tempfile dca`i'
> quietly dca `event' `prediction1' `prediction2', graphoff
> saving("`dca`i''")
>
> drop u set `prediction1' `prediction2'
> }
>
>
> use "`dca1'", clear
> forvalues i=2(1)200 {
>  append using "`dca`i''"
> }
>
> collapse all none modelp1 modelp2, by(threshold)
>
> save "cross validation dca output.dta", replace
>
> twoway(line none all modelp1 modelp2 threshold, sort)
>
> ================================================================> Here
is my draft of R code. cMain is my dataset.
>
> predca<-rep(0,40000)
> dim(predca)<-c(200,200)
>
>  for (i in 1:200) {
>  cvgroup<-rep(1:10,length=110)
> cvgroup<-sample(cvgroup)
>  cvpre<-rep(0,length=110)
> cvMain<-cbind(cMain,cvgroup,cvpre)
>
>  for (j in 1:10) {
> cvdev<-cvMain[cvMain$cvgroup!=j,]
> cvval<-cvMain[cvMain$cvgroup==j,]
>  cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
> cvprej<-predict(cvfit,cvval,type="fitted")
>
> #put the fitted value in dataset
> cvMain[cvgroup==j,]$cvpre<-prej
>  }
>
> cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
>  cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
> cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
>  cvnr<-cvnb/cvtpthres
> predca[cvn,1:99]<-cvnb
> predca[cvn,101:199]<-cvnr
> }
>
> ================================================================>
> My questions are
> 1. How to ensure equal numbers of events in each set in R?
> 2. A part of stata code is
>                         forvalues j=1(1)10{
> quietly logit `event' `predictors1' if set~=`j'
>  quietly predict ptemp if set==`j'
> quietly replace `prediction1' = ptemp if set==`j'
>  drop ptemp
>
> quietly logit `event' `predictors2' if set~=`j'
>  quietly predict ptemp if set==`j'
> quietly replace `prediction2' = ptemp if set==`j'
>  drop ptemp
> }
>   I don't understand what's the difference between prediction1 and
> prediction2
> 3. Is my code right?
>
> Thanks !
>
> Yao Zhu
> Department of Urology
> Fudan University Shanghai Cancer Center
> Shanghai, China
>
	[[alternative HTML version deleted]]

R help - Jan 2010 - cross validation function translated from stata

[R] cross validation function translated from stata

[R] cross validation function translated from stata

[R] cross validation function translated from stata

[R] cross validation function translated from stata

Reasonably Related Threads