Hi, everyone: I ask for help about translating a stata program into R. The program perform cross validation as it stated. #1. Randomly divide the data set into 10 sets of equal size, ensuring equal numbers of events in each set #2. Fit the model leaving out the 1st set #3. Apply the fitted model in (2) to the 1st set to obtain the predicted probability of a prostate cancer diagnosis. #4. Repeat steps (2) to (3) leaving out and then applying the fitted model to the ith group, i = 2, 3... 10. Every subject now has a predicted probability of a prostate cancer diagnosis. #5. Using the predicted probabilities, compute the net benefit at various threshold probabilities. #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each threshold probability is the mean across the 200 replications. ================================================================First is stata code. forvalues i=1(1)200 { local event="cancer" local predictors1 = "total_psa" local predictors2 = "total_psa free_psa" local prediction1 = "base" local prediction2 = "full" g `prediction1'=. g `prediction2'=. quietly g u = uniform() sort `event' u g set = mod(_n, 10) + 1 forvalues j=1(1)10{ quietly logit `event' `predictors1' if set~=`j' quietly predict ptemp if set==`j' quietly replace `prediction1' = ptemp if set==`j' drop ptemp quietly logit `event' `predictors2' if set~=`j' quietly predict ptemp if set==`j' quietly replace `prediction2' = ptemp if set==`j' drop ptemp } tempfile dca`i' quietly dca `event' `prediction1' `prediction2', graphoff saving("`dca`i''") drop u set `prediction1' `prediction2' } use "`dca1'", clear forvalues i=2(1)200 { append using "`dca`i''" } collapse all none modelp1 modelp2, by(threshold) save "cross validation dca output.dta", replace twoway(line none all modelp1 modelp2 threshold, sort) ================================================================Here is my draft of R code. cMain is my dataset. predca<-rep(0,40000) dim(predca)<-c(200,200) for (i in 1:200) { cvgroup<-rep(1:10,length=110) cvgroup<-sample(cvgroup) cvpre<-rep(0,length=110) cvMain<-cbind(cMain,cvgroup,cvpre) for (j in 1:10) { cvdev<-cvMain[cvMain$cvgroup!=j,] cvval<-cvMain[cvMain$cvgroup==j,] cvfit<-lrm(Y~X,data=cvdev,x=T,y=T) cvprej<-predict(cvfit,cvval,type="fitted") #put the fitted value in dataset cvMain[cvgroup==j,]$cvpre<-prej } cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y")) cvnb<-100*(cvdcaop[,1]-cvdcaop[,2]) cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4]) cvnr<-cvnb/cvtpthres predca[cvn,1:99]<-cvnb predca[cvn,101:199]<-cvnr } ================================================================ My questions are 1. How to ensure equal numbers of events in each set in R? 2. A part of stata code is forvalues j=1(1)10{ quietly logit `event' `predictors1' if set~=`j' quietly predict ptemp if set==`j' quietly replace `prediction1' = ptemp if set==`j' drop ptemp quietly logit `event' `predictors2' if set~=`j' quietly predict ptemp if set==`j' quietly replace `prediction2' = ptemp if set==`j' drop ptemp } I don't understand what's the difference between prediction1 and prediction2 3. Is my code right? Thanks ! Yao Zhu Department of Urology Fudan University Shanghai Cancer Center Shanghai, China [[alternative HTML version deleted]]
Frank E Harrell Jr
2010-Jan-21 14:33 UTC
[R] cross validation function translated from stata
Take a look at the validate.lrm function in the rms package. Note that the use of threshold probabilities results in an improper scoring rule which will mislead you. Also note that you need to repeat 10-fold CV 50-100 times for precision, and that at each repeat you have to start from zero in analyzing associations. Frank zhu yao wrote:> Hi, everyone: > > I ask for help about translating a stata program into R. > > The program perform cross validation as it stated. > > #1. Randomly divide the data set into 10 sets of equal size, ensuring equal > numbers of events in each set > #2. Fit the model leaving out the 1st set > #3. Apply the fitted model in (2) to the 1st set to obtain the predicted > probability of a prostate cancer diagnosis. > #4. Repeat steps (2) to (3) leaving out and then applying the fitted model > to the ith group, i = 2, 3... 10. Every subject now has a predicted > probability of a prostate cancer diagnosis. > #5. Using the predicted probabilities, compute the net benefit at various > threshold probabilities. > #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each > threshold probability is the mean across the 200 replications. > > ================================================================> First is stata code. > > forvalues i=1(1)200 { > local event="cancer" > local predictors1 = "total_psa" > local predictors2 = "total_psa free_psa" > local prediction1 = "base" > local prediction2 = "full" > > g `prediction1'=. > g `prediction2'=. > > quietly g u = uniform() > sort `event' u > g set = mod(_n, 10) + 1 > > forvalues j=1(1)10{ > quietly logit `event' `predictors1' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction1' = ptemp if set==`j' > drop ptemp > > > quietly logit `event' `predictors2' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction2' = ptemp if set==`j' > drop ptemp > } > tempfile dca`i' > quietly dca `event' `prediction1' `prediction2', graphoff saving("`dca`i''") > > drop u set `prediction1' `prediction2' > } > > > use "`dca1'", clear > forvalues i=2(1)200 { > append using "`dca`i''" > } > > collapse all none modelp1 modelp2, by(threshold) > > save "cross validation dca output.dta", replace > > twoway(line none all modelp1 modelp2 threshold, sort) > > ================================================================> Here is my draft of R code. cMain is my dataset. > > predca<-rep(0,40000) > dim(predca)<-c(200,200) > > for (i in 1:200) { > cvgroup<-rep(1:10,length=110) > cvgroup<-sample(cvgroup) > cvpre<-rep(0,length=110) > cvMain<-cbind(cMain,cvgroup,cvpre) > > for (j in 1:10) { > cvdev<-cvMain[cvMain$cvgroup!=j,] > cvval<-cvMain[cvMain$cvgroup==j,] > cvfit<-lrm(Y~X,data=cvdev,x=T,y=T) > cvprej<-predict(cvfit,cvval,type="fitted") > > #put the fitted value in dataset > cvMain[cvgroup==j,]$cvpre<-prej > } > > cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y")) > cvnb<-100*(cvdcaop[,1]-cvdcaop[,2]) > cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4]) > cvnr<-cvnb/cvtpthres > predca[cvn,1:99]<-cvnb > predca[cvn,101:199]<-cvnr > } > > ================================================================> > My questions are > 1. How to ensure equal numbers of events in each set in R? > 2. A part of stata code is > forvalues j=1(1)10{ > quietly logit `event' `predictors1' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction1' = ptemp if set==`j' > drop ptemp > > quietly logit `event' `predictors2' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction2' = ptemp if set==`j' > drop ptemp > } > I don't understand what's the difference between prediction1 and > prediction2 > 3. Is my code right? > > Thanks ! > > Yao Zhu > Department of Urology > Fudan University Shanghai Cancer Center > Shanghai, China > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Frank E Harrell Jr Professor and Chairman School of Medicine Department of Biostatistics Vanderbilt University
Hi, On Thu, Jan 21, 2010 at 8:55 AM, zhu yao <mailzhuyao at gmail.com> wrote:> Hi, everyone: > > I ask for help about translating a stata program into R. > > The program perform cross validation as it stated. > > #1. Randomly divide the data set into 10 sets of equal size, ensuring equal > numbers of events in each set > #2. Fit the model leaving out the 1st set > #3. Apply the fitted model in (2) to the 1st set to obtain the predicted > probability of a prostate cancer diagnosis. > #4. Repeat steps (2) to (3) leaving out and then applying the fitted model > to the ith group, i = 2, 3... 10. Every subject now has a predicted > probability of a prostate cancer diagnosis. > #5. Using the predicted probabilities, compute the net benefit at various > threshold probabilities. > #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each > threshold probability is the mean across the 200 replications. > > ================================================================> First is stata code. > > forvalues i=1(1)200 { > local event="cancer" > local predictors1 = "total_psa" > local predictors2 = "total_psa free_psa" > local prediction1 = "base" > local prediction2 = "full" > > g `prediction1'=. > g `prediction2'=. > > quietly g u = uniform() > sort `event' u > g set = mod(_n, 10) + 1 > > forvalues j=1(1)10{ > quietly logit `event' `predictors1' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction1' = ptemp if set==`j' > drop ptemp > > > quietly logit `event' `predictors2' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction2' = ptemp if set==`j' > drop ptemp > } > tempfile dca`i' > quietly dca `event' `prediction1' `prediction2', graphoff saving("`dca`i''") > > drop u set `prediction1' `prediction2' > } > > > use "`dca1'", clear > forvalues i=2(1)200 { > append using "`dca`i''" > } > > collapse all none modelp1 modelp2, by(threshold) > > save "cross validation dca output.dta", replace > > twoway(line none all modelp1 modelp2 threshold, sort) > > ================================================================> Here is my draft of R code. cMain is my dataset. > > predca<-rep(0,40000) > dim(predca)<-c(200,200) > > for (i in 1:200) { > ?cvgroup<-rep(1:10,length=110) > cvgroup<-sample(cvgroup) > cvpre<-rep(0,length=110) > cvMain<-cbind(cMain,cvgroup,cvpre) > > for (j in 1:10) { > cvdev<-cvMain[cvMain$cvgroup!=j,] > cvval<-cvMain[cvMain$cvgroup==j,] > ?cvfit<-lrm(Y~X,data=cvdev,x=T,y=T) > cvprej<-predict(cvfit,cvval,type="fitted") > > #put the fitted value in dataset > cvMain[cvgroup==j,]$cvpre<-prej > } > > cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y")) > cvnb<-100*(cvdcaop[,1]-cvdcaop[,2]) > cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4]) > cvnr<-cvnb/cvtpthres > predca[cvn,1:99]<-cvnb > predca[cvn,101:199]<-cvnr > } > > ================================================================> > My questions are > 1. How to ensure equal numbers of events in each set in R?I just wanted to point you to the "createFolds" and "createDataPartition" functions in the caret package ... they try to do something similar, so perhaps you can see how others have tried to solve this problem: http://cran.r-project.org/web/packages/caret/index.html For example, from their help page: """For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits. For numeric y, the sample is split into groups sections based on quantiles and sampling is done within these subgroups. Also, for very small class sizes (<= 3) the classes may not show up in both the training and test data""" -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Thanks Frank and Steve. I rewrite the R code as follows. # m is the number of fold to split sample, n is the loop number of cross validation library(caret) calcvnb<-function(formula,dat,m,n) { cvnb<-rep(0,20000) dim(cvnb)<-c(200,100) for (i in 1:n) { group<-rep(0,length=110) sg<-createFolds(dat$LN,k=m) for (k in 1:m) { group[sg[[k]]]<-k } pre<-rep(0,length=110) data1<-cbind(dat,group,pre) for (j in 1:m) { dev<-data1[data1$group!=j,] val<-data1[data1$group==j,] fit<-lrm(formula,data=dev,x=T,y=T) pre1<-predict(fit,val,type="fitted") data1[group==j,]$pre<-pre1 } dcaop<-dca(data1$LN,data1$pre,prob=("Y")) nb<-100*(dcaop[,1]-dcaop[,2]) cvnb[i,1:99]<-nb } mcvnb<-colMeans(cvnb) return(mcvnb) } # apply the function in main code optnb1<-calcvnb(formula=LN~factor(MTSTAGE)+factor(GRADE)+LVINVAS+P53,dat=cMain,m=10,n=200) However, applied to my data, a error occurred after several loops "Error in contrasts <- '('*tmp*',value="contr.treatment"): contrasts can be applied only to factors with 2 or more levels". Whats wrong with my code and how to handle it ? Yao Zhu Department of Urology Fudan University Shanghai Cancer Center Shanghai, China Yao Zhu Department of Urology Fudan University Shanghai Cancer Center Shanghai, China 2010/1/21 zhu yao <mailzhuyao@gmail.com>> Hi, everyone: > > I ask for help about translating a stata program into R. > > The program perform cross validation as it stated. > > #1. Randomly divide the data set into 10 sets of equal size, ensuring equal > numbers of events in each set > #2. Fit the model leaving out the 1st set > #3. Apply the fitted model in (2) to the 1st set to obtain the predicted > probability of a prostate cancer diagnosis. > #4. Repeat steps (2) to (3) leaving out and then applying the fitted model > to the ith group, i = 2, 3... 10. Every subject now has a predicted > probability of a prostate cancer diagnosis. > #5. Using the predicted probabilities, compute the net benefit at various > threshold probabilities. > #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each > threshold probability is the mean across the 200 replications. > > ================================================================> First is stata code. > > forvalues i=1(1)200 { > local event="cancer" > local predictors1 = "total_psa" > local predictors2 = "total_psa free_psa" > local prediction1 = "base" > local prediction2 = "full" > > g `prediction1'=. > g `prediction2'=. > > quietly g u = uniform() > sort `event' u > g set = mod(_n, 10) + 1 > > forvalues j=1(1)10{ > quietly logit `event' `predictors1' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction1' = ptemp if set==`j' > drop ptemp > > > quietly logit `event' `predictors2' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction2' = ptemp if set==`j' > drop ptemp > } > tempfile dca`i' > quietly dca `event' `prediction1' `prediction2', graphoff > saving("`dca`i''") > > drop u set `prediction1' `prediction2' > } > > > use "`dca1'", clear > forvalues i=2(1)200 { > append using "`dca`i''" > } > > collapse all none modelp1 modelp2, by(threshold) > > save "cross validation dca output.dta", replace > > twoway(line none all modelp1 modelp2 threshold, sort) > > ================================================================> Here is my draft of R code. cMain is my dataset. > > predca<-rep(0,40000) > dim(predca)<-c(200,200) > > for (i in 1:200) { > cvgroup<-rep(1:10,length=110) > cvgroup<-sample(cvgroup) > cvpre<-rep(0,length=110) > cvMain<-cbind(cMain,cvgroup,cvpre) > > for (j in 1:10) { > cvdev<-cvMain[cvMain$cvgroup!=j,] > cvval<-cvMain[cvMain$cvgroup==j,] > cvfit<-lrm(Y~X,data=cvdev,x=T,y=T) > cvprej<-predict(cvfit,cvval,type="fitted") > > #put the fitted value in dataset > cvMain[cvgroup==j,]$cvpre<-prej > } > > cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y")) > cvnb<-100*(cvdcaop[,1]-cvdcaop[,2]) > cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4]) > cvnr<-cvnb/cvtpthres > predca[cvn,1:99]<-cvnb > predca[cvn,101:199]<-cvnr > } > > ================================================================> > My questions are > 1. How to ensure equal numbers of events in each set in R? > 2. A part of stata code is > forvalues j=1(1)10{ > quietly logit `event' `predictors1' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction1' = ptemp if set==`j' > drop ptemp > > quietly logit `event' `predictors2' if set~=`j' > quietly predict ptemp if set==`j' > quietly replace `prediction2' = ptemp if set==`j' > drop ptemp > } > I don't understand what's the difference between prediction1 and > prediction2 > 3. Is my code right? > > Thanks ! > > Yao Zhu > Department of Urology > Fudan University Shanghai Cancer Center > Shanghai, China >[[alternative HTML version deleted]]