Aimin Yan
2006-Dec-25 17:35 UTC
[R] Problem to generate training data set and test data set
I have a full data set like this:

   aa bas    aas bms   ams bcu        acu     omega       y
1 ALA   0 127.71   0 69.99   0 -0.2498560  79.91470 outward
2 PRO   0  68.55   0 55.44   0 -0.0949008  76.60380 outward
3 ALA   0  52.72   0 47.82   0 -0.0396550  52.19970 outward
4 PHE   0  22.62   0 31.21   0  0.1270330 169.52500  inward
5 SER   0  71.32   0 52.84   0 -0.1312380   7.47528 outward
6 VAL   0  12.92   0 22.40   0  0.1728390 149.09400  inward
......................................................................

aa has 19 levels, and there is a different number of observations for each level.
I want to randomly pick 75% of the observations in each level to generate a training set,
and 25% of the observations in each level to generate a testing set.

Does anyone know how to do this?

Thanks

Aimin Yan
Jim Lemon
2006-Dec-26 00:16 UTC
[R] Problem to generate training data set and test data set
Aimin Yan wrote:
> I have a full data set like this:
>
>    aa bas    aas bms   ams bcu        acu     omega       y
> 1 ALA   0 127.71   0 69.99   0 -0.2498560  79.91470 outward
> 2 PRO   0  68.55   0 55.44   0 -0.0949008  76.60380 outward
> 3 ALA   0  52.72   0 47.82   0 -0.0396550  52.19970 outward
> 4 PHE   0  22.62   0 31.21   0  0.1270330 169.52500  inward
> 5 SER   0  71.32   0 52.84   0 -0.1312380   7.47528 outward
> 6 VAL   0  12.92   0 22.40   0  0.1728390 149.09400  inward
> ......................................................................
>
> aa have 19 levels, and there are different number of observation for each
> levels.
> I want to pick 75% of observations of each levels randomly to generate a
> training set,
> and 25% of observation of each levels to generate a testing set.
>

Hi Aimin,
I haven't tested this exhaustively, but I think it does what you want.

get.prob.sample<-function(x,prob=0.5) {
 # returns a logical vector flagging a fraction 'prob' of each level of x
 xlevels<-levels(as.factor(x))
 xsamp<-rep(FALSE,length(x))
 for(i in xlevels) {
  # positions of the observations that belong to level i
  idx<-which(x == i)
  # index into idx so that a level with a single observation is sampled correctly
  xsamp[idx[sample.int(length(idx),floor(length(idx)*prob))]]<-TRUE
 }
 return(xsamp)
}

# TRUE marks rows for the training set, FALSE marks rows for the test set
get.prob.sample(mydata$aa,0.75)

Jim
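The logical vector returned by get.prob.sample() can be used directly to index the rows of the data frame. The following is only a sketch of that step, not something posted in the thread: the toy mydata below is made up for illustration (only the column names aa and omega come from the original post), and the function is a compact restatement of the one in the message, with the per-level count rounded down.

```r
# Toy stand-in for the poster's data; the values are made up for illustration.
set.seed(1)
mydata <- data.frame(
  aa    = rep(c("ALA", "PRO", "PHE"), times = c(8, 4, 12)),
  omega = runif(24, 0, 180)
)

# Flag a fraction 'prob' of the positions within each level of x.
get.prob.sample <- function(x, prob = 0.5) {
  xsamp <- rep(FALSE, length(x))
  for (i in levels(as.factor(x))) {
    idx <- which(x == i)
    n   <- floor(length(idx) * prob)
    # Index into idx so that a level with a single row is handled correctly.
    xsamp[idx[sample.int(length(idx), n)]] <- TRUE
  }
  xsamp
}

in.train <- get.prob.sample(mydata$aa, 0.75)
train <- mydata[in.train, ]    # rows flagged TRUE
test  <- mydata[!in.train, ]   # all remaining rows

# Per-level counts in each set
table(train$aa)
table(test$aa)
```

Because the test set is taken as the complement of the training set, every row ends up in exactly one of the two. When a level's size is not a multiple of 4, rounding down sends the leftover rows to the test set.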
Charles C. Berry
2006-Dec-26 17:43 UTC
[R] Problem to generate training data set and test data set
What you describe is called stratified sampling. It was discussed last month (and other times) on this list:

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/90220.html

Using RSiteSearch("stratified sampling") will produce many hits to relevant articles and packages.

On Mon, 25 Dec 2006, Aimin Yan wrote:
> I have a full data set like this:
>
>    aa bas    aas bms   ams bcu        acu     omega       y
> 1 ALA   0 127.71   0 69.99   0 -0.2498560  79.91470 outward
> 2 PRO   0  68.55   0 55.44   0 -0.0949008  76.60380 outward
> 3 ALA   0  52.72   0 47.82   0 -0.0396550  52.19970 outward
> 4 PHE   0  22.62   0 31.21   0  0.1270330 169.52500  inward
> 5 SER   0  71.32   0 52.84   0 -0.1312380   7.47528 outward
> 6 VAL   0  12.92   0 22.40   0  0.1728390 149.09400  inward
> ......................................................................
>
> aa have 19 levels, and there are different number of observation for each
> levels.
> I want to pick 75% of observations of each levels randomly to generate a
> training set,
> and 25% of observation of each levels to generate a testing set.
>
> Does anyone know to do this?
>
> Thanks
>
> Aimin Yan
>

Charles C. Berry                        (858) 534-2098
                                        Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu        UC San Diego
http://biostat.ucsd.edu/~cberry/        La Jolla, San Diego 92093-0717
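For readers who want to see what stratified sampling looks like in base R without a loop, here is one possible sketch using split() and lapply(). It is not taken from the linked archive post; the helper name stratified.index and the toy strata vector are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: returns the positions chosen for the training set,
# drawing floor(frac * n) positions without replacement from each stratum.
stratified.index <- function(strata, frac = 0.75) {
  by.level <- split(seq_along(strata), strata)   # positions grouped by level
  picked <- lapply(by.level, function(idx) {
    idx[sample.int(length(idx), floor(length(idx) * frac))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
aa <- rep(c("ALA", "PRO", "SER"), times = c(10, 5, 7))   # made-up strata
train.idx <- stratified.index(aa, 0.75)
test.idx  <- setdiff(seq_along(aa), train.idx)

table(aa[train.idx])   # training rows per level
table(aa[test.idx])    # test rows per level
```

Returning row positions rather than a logical mask makes it easy to store the split and reuse it, for example as mydata[train.idx, ] and mydata[test.idx, ] for a data frame like the one in the original question.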