Eddie Smith
2012-Nov-19 17:16 UTC
[R] How to subset my data and at the same time keep the balance?
Hi guys,

I have a dataset with 1000 rows. For my analysis, I need to take 70% of the data to run the analysis and then use the remaining 30% to test my model.

Could anybody kindly help me with this?

Cheers
Rui Barradas
2012-Nov-19 17:25 UTC
[R] How to subset my data and at the same time keep the balance?
Hello,

See the following example.

x <- matrix(rnorm(2000), ncol = 2)
idx <- sample(nrow(x), 0.7*nrow(x))
x2 <- x[idx, ]
nrow(x2)  # 700
x3 <- x[-idx, ]
nrow(x3)  # 300

Hope this helps,

Rui Barradas
Sarah Goslee
2012-Nov-19 17:26 UTC
[R] How to subset my data and at the same time keep the balance?
I'm not sure what you mean by "balance", but you can use sample() to randomly order the values 1:1000, then use the first 700 as row indices for the first set, and the last 300 as the test set.

Sarah

--
Sarah Goslee
functionaldiversity.org
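The permutation approach described above can be sketched as follows. This is only an illustration: the data frame `dat`, its columns `id` and `y`, and the fixed seed are assumptions made so the example runs on its own; they are not from the original post.

```r
set.seed(123)  # assumption: fixed seed so the split is repeatable

# Hypothetical stand-in for the 1000-row dataset
dat <- data.frame(id = 1:1000, y = rnorm(1000))

shuffled   <- sample(1000)        # random permutation of 1:1000
train_rows <- shuffled[1:700]     # first 700 shuffled indices
test_rows  <- shuffled[701:1000]  # last 300 shuffled indices

train <- dat[train_rows, ]  # 70% used to fit the model
test  <- dat[test_rows, ]   # 30% held back for testing
```

Because both index sets come from one permutation, no row can land in both subsets.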
arun
2012-Nov-19 17:31 UTC
[R] How to subset my data and at the same time keep the balance?
Hi,

Maybe this helps:

dat1 <- read.table(text="
   V1 V2
1   5 10
2   6  3
3   8  4
4   9 20
5  15 30
6  25 40
7   2  4
8   3  1
9   1  5
10  8 10
", header=TRUE)
dat2 <- dat1[sample(NROW(dat1), NROW(dat1)*(1-0.3)), ]  # 70% of data
dat2$newcol <- TRUE
dat1$newcol1 <- TRUE
# note: matching on V1 and V2 assumes the data contain no duplicated rows
dat4 <- merge(dat1, dat2, by=c("V1","V2"), all=TRUE)
dat5 <- dat4[is.na(dat4$newcol), ][, 1:2]  # remaining 30%
dat5
#   V1 V2
#2   2  4
#4   5 10
#8   9 20

A.K.
Eddie Smith
2012-Nov-19 19:07 UTC
[R] How to subset my data and at the same time keep the balance?
Thanks a lot! I got some ideas from all the replies and here is the final one.

newdata

select <- sample(nrow(newdata), nrow(newdata) * .7)
data70 <- newdata[select,]  # select
write.csv(data70, "data70.csv", row.names=FALSE)

data30 <- newdata[-select,]  # testing
write.csv(data30, "data30.csv", row.names=FALSE)

Cheers
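The code above draws a simple random sample, so group proportions in the two files can drift slightly from those in the full data. The thread never defines "balance"; if it was meant as keeping the proportion of some grouping variable the same in both subsets, a stratified split might look like the sketch below. The column name `group`, the simulated `newdata`, and the seed are all hypothetical and stand in for the poster's real data.

```r
set.seed(42)  # assumption: fixed seed so the split is repeatable

# Hypothetical data: 1000 rows with an unbalanced two-level column `group`
newdata <- data.frame(
  value = rnorm(1000),
  group = rep(c("a", "b"), times = c(800, 200))
)

# Row indices of each group, then 70% of each group's rows drawn at random
rows_by_group <- split(seq_len(nrow(newdata)), newdata$group)
select <- unlist(lapply(rows_by_group, function(rows) {
  rows[sample.int(length(rows), round(0.7 * length(rows)))]
}), use.names = FALSE)

data70 <- newdata[select, ]   # 70% of every group
data30 <- newdata[-select, ]  # remaining 30% of every group

table(data70$group)
table(data30$group)
```

Sampling within each group keeps the 80/20 ratio of `group` identical in `data70` and `data30`, which a single call to sample() over all rows only achieves approximately.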
Brian Feeny
2012-Nov-20 02:23 UTC
[R] How to subset my data and at the same time keep the balance?
Just curious: once you have a model that works well, does it make sense to then tune it against 100% of the dataset (with known outcomes) so you can apply it to data you wish to predict for, or is that a bad approach?

I have done what is explained in this thread many times: taken a sample, learned against it, and then tested on the remainder. But this is using data for which we know the predicted variable and can compare to validate. So after you're done, should you re-tune with the entire training set?

As for which method, I am using mostly SVM.

Brian
Jeff Newmiller
2012-Nov-20 05:24 UTC
[R] How to subset my data and at the same time keep the balance?
No.