Hi People,

Does anyone have a good solution for this problem? I have a database called DB, which I split like this:

    index <- sample(1:nrow(DB), size=0.2*nrow(DB))
    test <- DB[index,]
    train <- DB[-index,]

One of the variables in this database is a target variable with two values, 0 and 1.

Now imagine that I want to constrain the test data frame so that it is still 20% of the size of DB, but half of its rows have DB$target=1 and half have DB$target=0.

For example, with n=100:

    DB$target = { 0=80, 1=20 }

test has 20 rows and contains 10 random rows with DB$target=1 and 10 random rows with DB$target=0.

Many Thanks,
Eliano

--
View this message in context: http://r.789695.n4.nabble.com/Sampling-with-Constraints-for-testing-and-training-data-tp4325530p4325530.html
Sent from the R help mailing list archive at Nabble.com.
Petr Savicky
2012-Jan-25 15:17 UTC
[R] Sampling with Constraints for testing and training data
On Wed, Jan 25, 2012 at 04:00:27AM -0800, Eliano wrote:
> Hi People,
>
> Does anyone have a good solution for this problem:
>
> a database called DB.
>
> index <- sample(1:nrow(DB), size=0.2*nrow(DB))
> test <- DB[index,]
> train <- DB[-index,]
>
> One of the variables in this database contains a target variable with two
> values 0 and 1.
>
> Imagine now that I want to constrain the test data frame so the 20% of the
> size of "test" has 50% of DB$target.
>
> Imagine: n=100
> DB$target = { 0=80
> 1=20}
>
> test=20 and contain 10 random values of DB$target=1 and 10 random values of
> DB$target=0.

Hi.

One way is as follows.

    t0 <- which(DB$target==0)
    t1 <- which(DB$target==1)
    m <- round(0.1*nrow(DB))
    stopifnot(length(t0) >= m & length(t1) >= m)
    index <- c(sample(t0, size=m), sample(t1, size=m))

Hope this helps.

Petr Savicky.
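For anyone who wants to try the suggestion end to end, here is a minimal sketch on toy data. The data frame contents, the column x, and the seed are assumptions made up for illustration; only the 80/20 split of DB$target comes from the example in the question. Since m is 10% of the rows, the two samples together give the 20% test set, with half of it from each class.

```r
# Toy data (assumed for illustration): 100 rows, 80 zeros and 20 ones,
# matching the n=100 example in the question.
set.seed(1)  # arbitrary seed, only so the run is reproducible
DB <- data.frame(x = rnorm(100), target = rep(c(0, 1), times = c(80, 20)))

# Row numbers of each class.
t0 <- which(DB$target == 0)
t1 <- which(DB$target == 1)

# Half of the 20% test set comes from each class, i.e. 10% of all rows.
m <- round(0.1 * nrow(DB))

# Both classes must have enough rows to sample m without replacement.
stopifnot(length(t0) >= m, length(t1) >= m)

index <- c(sample(t0, size = m), sample(t1, size = m))
test <- DB[index, ]
train <- DB[-index, ]

table(test$target)   # 10 rows with target 0, 10 rows with target 1
table(train$target)  # 70 rows with target 0, 10 rows with target 1
```

Note that train is no longer representative of DB either: taking 10 of the 20 rows with target 1 out for testing leaves train with 70 zeros and only 10 ones.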