Els Verfaillie
2009-Feb-05 10:45 UTC
[R] split dataset randomly in prediction and validation set
For a geostatistical analysis, I would like to split my dataset randomly into 2 parts: a prediction set (with 2/3 of my data) and a validation set (with 1/3 of my data). Both datasets will thus contain different data. Any suggestions?
jim holtman
2009-Feb-05 14:10 UTC
[R] split dataset randomly in prediction and validation set
?sample

> x <- 1:100  # test data
> y <- split(x, sample(1:2, length(x), replace=TRUE, prob=c(1,2)))
> y
$`1`
 [1]  4  6  7 13 15 17 18 20 21 29 35 36 37 39 41 43 46 49 50 52 61 68 70 72 76 77 79 80 82 85 87 94 95 96 99

$`2`
 [1]   1   2   3   5   8   9  10  11  12  14  16  19  22  23  24  25  26  27  28  30  31  32  33  34  38  40  42
[28]  44  45  47  48  51  53  54  55  56  57  58  59  60  62  63  64  65  66  67  69  71  73  74  75  78  81  83
[55]  84  86  88  89  90  91  92  93  97  98 100

--
Jim Holtman
Cincinnati, OH
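Note that the sample() call above assigns each element to group 1 with probability 1/3 and to group 2 with probability 2/3, so the group sizes are only approximately 1/3 and 2/3 (35 and 65 in the output shown). If exact sizes are wanted, one possible variant is to draw row indices without replacement. This is a minimal sketch, not part of the original reply; the data frame `dat` and its columns are made up for illustration, and `round(2 * n / 3)` is just one way to fix the prediction-set size.

```r
# Hypothetical example data; replace `dat` with your own data frame
set.seed(42)                      # only so the split is reproducible
dat <- data.frame(x = runif(90), y = runif(90), z = rnorm(90))

n      <- nrow(dat)
n_pred <- round(2 * n / 3)        # size of the prediction set (2/3 of the rows)
idx    <- sample(n, n_pred)       # row indices, drawn without replacement

prediction <- dat[idx, ]          # rows selected for prediction
validation <- dat[-idx, ]         # all remaining rows (1/3), no overlap
```

Because `validation` is built from the rows not in `idx`, the two sets cannot share a row.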
Dieter Menne
2009-Feb-05 16:09 UTC
[R] split dataset randomly in prediction and validation set
Els Verfaillie <els.verfaillie <at> ugent.be> writes:

> For a geostatistical analysis, I would like to split my dataset randomly
> into 2 parts: a prediction set (with 2/3 of my data) and a validation set
> (with 1/3 of my data). Both datasets will thus contain different data. Any
> suggestions?

Normally, you would not do this just once, but round-robin, so that each part serves as the validation set in turn. There are a few packages around that help with this (search for cross-validation), but in most cases doing it by hand can be easier to understand 4 years later.

Dieter

# randomize your data; may not be required
set.seed(4711)
df = data.frame(x=rnorm(100), y=rnorm(100))
df = df[sample(nrow(df)),]  # shuffle rows; df must exist before nrow(df) is used
ncrossval = 3
# length.out also covers data lengths not evenly divisible by ncrossval
df$group = rep(1:ncrossval, length.out=nrow(df))
for (group in 1:ncrossval) {
  small = df[df$group==group,]  # validation part (about 1/3 of the rows)
  big = df[df$group!=group,]    # prediction part (about 2/3 of the rows)
  # do your work with small and big
  str(small)
  str(big)
}
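To make the "do your work with small and big" step concrete, here is a rough sketch of what the loop body might look like. It is an assumption for illustration only: the simulated data, the linear model `lm(y ~ x)` and the RMSE measure are stand-ins, and a real geostatistical analysis would fit its own model on `big` and predict at the locations in `small` instead.

```r
set.seed(4711)
# Made-up data with a known linear relationship, for illustration only
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

dat <- dat[sample(nrow(dat)), ]                 # shuffle the rows
k <- 3                                          # number of round-robin groups
dat$group <- rep(1:k, length.out = nrow(dat))   # labels 1, 2, 3, 1, 2, 3, ...

rmse <- numeric(k)
for (g in 1:k) {
  small <- dat[dat$group == g, ]                # validation part
  big   <- dat[dat$group != g, ]                # prediction (fitting) part
  fit   <- lm(y ~ x, data = big)                # stand-in model, fitted on big only
  pred  <- predict(fit, newdata = small)        # predict the held-out rows
  rmse[g] <- sqrt(mean((small$y - pred)^2))     # validation error for this round
}
rmse        # one error value per round
mean(rmse)  # averaged over the k rounds
```

Each row is used for validation exactly once, which is the main advantage over a single 2/3 versus 1/3 split.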