I have a dataset of 10000 records that I want to use to compare two prediction models.

I split the records into a test dataset (size = ntest) and a training dataset (size = ntrain). Then I run the two models.

Now I want to shuffle the data and rerun the models. I want many shuffles.

I know that the following command

    sample(1:10000, ntrain)

picks ntrain numbers from 1 to 10000. Then I just use these rows as the training dataset.

But how can I make sure each run of sample() produces different results? I want the output to be unique each time. I tested sample() and found that it usually produces different combinations. But can I control it somehow? Is there a better way to write this?

Thank you,
Vladimir Eremeev
2007-May-10 09:40 UTC
[R] how to control the sampling to make each sample unique
Urania Sun wrote:
> But how can I make sure each run of sample produce different results? I
> want the data output be unique each time.
> [...] can I control it some how? Is there a better way to write this?

You could keep the numbers that have not been picked yet in a vector, pass this vector to sample(), and remove the picked numbers from it on each iteration. Something like the following (not fully tested):

    index <- 1:10000
    # nruns is a placeholder for the number of splits you want. The pool
    # shrinks by ntrain + ntest on every pass, so nruns * (ntrain + ntest)
    # must not exceed 10000.
    for (i in seq_len(nruns)) {
      train.index <- sample(index, ntrain)
      index <- index[!index %in% train.index]   # drop the rows just used for training
      test.index <- sample(index, ntest)
      index <- index[!index %in% test.index]    # drop the rows just used for testing
    }
Rory Martin
2007-May-10 13:09 UTC
[R] how to control the sampling to make each sample unique
I think you're asking a design question about a Monte Carlo simulation. You have a "population" (size 10,000) from which you're defining an empirical distribution, and you're sampling from this to create pairs of training and test samples.

You need to ensure that each specific pair of training and test samples is disjoint, meaning no observations in common. Normally, you wouldn't want to make the different training samples disjoint from one another, if that's what you meant by them being "unique". Or were you using it to mean "not identical"?

Regards
Rory Martin

> From: HelponR <suncertain_at_gmail.com> Date: Wed, 09 May 2007 17:28:19
>
> But how can I make sure each run of sample produce different results? I
> want the data output be unique each time. I tested sample(). and found it
> usually produce different combinations. But can I control it some how? Is
> there a better way to write this?
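The design described in this reply can be sketched in R roughly as follows. This is only an illustration, not code from the thread: the sizes (ntrain = 7000, ntest = 3000), the run count nruns, the fixed seed, and the helper name make_split are all assumptions. Each run permutes the row numbers once and cuts the permutation in two, so training and test rows are disjoint within a run, while separate runs are drawn independently and are allowed to overlap.

```r
# Sketch: repeated random train/test splits (all names and sizes are assumptions).
set.seed(42)        # fixed seed so the same splits can be regenerated later

n      <- 10000
ntrain <- 7000
ntest  <- 3000
nruns  <- 5

make_split <- function(n, ntrain, ntest) {
  stopifnot(ntrain + ntest <= n)
  perm <- sample.int(n)                        # random permutation of 1..n
  list(train = perm[seq_len(ntrain)],          # first ntrain positions
       test  = perm[ntrain + seq_len(ntest)])  # next ntest positions
}

splits <- lapply(seq_len(nruns), function(i) make_split(n, ntrain, ntest))

# Within every run, training and test rows have nothing in common.
disjoint <- vapply(splits,
                   function(s) length(intersect(s$train, s$test)) == 0L,
                   logical(1))
print(all(disjoint))
```

For a data frame called, say, dat (a hypothetical name), the rows for run i would then be dat[splits[[i]]$train, ] and dat[splits[[i]]$test, ].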
I know. But I am curious about how sample() works.

Take a small case: choosing 1 digit from 0 and 1 has only two possible outcomes. It is easy to check that the following can happen on consecutive calls:

    > sample(c(0,1), 1)
    [1] 0
    > sample(c(0,1), 1)
    [1] 0

That means the output does not exhaust all unique combinations before repeating. So I am concerned about how to control this. What I would like to see after the control is:

    > sample(c(0,1), 1)
    [1] 0
    > sample(c(0,1), 1)
    [1] 1
    > sample(c(0,1), 1)
    [1] 0

I don't think that is possible.

Anyway, the only way I can think of to control it is to record every output in a file and check each new output against the earlier ones; if it repeats any of the previous files, do not use it. That is kind of clumsy, because each new combination has to be compared with all previous combinations. First I sort both sequences, then I take their element-wise difference, square it, and sum it. If the result is 0, a repetition has happened.

Thanks all.

On 5/10/07, Rory Martin <rory.martin@comcast.net> wrote:
> sample(1:10000, 4000) returns a =random= sample of 4000
> integers from [1,10000]. It is exceedingly unlikely
> you will generate exactly the same set of 4000 integers.
> And if it did happen, it wouldn't make the slightest
> difference to your results.
>
> Rory
>
> ----- Original Message -----
> From: HelponR <suncertain@gmail.com>
> To: Rory Martin <rory.martin@comcast.net>
> Cc: r-help@stat.math.ethz.ch
> Sent: Thursday, May 10, 2007 4:47 PM
> Subject: Re: [R] how to control the sampling to make each sample unique
>
> Yeah, I want to get all unique combinations of choosing ntest from ntotal.
>
> For example, choosing 4000 training data from 10,000 total data.
>
> Suppose they are sequenced as 1:10,000.
>
> One obvious combination is 1:4000.
>
> Then I run
>
> sample((1:10000), 4000)
>
> It may output 4000 numbers:
>
> 1, 3, 5, ..., 7999
>
> Then I run it again, and it may output another 4000 numbers:
>
> 2, 4, 6, ..., 8000
>
> I know the number of such unique combinations is
>
> Choose 4000 from 10,000
>
> (I forgot how to denote this.)
>
> Anyway, I remember choosing m from n is computed as
>
> T = n! / (m! (n-m)!)
>
> where ! is the factorial.
>
> My concern is: when will the sample output start to repeat?
>
> For example, maybe the next time I run it, the output will be the same
> as the first time:
> 1, 2, 3, ..., 4000
> That's not what I want.
>
> I hope to get T different or unique combinations in T runs. It is fine if
> it starts to repeat after T times.
>
> I know sample() may already work this way. But I am not sure.
>
> Thank you!
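To put rough numbers on the exchange above, here is a hedged sketch in R; it is not code from the thread, and the variable names, the seed, and the toy sizes are assumptions. The first part evaluates T = n!/(m!(n-m)!) for n = 10000 and m = 4000 on the log scale, which gives a number with nearly three thousand digits, so an exact repeat among a realistic number of runs is extremely unlikely. The second part shows a simpler form of the repeat check proposed above: each draw is sorted and turned into a text label, and a draw is rejected if its label has been seen before. It uses a toy case (2 out of 4) so that all combinations can actually be exhausted.

```r
# Part 1: size of T = choose(10000, 4000).
# The value is far too large for double precision, so use the log version.
log10_T <- lchoose(10000, 4000) / log(10)
print(log10_T)      # approximate number of decimal digits in T

# Part 2: reject repeated draws (toy sizes; names are assumptions).
set.seed(1)
n <- 4
m <- 2
total <- choose(n, m)            # 6 possible combinations of 2 out of 4
seen  <- character(0)            # labels of the combinations used so far

while (length(seen) < total) {
  idx <- sample.int(n, m)                     # m distinct values from 1..n
  key <- paste(sort(idx), collapse = ",")     # order-independent label
  if (!(key %in% seen)) {
    seen <- c(seen, key)                      # new combination: keep it
  }                                           # otherwise: draw again
}
print(seen)
```

Comparing labels with %in% replaces the sort, difference, square, and sum comparison against every stored file, and the labels can be kept in memory rather than on disk.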