thr3ads.net - R help - [R] how to control the sampling to make each sample unique [May 2007]

HelponR

2007-May-10 00:28 UTC

[R] how to control the sampling to make each sample unique

I have a dataset of 10000 records that I want to use to compare two
prediction models.

I split the records into a test dataset (size = ntest) and a training dataset
(size = ntrain). Then I run the two models.

Now I want to shuffle the data and rerun the models. I want many shuffles.

I know that the following command

sample(1:10000, ntrain)

picks ntrain numbers from 1 to 10000. Then I just use these rows as the
training dataset.

But how can I make sure each run of sample() produces a different result? I
want the output to be unique each time.
I tested sample() and found that it usually produces different combinations. But
can I control it somehow? Is there a better way to write this?

Thank you,


Vladimir Eremeev

2007-May-10 09:40 UTC

Urania Sun wrote:
>
> I have a dataset of 10000 records which I want to use to compare two
> prediction models.
> 
> I split the records into test dataset (size = ntest) and training dataset
> (size = ntrain). Then I run the two models.
> 
> Now I want to shuffle the data and rerun the models. I want many shuffles.
> 
> I know that the following command
> 
> sample ((1:10000), ntrain)
> 
> can pick ntrain numbers from 1 to 10000. Then I just use these rows as the
> training dataset.
> 
> But how can I make sure each run of sample  produce different results? I
> want the data output be unique each time.
> I tested sample(). and found it usually produce different combinations. But
> can I control it some how? Is there a better way to write this?
> 
> Thank you,
> 
> 
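You could keep the numbers not yet picked in a vector, sample from that
vector, and remove the picked numbers from it on each iteration.

Something like the following (not fully tested):

ntrain <- 400    # example sizes only; set these to your own values
ntest  <- 100
nruns  <- 10     # needs nruns * (ntrain + ntest) <= length(index)
index  <- 1:10000

for (i in seq_len(nruns)) {
  train.index <- sample(index, ntrain)
  index <- index[!index %in% train.index]   # drop rows already used
  test.index <- sample(index, ntest)
  index <- index[!index %in% test.index]
  # fit and evaluate the two models on train.index / test.index here
}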


Rory Martin

2007-May-10 13:09 UTC

I think you're asking a design question about a Monte Carlo simulation.  You
have a "population" (size 10,000) from which you're defining an empirical
distribution, and you're sampling from this to create pairs of training and
test samples.

You need to ensure that each specific pair of training and test samples is
disjoint, meaning no observations in common.  Normally, you wouldn't want to
make the different training samples disjoint, if that's what you meant by
them being "unique".  Or were you using it to mean "identical"?

Regards
Rory Martin
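
A minimal sketch of the design described above: each run draws one sample without replacement and cuts it into a training part and a test part, so every pair is disjoint by construction, while different runs are free to overlap. The sizes, the number of runs and the seed are assumptions chosen only for illustration; set.seed() is what makes the whole sequence of shuffles reproducible.

```r
set.seed(42)      # assumed seed; any fixed value makes the runs repeatable
n      <- 10000   # population size from the thread
ntrain <- 4000    # assumed training size
ntest  <- 2000    # assumed test size
nruns  <- 5       # assumed number of shuffles

splits <- lapply(seq_len(nruns), function(i) {
  # one draw without replacement, then cut it in two:
  # the two parts cannot share a row
  picked <- sample(n, ntrain + ntest)
  list(train = picked[seq_len(ntrain)],
       test  = picked[ntrain + seq_len(ntest)])
})

# number of rows shared by train and test in each run
overlap <- sapply(splits, function(s) length(intersect(s$train, s$test)))
```

The two models would then be fitted on splits[[i]]$train and scored on splits[[i]]$test for each i.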

> From: HelponR <suncertain_at_gmail.com> Date: Wed, 09 May 2007 17:28:19
>
> I have a dataset of 10000 records which I want to use to compare two
> prediction models.
>
> I split the records into test dataset (size = ntest) and training dataset
> (size = ntrain). Then I run the two models.
>
> Now I want to shuffle the data and rerun the models. I want many shuffles.
>
> I know that the following command
>
> sample ((1:10000), ntrain)
>
> can pick ntrain numbers from 1 to 10000. Then I just use these rows as the
> training dataset.
>
> But how can I make sure each run of sample produce different results? I
> want the data output be unique each time. I tested sample(). and found it
> usually produce different combinations. But can I control it some how? Is
> there a better way to write this?

HelponR

2007-May-10 21:06 UTC

I know. But I am curious about how sample() works.

Take a small case: choose 1 value from 0 and 1.
There are only two possible outcomes. It is easy to check that the following
can happen on consecutive calls:
> sample(c(0,1), 1)
[1] 0
> sample(c(0,1), 1)
[1] 0

That means sample() does not exhaust all the unique combinations before
repeating one.

So I am wondering how to control this. What I would like to see with such
control is:
> sample(c(0,1), 1)
[1] 0
> sample(c(0,1), 1)
[1] 1
> sample(c(0,1), 1)
[1] 0

I don't think that is possible. The only way I can think of to control it is
to record every output in a file and check each new output against them; if it
repeats any previous output, do not use it.
That is kind of clumsy: each new combination has to be compared with all
previous combinations.

First I sort the sequences, then I take their difference, square it, and
sum it. If the result is 0, a repetition has happened.
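
The bookkeeping described above can be sketched in memory rather than in files. This is only an illustration, not a recommendation from the thread: each draw is sorted and collapsed into a text key, and a draw is repeated if its key has been seen before. The small sizes, the seed and the variable names are assumptions.

```r
set.seed(1)      # assumed seed, for reproducibility only
n      <- 10     # small illustrative population
ntrain <- 3
nruns  <- 20     # must not exceed choose(n, ntrain), which is 120 here

seen   <- character(0)          # keys of the combinations used so far
splits <- vector("list", nruns)

for (i in seq_len(nruns)) {
  repeat {
    train <- sort(sample(n, ntrain))      # sorting ignores draw order
    key   <- paste(train, collapse = ",")
    if (!key %in% seen) break             # redraw when already used
  }
  seen        <- c(seen, key)
  splits[[i]] <- train
}

n_repeats <- anyDuplicated(seen)   # 0 means no combination was reused
```

Comparing text keys plays the same role as the sort, difference, square and sum check, and identical(sort(a), sort(b)) would do the same for a single pair.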


Thanks all.
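
For a sense of scale on the "choose 4000 from 10,000" count discussed in the quoted messages below, here is a rough sketch. The count itself overflows double precision, so it is handled on the log scale with lchoose(). The number of shuffles k is a hypothetical value, and the repeat probability uses the usual birthday-problem approximation k^2 / (2T), which assumes every combination is equally likely.

```r
n <- 10000
m <- 4000

# log10 of T = n! / (m! (n-m)!), the number of distinct training sets
log10_T <- lchoose(n, m) / log(10)

k <- 1000   # hypothetical number of shuffles

# birthday approximation: P(at least one repeated set) ~ k^2 / (2 T),
# kept on the log10 scale because T is far too large to represent
log10_p_repeat <- 2 * log10(k) - log10(2) - log10_T
```

log10_T comes out in the thousands, so the chance of drawing the same training set twice is negligible for any realistic number of shuffles.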



On 5/10/07, Rory Martin <rory.martin@comcast.net> wrote:
>
>  sample(1:10000, 4000) returns a =random= sample of 4000
> integers from [1,10000].  It is exceedingly unlikely
> you will generate exactly the same set of 4000 integers.
> And if it did happen, it wouldn't make the slightest
> difference to your results.
>
> Rory
>
>
>
> ----- Original Message -----
> *From:* HelponR <suncertain@gmail.com>
> *To:* Rory Martin <rory.martin@comcast.net>
> *Cc:* r-help@stat.math.ethz.ch
> *Sent:* Thursday, May 10, 2007 4:47 PM
> *Subject:* Re: [R] how to control the sampling to make each sample unique
>
>
> Yeah, I want to get all unique combinations of choosing ntest from ntotal.
>
> for example, choosing 4000 training data from 10,000 total data.
>
> Suppose they are sequenced as 1:10,000
>
> One obvious combination is 1:4000
>
> Then I run
>
> sample(1:10000, 4000)
>
> it may output 4000 numbers:
>
> 1, 3, 5, .... 7999
>
> Then I run again,
>
> it may output another 4000 numbers:
>
> 2, 4, 6, ..., 8000
>
> I know the number of such unique combinations is
>
> Choose 4000 from 10,000
>
> (I forgot how to denote this.)
>
> Anyway, I remember that choosing m from n is computed as
> T = n! / (m! (n-m)!)
>
> where ! is factorial
>
>
> My concern is:
> when will the sample output start to repeat?
>
> For example, maybe on the next run the output will be the same as the
> first time:
> 1, 2, 3, ..., 4000
> That's not what I want.
>
> I hope to get T different (unique) combinations in T runs. It is fine if it
> starts to repeat after T runs.
>
> I know sample() may already work this way, but I am not sure.
>
>
> Thank you!
>
>
>
