francy
2011-Oct-08 14:04 UTC
[R] Permutation or Bootstrap to obtain p-value for one sample
Hi,

I am having trouble understanding how to approach a simulation:

I have a sample of n=250 from a population of N=2,000 individuals, and I would like to use either a permutation test or the bootstrap to test whether this particular sample is significantly different from any other random sample of the same population. I thought I needed to take random samples of n=250 from the N=2,000 population (but I am not sure how many simulations I need to do) and maybe do a one-sample t-test to compare the mean score of all the simulated samples, plus the one sample I am trying to show is different from the others, to the mean value of the population. But I don't know:
(1) whether this one-sample t-test would be the right way to do it, and how to go about doing this in R
(2) whether a permutation test or bootstrap methods are more appropriate

This is the data frame that I have, which is to be sampled:
df <-
i.e.
x    y
1    2
3    4
5    6
7    8
.    .
.    .
.    .
(2,000 rows)

I have this sample from df, and would like to test whether it has extreme values of y.
sample1 <-
i.e.
x    y
3    4
7    8
.    .
.    .
.    .
(250 rows)

For now I only have this:

R=999 #Number of simulations, but I don't know how many...
t.values =numeric(R) #creates a numeric vector with 999 elements, which will hold the results of each simulation.
for (i in 1:R) {
sample1 <- df[sample(nrow(df), 250, replace=TRUE),]

But I don't know how to continue the loop: do I calculate the mean for each simulation and compare it to the population mean?
Any help you could give me would be very appreciated,
Thank you.

--
View this message in context: http://r.789695.n4.nabble.com/Permutation-or-Bootstrap-to-obtain-p-value-for-one-sample-tp3885118p3885118.html
Sent from the R help mailing list archive at Nabble.com.
Ken Hutchison
2011-Oct-08 23:27 UTC
[R] Permutation or Bootstrap to obtain p-value for one sample
Hi Francy,

A bootstrap test would likely be sufficient for this problem, but a one-sample t-test isn't advisable or necessary in my opinion. If you use a t-test multiple times you are making assumptions about the distribution of your data; more importantly, your probability of a Type I error increases with each test. So a valid thing to do would be to sample repeatedly (computation for this problem won't be expensive, so do a lot of reps) and compare your mean to the null distribution of means, i.e.:

population=df$y #the y values of all 2,000 individuals
nreps=10000
mean.dist=rep(NA,nreps)
for(replication in 1:nreps)
{
my.sample=sample(population, 250, replace=TRUE) #same size as your sample; replace could be FALSE, depends on preference
mean.for.rep=mean(my.sample) #mean for this replication
mean.dist[replication]=mean.for.rep #store the mean
}
hist(mean.dist,main="Null Dist of Means", col="chartreuse") #Show the means in a nifty color

You can then perform various tests given the null distribution, or infer from where your sample mean lies within the distribution, or something to that effect. Also, if the distribution is normal, which is somewhat likely since it is a distribution of means (shapiro.test, or require(nortest) and ad.test, will let you know), you should be able to make inference from that using parametric methods (once), which will fit the truth a bit better than a t.test.

Hope that's helpful,
Ken Hutchison
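Ken's suggestion of inferring "from where your sample mean lies within the distribution" can be sketched roughly as below. This is only an illustration under stated assumptions: the real df is not shown in the thread, so df and sample1 here are simulated stand-ins, and obs.mean, pop.mean and p.value are names made up for the example. The "+ 1" in the p-value is the usual convention of counting the observed sample as one of the replicates. One caveat worth knowing: shapiro.test() only accepts between 3 and 5000 values, so with 10000 replications it has to be run on a subset.

```r
set.seed(1)

# Hypothetical stand-in data (the thread does not show the real df)
df <- data.frame(x = 1:2000, y = rnorm(2000, mean = 50, sd = 10))
sample1 <- df[sample(nrow(df), 250), ]   # stand-in for the sample of interest

nreps <- 10000

# Null distribution of means for samples of size 250 (with replacement,
# as in the bootstrap-style loop above)
mean.dist <- replicate(nreps, mean(sample(df$y, 250, replace = TRUE)))

obs.mean <- mean(sample1$y)   # mean of the sample being tested
pop.mean <- mean(df$y)        # population mean

# Two-sided empirical p-value: share of simulated means at least as far
# from the population mean as the observed one (observed sample counted too)
p.value <- (sum(abs(mean.dist - pop.mean) >= abs(obs.mean - pop.mean)) + 1) /
  (nreps + 1)
p.value

# Normality check on the null distribution; shapiro.test() takes at most
# 5000 values, so test a random subset of the simulated means
shapiro.test(sample(mean.dist, 5000))
```

A small p.value would suggest that a mean as extreme as the observed one is rare among random samples of 250 from this population.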
peter dalgaard
2011-Oct-09 07:52 UTC
[R] Permutation or Bootstrap to obtain p-value for one sample
The straightforward way would be a permutation test, something like this:

msamp <- mean(sample1$y)
mpop <- mean(df$y)
msim <- replicate(10000, mean(sample(df$y, 250)))
sum(abs(msim-mpop) >= abs(msamp-mpop))/10000

I don't really see a reason to do bootstrapping here. You say you want to test whether your sample could be a random sample from the population, so just simulate that sampling (which should be without replacement, like your sample is). Bootstrapping might come in if you want a confidence interval for the mean difference between your sample and the rest.

Instead of sampling means, you could put a full-blown t-test inside the replicate expression, like:

psim <- replicate(10000, {s<-sample(1:2000, 250); t.test(df$y[s], df$y[-s])$p.value})

and then check whether the p value for your sample is small compared to the distribution of values in psim. That'll take quite a bit longer, though; t.test() is a more complex beast than mean(). It is not obvious that it has any benefits either, unless you specifically wanted to investigate the behavior of the t test.

(All code untested. Caveat emptor.)

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
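For the t-test variant, the step of checking "whether the p value for your sample is small compared to the distribution of values in psim" might look like the sketch below. It rests on assumptions that are not in the thread: df is simulated, idx is a hypothetical vector holding the row numbers of the sample of interest (with real data it might be recovered with something like match(sample1$x, df$x), if x identifies individuals), and the number of replications is cut to 2000 only to keep the run short.

```r
set.seed(42)

# Hypothetical stand-in data (the real df is not shown in the thread)
df <- data.frame(x = seq_len(2000), y = rnorm(2000))

# Hypothetical row numbers of the 250 individuals in the sample of interest
idx <- sample(nrow(df), 250)

# p-value for the observed sample against the rest of the population
pobs <- t.test(df$y[idx], df$y[-idx])$p.value

# Same comparison for random samples drawn without replacement
nsim <- 2000
psim <- replicate(nsim, {
  s <- sample(nrow(df), 250)
  t.test(df$y[s], df$y[-s])$p.value
})

# Proportion of random samples whose p-value is at least as small as
# the observed one; a small proportion marks the sample as unusual
mean(psim <= pobs)
```

Because idx is itself drawn at random here, the final proportion should not be expected to be small; with a genuinely unusual sample it would be.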