Thank you all, very much, for your kind and detailed explanations. I
didn't understand, mainly, that the matrix() call only called its
parameters once. I was certain that this was a bug with sample()
getting seeded with a constant value, and giving the same permutation.
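
A minimal sketch of that behaviour, for anyone who wants to see it directly. The seed and the names `recycled` and `fresh` are arbitrary choices for illustration and are not from the thread. sample() is evaluated once before matrix() ever sees it, and matrix() then recycles that single vector to fill every cell.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

# sample() runs once; matrix() recycles those 10 values to fill all
# 50 cells, so every row ends up being the same permutation.
recycled <- matrix(sample(1:10), 5, 10, byrow = TRUE)

# replicate() re-evaluates sample() on each of its 5 iterations. Each
# result becomes a column, so transpose to get one permutation per row.
fresh <- t(replicate(5, sample(1:10)))

nrow(unique(recycled))  # number of distinct rows: 1
nrow(unique(fresh))     # number of distinct rows among 5 independent shuffles
```

The first count is 1 by construction. The second is almost always 5, since two independent shuffles of ten values rarely coincide.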
I think I need to make my MWE a little less minimal to continue
learning. If you're familiar with the Lock5 statistics textbook, I'm
working on the Light and Dark mice example, where groups of mice were
exposed or not to light at night, then measured for weight gain. The
statistic is mean difference in weight gain between the two groups.
My understanding of how I'm supposed to construct a randomized
distribution is to join the weight gains of the 10 mice exposed to
light at night to the 8 mice not exposed to light at night. After
shuffling this data, I arbitrarily group the first 10 values into the
'light' group, and the last 8 into the 'dark' group, and find the
difference in their means.
I think I can do this correctly with:
==================
## Less-minimal working example
library(tidyverse)
library(Lock5Data)
data(LightatNight)
str(LightatNight)
## Or, if you don't have the Lock5Data library:
(d <- read_csv("https://www.lock5stat.com/datasets3e/LightatNight.csv"))
(lt <- d$BMGain[d$Group == "Light"])
(dk <- d$BMGain[d$Group == "Dark"])
(n_lt <- length(lt))
(n_dk <- length(dk))
(data <- c(lt, dk))
B <- 10 #Will be 1000
n <- length(data)
random.samples <- matrix(NA, B, n)
random.statistics <- rep(NA, B)
for(i in 1:B) {
    random.samples[i,] <- sample(data)
    random.statistics[i] <- mean(random.samples[i, 1:n_lt]) -
        mean(random.samples[i, (n_lt + 1):(n_lt + n_dk)])
}
random.samples
random.statistics
## Trying to do it without a for(), using Peter's suggestion:
(random.samples <- matrix(replicate(B, sample(data)), B, n,
                          byrow=TRUE))
compute.diff.means <- function(x) {
    return(mean(x[1:n_lt]) - mean(x[(n_lt+1):(n_lt+n_dk)]))
}
(random.statistics <- apply(random.samples, 1, compute.diff.means))
======================
I think both of these methods give me the data I'm trying for. Any
suggestions on my R coding techniques are welcome.
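
One way to check that the two methods really agree is to run both from the same seed and compare the results. This is only a sketch: the weight gains below are made-up numbers, not the Lock5 data, and the seed and the names `gains`, `loop.samples`, and `rep.samples` are assumptions for illustration. Both versions call sample() B times in the same order, so with an identical seed they should produce the same shuffles.

```r
# Hypothetical weight gains (NOT the Lock5 values): 10 "light", 8 "dark".
lt <- c(9.2, 6.9, 9.9, 11.1, 7.4, 8.8, 10.3, 9.5, 6.4, 7.9)
dk <- c(5.0, 6.0, 3.9, 6.5, 4.8, 5.9, 2.8, 5.1)
n_lt <- length(lt)
n_dk <- length(dk)
gains <- c(lt, dk)
n <- length(gains)
B <- 10

# Version 1: explicit for() loop.
set.seed(2025)  # arbitrary seed
loop.samples <- matrix(NA, B, n)
loop.stats <- rep(NA, B)
for (i in 1:B) {
    loop.samples[i, ] <- sample(gains)
    loop.stats[i] <- mean(loop.samples[i, 1:n_lt]) -
        mean(loop.samples[i, (n_lt + 1):n])
}

# Version 2: replicate() plus apply(), restarted from the same seed.
set.seed(2025)
rep.samples <- matrix(replicate(B, sample(gains)), B, n, byrow = TRUE)
rep.stats <- apply(rep.samples, 1, function(x) {
    mean(x[1:n_lt]) - mean(x[(n_lt + 1):n])
})

# Compare the shuffled matrices and the statistics from the two versions.
identical(loop.samples, rep.samples)
all.equal(loop.stats, rep.stats)
```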
Thank you all, again, for taking the time and effort to help me. Your
help is greatly appreciated.
-Kevin
On Thu, 2025-03-13 at 17:00 -0400, Kevin Zembower wrote:
> Hello, all,
>
> I'm learning to do randomized distributions in my Stats 101 class*. I
> thought I could do it with a call to sample() inside a matrix(),
> like:
>
> > matrix(sample(1:10, replace=TRUE), 5, 10, byrow=TRUE)
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,]    8    2    3    1    8    2    8    8    9     8
> [2,]    8    2    3    1    8    2    8    8    9     8
> [3,]    8    2    3    1    8    2    8    8    9     8
> [4,]    8    2    3    1    8    2    8    8    9     8
> [5,]    8    2    3    1    8    2    8    8    9     8
> >
>
> Imagine my surprise to learn that all the rows were the same
> permutation. I thought each time sample() was called inside the
> matrix, it would generate a different permutation.
>
> I modeled this after the bootstrap sample techniques in
> https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf. I don't
> understand why it works in bootstrap samples (with replace=TRUE), but
> not in randomized distributions (with replace=FALSE).
>
> Thanks for any insight you can share with me, and any suggestions for
> getting rows in a matrix with different permutations.
>
> -Kevin
>
> *No, this isn't a homework problem. We're using Lock5 as the text in
> class, along with its StatKey web application. I'm just trying to get
> more out of the class by also solving our problems using R, for which
> I'm not receiving any class credit.
@vi@e@gross m@iii@g oii gm@ii@com
2025-Mar-14 22:19 UTC
[R] What don't I understand about sample()?
Kevin,
I was amused by your wrapping assignments in parentheses to get the REPL to
print the assigned value, but I would remove the parentheses in any final
program where the output is not needed.
I am not saying your code is wrong, nor that what I describe below is better;
this is just a discussion of how others might do it.
I do somewhat wonder about the way you define the function below:
compute.diff.means <- function(x) {
    return(mean(x[1:n_lt]) - mean(x[(n_lt+1):(n_lt+n_dk)]))
}
It is a function of x, which you pass in, but it also relies on two external
variables, n_lt and n_dk, which it must look up in the enclosing environment
each time it is called. Those variables are assigned just once in your code:
(n_lt <- length(lt))
(n_dk <- length(dk))
Some people would write the function to include the variables as in:
compute.diff.means <- function(x, n_lt, n_dk) {
    return(mean(x[1:n_lt]) - mean(x[(n_lt+1):(n_lt+n_dk)]))
}
The names of the variables can be the same or new ones.
The reason you rely on external variables seems to be that you are using
"apply", as shown below, and may not know that it can accept additional
arguments.
(random.statistics <- apply(random.samples, 1, compute.diff.means))
The above is an implicit loop that calls compute.diff.means() repeatedly over
each row of your matrix. It passes the specific row as a vector as the first and
only argument.
If you type "?apply" to see the documentation for the apply function, you may
note that, like some other such functions, it has a "..." parameter, which
means that anything else you supply as extra arguments is passed along to
the function. So, since your variables are not changing, code like this:
(random.statistics <- apply(random.samples, 1, compute.diff.means, n_lt, n_dk))
will call the function with a row vector followed by the two additional
arguments, so each call will be:
compute.diff.means(ROW, n_lt, n_dk)
Arguably, this approach may be no better but in some sense makes your function
more portable and cleaner. If your code continued and did additional analyses
like this, the function might be more easily re-usable.
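
A small self-contained sketch of how apply() forwards extra arguments. The matrix `m` and the names `diff.means`, `n_first`, and `n_second` are made up for illustration and are not from Kevin's code.

```r
# Difference between the mean of the first n_first values and the mean
# of the next n_second values. Both sizes are explicit arguments, so the
# function does not depend on any variable outside itself.
diff.means <- function(x, n_first, n_second) {
    mean(x[1:n_first]) - mean(x[(n_first + 1):(n_first + n_second)])
}

# Made-up 2 x 5 matrix; each row stands in for one shuffled sample.
m <- matrix(c( 1,  2,  3,  4,  5,
              10, 20, 30, 40, 50), nrow = 2, byrow = TRUE)

# Anything given after the function is passed on to every call,
# either by position or by name.
by.position <- apply(m, 1, diff.means, 2, 3)
by.name     <- apply(m, 1, diff.means, n_first = 2, n_second = 3)

by.position  # -2.5 -25.0
```

Passing the extra arguments by name is a little longer but makes it harder to swap the two group sizes by accident.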
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Kevin Zembower
via R-help
Sent: Friday, March 14, 2025 2:52 PM
To: r-help at r-project.org
Subject: Re: [R] What don't I understand about sample()?
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Not having the book (and which of the three editions are you using?), I
downloaded the data and played with it for a bit.

dotchart() showed the Dark and Light conditions looked quite different,
but also showed that there are not very many cases.

After trying t.test, it occurred to me that I did not know whether
"BMGain" means gain in *grams* or gain in *percent*. Reflection told me
that for a growth experiment, percent made more sense, which reminded me
of one of my first student advising experiences, where I said "never give
the computer percentages; let IT calculate the percentages from the
baseline and outcome, because once you've thrown away information, the
computer can't magically get it back."

In particular, in the real world I'd be worried about the possibility
that there was some confounding going on, so I would much rather have
initial weight and final weight as variables.

If BMGain is an absolute measure, the p value for a t test is teeny tiny.
If BMGain is a percentage, the p value for a sensible t test is about 0.03.

A permutation test went like this.

is.light <- d$Group == "Light"
is.dark <- d$Group == "Dark"
score <- function (g) mean(g[is.light]) - mean(g[is.dark])
base.score <- score(d$BMGain)
perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
sum(perm.scores >= base.score) / length(perm.scores)

I don't actually see where matrix() comes into it, still less anything in
the tidyverse.
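
For readers without the data file, the same label-based permutation test can be tried on made-up numbers. This is a sketch under stated assumptions: the `gain` values are invented and are NOT the Lock5 measurements, the seed is arbitrary, and the names `gain`, `group`, `observed`, and `p.one.sided` are hypothetical. The group labels stay fixed while the values are shuffled, so no matrix of shuffled samples needs to be stored.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

# Invented weight gains (NOT the Lock5 values): 10 "Light", then 8 "Dark".
gain <- c(9.2, 6.9, 9.9, 11.1, 7.4, 8.8, 10.3, 9.5, 6.4, 7.9,
          5.0, 6.0, 3.9, 6.5, 4.8, 5.9, 2.8, 5.1)
group <- rep(c("Light", "Dark"), times = c(10, 8))
is.light <- group == "Light"

# Difference in means, Light minus Dark, for any ordering of the values.
score <- function(g) mean(g[is.light]) - mean(g[!is.light])

observed <- score(gain)

# Shuffle the values against the fixed labels and recompute the statistic.
perm.scores <- replicate(997, score(sample(gain)))

# One-sided proportion of shuffles at least as extreme as the observed one.
p.one.sided <- mean(perm.scores >= observed)
```

Shuffling the values against fixed labels is equivalent to shuffling and then calling the first 10 values "light" and the last 8 "dark".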