Hello,
I have been following this thread and though answers have been given,
some of them address R coding in general, not necessarily the sample()
function. Here are some random notes I think the OP could use, prompted
by the text linked to, chap3.pdf.
1.
Throughout the text, assignments use the equal sign instead of the left
arrow. The left arrow is generally considered more idiomatic and there
is an important difference between the two; see help("assignOps").
time.mean = with(CommuteAtlanta, mean(Time))
B = 1000
n = nrow(CommuteAtlanta)
# This should be used, not the above.
time.mean <- with(CommuteAtlanta, mean(Time))
B <- 1000
n <- nrow(CommuteAtlanta)
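One place the difference bites, as a minimal sketch: inside a function call, `=` names an argument while `<-` performs an assignment in the calling environment as a side effect.

```r
# '=' inside a call matches an argument name; no variable is created.
m1 <- mean(x = 1:10)    # 'x' here is just the argument of mean()
exists("x")             # FALSE (assuming no 'x' was defined beforehand)

# '<-' inside a call assigns in the caller's environment.
m2 <- mean(x <- 1:10)   # creates 'x' in the workspace as a side effect
exists("x")             # TRUE
```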
2.
The text makes systematic use of apply(., 1, mean).
rowMeans() (and colMeans()) are much faster.
boot.statistics <- apply(boot.samples, 1, mean)
boot.statistics <- rowMeans(boot.samples)
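A quick timing sketch of the difference, on made-up data (exact timings will vary by machine):

```r
set.seed(42)
boot.samples <- matrix(rnorm(1000 * 500), nrow = 1000, ncol = 500)

# Same result, very different speed.
t1 <- system.time(s1 <- apply(boot.samples, 1, mean))
t2 <- system.time(s2 <- rowMeans(boot.samples))

all.equal(s1, s2)              # TRUE
t1["elapsed"]; t2["elapsed"]   # rowMeans is typically many times faster
```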
3.
The first confidence interval computation seems awkward. I had never
seen this way of computing a CI95.
Moreover, it's plain common sense to keep results at full precision and
round for display purposes only.
And the normal intervals are computed in a more usual way later in the
text, see sections 1.2 and 1.3.
me <- ceiling(10 * 2 * time.se)/10
round(time.mean, 1) + c(-1, 1) * me
# Straightforward.
normal_ci95 <- time.mean + c(-1, 1) * 2 * time.se
normal_ci95
round(normal_ci95, 1)
# section 1.2, function boot.mean
interval = mean(x) + c(-1,1)*2*se
# section 1.3
with(students, mean(Height) + c(-1, 1) * 2 * sd(result))
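To make the "round only for display" point concrete, a small sketch with simulated data (the numbers are invented, not the CommuteAtlanta times):

```r
set.seed(1)
time <- rnorm(500, mean = 29, sd = 20)       # hypothetical commute times
time.mean <- mean(time)
time.se   <- sd(time) / sqrt(length(time))

ci95 <- time.mean + c(-1, 1) * 2 * time.se   # keep full precision internally
ci95
round(ci95, 1)                               # round only when displaying
```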
4.
In section 1.2 there is a bootstrap function boot.mean().
The function could be improved to let users pass a conf.level of their
choice.
And why force the function user to always have the plot displayed? Base
functions hist() and barplot() have an argument 'plot', with default TRUE,
allowing the user to choose.
The following seems more user friendly.
boot.mean <- function(x, B, binwidth = NULL, conf.level = 0.95, plot = TRUE) {
    require(ggplot2)
    n <- length(x)
    boot.samples <- matrix(sample(x, size = n * B, replace = TRUE), B, n)
    boot.statistics <- rowMeans(boot.samples)
    se <- sd(boot.statistics)
    if (is.null(binwidth))
        binwidth <- diff(range(boot.statistics))/30
    p <- ggplot(data.frame(x = boot.statistics), aes(x = x)) +
        geom_histogram(aes(y = ..density..), binwidth = binwidth) +
        geom_density(color = "red")
    alpha <- 1 - (1 - conf.level)/2
    interval <- mean(x) + c(-1, 1) * qnorm(alpha) * se
    if (plot) {
        plot(p)
    }
    list(boot.statistics = boot.statistics, interval = interval, se = se,
         plot = p)
}
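As a sanity check on the conf.level generalization: qnorm(1 - (1 - conf.level)/2) reproduces the familiar multipliers that the text approximates with the hard-coded 2 (which roughly corresponds to conf.level = 0.95).

```r
# Normal-quantile multiplier for a given confidence level.
z <- function(conf.level) qnorm(1 - (1 - conf.level)/2)

round(z(0.95), 2)   # 1.96, the multiplier the text rounds to 2
round(z(0.90), 2)   # 1.64
round(z(0.99), 2)   # 2.58
```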
Hope this helps,
Rui Barradas
At 17:28 on 15/03/2025, Kevin Zembower via R-help
wrote:
> Hi, Richard, thanks for replying. I should have mentioned the third
> edition, which we're using. The data file didn't change between the
> second and third editions, and the data on Body Mass Gain was the same
> as in the first edition, although the first edition data file contained
> additional variables.
>
> According to my text, the BMGain was measured in grams. Thanks for
> pointing out that my statement of the problem lacked crucial
> information.
>
> The matrix in my example comes from an example in
> https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
> created a bootstrap example with a matrix that consisted of one row for
> every sample in the bootstrap, and one column for each mean in the
> original data. This allowed him to find the mean for each row to create
> the bootstrap statistics.
>
> The only need for the tidyverse is to use the read_csv() function. I'm
> regrettably lazy in not determining which of the multiple functions in
> the tidyverse library loads read_csv(), and just using that one.
>
> Thanks, again, for helping me to further understand R and this problem.
>
> -Kevin
>
> On Sat, 2025-03-15 at 12:00 +0100, r-help-request at r-project.org wrote:
>> Not having the book (and which of the three editions are you using?),
>> I downloaded the data and played with it for a bit.
>> dotchart() showed the Dark and Light conditions looked quite
>> different, but also showed that there are not very many cases.
>> After trying t.test, it occurred to me that I did not know whether
>> "BMGain" means gain in *grams* or gain in *percent*.
>> Reflection told me that for a growth experiment, percent made more
>> sense, which reminded me of one of my first
>> student advising experiences, where I said "never give the computer
>> percentages; let IT calculate the percentages
>> from the baseline and outcome, because once you've thrown away
>> information, the computer can't magically get it back."
>> In particular, in the real world I'd be worried about the possibility
>> that there was some confounding going on, so I would
>> much rather have initial weight and final weight as variables.
>> If BMGain is an absolute measure, the p value for a t test is teeny
>> tiny.
>> If BMGain is a percentage, the p value for a sensible t test is about
>> 0.03.
>>
>> A permutation test went like this.
>> is.light <- d$Group == "Light"
>> is.dark <- d$Group == "Dark"
>> score <- function (g) mean(g[is.light]) - mean(g[is.dark])
>> base.score <- score(d$BMGain)
>> perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
>> sum(perm.scores >= base.score) / length(perm.scores)
>>
>> I don't actually see where matrix() comes into it, still less
>> anything
>> in the tidyverse.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
avi.e.gross at gmail.com
2025-Mar-16 01:31 UTC
[R] What don't I understand about sample()?
Rui,
I just want to discuss this point you made for now:
2.
Systematic use of apply(., 1, mean).
rowMeans (and colMeans) are much faster.
You are, of course, technically correct. But consider an alternative aspect,
in which you are teaching a programming language as a small set of
"commands" that can do quite a bit in various legal combinations.
These are often reserved words and primitives like "if" or operators
like "<-" and so on.
You then can discuss various built-in functionality that extends your abilities,
such as the array of built-in functions commonly used. I see these in several
categories. Some can be examined at the R prompt to see the R code that produces
them. You can learn how they work and perhaps cannibalize from them to make your
own code that uses helpful parts and avoids functionality you do not need.
The other kind is relatively hidden as .Internal, or as a function that after
some processing calls .Internal(...), and can be code written in various
(perhaps usually compiled) languages such as a version of C, or perhaps pulled
from libraries such as FORTRAN mathematical functions. These are sometimes
faster partially because they are not being read and interpreted, or perhaps use
cute programming tricks.
So, as I see it, whatever teacher or book or course is being used by the OP is
presumably teaching ways to approach problems at the level of ideas. The goal is
not just learning how to do one problem, but adding to your tool bag for
additional problems. And, in R, some of those ideas revolve around how to do
implicit loops and, in some sense, vectorized operations.
So, if I understood it, the assignment included getting some number of
equal-sized vectors and, instead of working on them one at a time, assembling
them into a collection of vectors. Besides generalized lists, R supports data
frames as a list of vectors (or other things too) and matrices implemented as
vectors with attributes, among other structures.
The assignment seems to ask them to assemble the vectors into rows of a matrix
object and then use something that operates on one row at a time, as already
discussed. Clearly, someone has already made implementations that allow you to
avoid directly using apply() and get the same, faster result, using rowMeans.
But note that although there are perhaps additional rowwise operations available
this way, such as rowSums, in the general case, there are many possible
functions not already supplied.
If I want something like a rowMinMax, or something exotic where I consider the
row to contain the coefficients of a polynomial to evaluate for some given value
of X, then a less efficient solution can easily be put together using something
like apply(), or, for other data structures, cousins like
lapply/sapply/vapply and quite a few other methods in packages, including
functions like pmap and interesting methods in dplyr such as rowwise() that
allow per-row calculations easily.
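For instance, a hypothetical rowMinMax (the name is mine, not a base function) has no fast built-in equivalent, but apply() handles it directly:

```r
m <- matrix(c(3, 1, 4,
              1, 5, 9,
              2, 6, 5), nrow = 3, byrow = TRUE)

# No rowMinMax() exists in base R; apply() lets us roll our own.
# range() returns c(min, max), so transpose to get one row per input row.
rowMinMax <- function(m) t(apply(m, 1, range))
rowMinMax(m)
#      [,1] [,2]
# [1,]    1    4
# [2,]    1    9
# [3,]    2    6
```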
There are usually many ways to do something in R, as in many languages, and a
solution that is highly efficient in some situations is not always the best or
most efficient in others. An example might be how some forms of looping are
better suited when the number of items is not known in advance. A compiled
language will often optimize away a loop over a fixed small number of items,
like an index going from 1 to 2 to calculate the squares of the index and store
them in an array of two. The loop may be replaced by calling a function twice,
or the calculation may even be done at compile time, inserting a data structure
with the results and not doing anything major at run time.
In this case, the OP may actually have a small fixed number of such results and
possibly there are faster ways to do the calculation. But, if the point is to
learn how to do arbitrary calculations when the magnitude of the parts may vary,
then this is reasonable. Of course, if the samples being used are not all the
same length, this method can fail badly while other methods may still work well.
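A sketch of that failure mode: unequal-length samples cannot be bound into a matrix without recycling, but a list plus sapply() still works (the sample values here are invented):

```r
samples <- list(a = c(1, 2, 3),
                b = c(4, 5),          # a different length: no clean matrix
                c = c(6, 7, 8, 9))

# A matrix would silently recycle values; a list keeps each sample intact.
sapply(samples, mean)
#   a   b   c
# 2.0 4.5 7.5
```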
I note, relatedly, that one of us here has expressed a wish to keep some of
their code as near to base R as possible rather than jump right away into the
tidyverse. I can respect that even as I, personally, prefer using whatever
works for me faster and easier. At least, that is true when I am prototyping or
doing something one-off. If the code turns out to be used a lot and is slow, I
might reconsider. But, having said that, I suspect some packages may allow
faster code than the basics. Of course, things like the built-in native pipe
|> can change things so an altered package using a slower pipe may no longer
...
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rui Barradas
Sent: Saturday, March 15, 2025 5:28 PM
To: Kevin Zembower <kevin at zembower.org>; r-help at r-project.org
Subject: Re: [R] What don't I understand about sample()?