thr3ads.net - R help - [R] What don't I understand about sample()? [Mar 2025]

If this information is useful, please help other people find it:
Share via:

Kevin Zembower

2025-Mar-15 17:28 UTC

[R] What don't I understand about sample()?

Hi, Richard, thanks for replying. I should have mentioned the third
edition, which we're using. The data file didn't change between the
second and third editions, and the data on Body Mass Gain was the same
as in the first edition, although the first edition data file contained
additional variables.

According to my text, the BMGain was measured in grams. Thanks for
pointing out that my statement of the problem lacked crucial
information.

The matrix in my example comes from an example in
https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
created a bootstrap example with a matrix that consisted of one row for
every sample in the bootstrap, and one column for each mean in the
original data. This allowed him to find the mean for each row to create
the bootstrap statistics.

The only need for the tidyverse is to use the read_csv() function. I'm
regrettably lazy in not determining which of the multiple functions in
the tidyverse library loads read_csv(), and just using that one.

Thanks, again, for helping me to further understand R and this problem.

-Kevin

On Sat, 2025-03-15 at 12:00 +0100, r-help-request at r-project.org
wrote:> Not having the book (and which of the three editions are you using?),
> I downloaded the data and played with it for a bit.
> dotchart() showed the Dark and Light conditions looked quite
> different, but also showed that there are not very many cases.
> After trying t.test, it occurred to me that I did not know whether
> "BMGain" means gain in *grams* or gain in *percent*.
> Reflection told me that for a growth experiment, percent made more
> sense, which reminded my of one of my first
> student advising experiences, where I said "never give the computer
> percentages; let IT calculate the percentages
> from the baseline and outcome, because once you've thrown away
> information, the computer can't magically get it back."
> In particular, in the real world I'd be worried about the possibility
> that there was some confounding going on, so I would
> much rather have initial weight and final weight as variables.
> If BMGain is an absolute measure, the p value for a t test is teeny
> tiny.
> If BMGain is a percentage, the p value for a sensible t test is about
> 0.03.
> 
> A permutation test went like this.
> is.light <- d$Group == "Light"
> is.dark <- d$Group == "Dark"
> score <- function (g) mean(g[is.light]) - mean(g[is.dark])
> base.score <- score(d$BMGain)
> perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
> sum(perm.scores >= base.score) / length(perm.scores)
> 
> I don't actually see where matrix() comes into it, still less
> anything
> in the tidyverse.
>

@vi@e@gross m@iii@g oii gm@ii@com

2025-Mar-15 17:55 UTC

head link

[R] What don't I understand about sample()?

Kevin & Richard, and of course everyone,

As the main topic here is not the tidyverse, I will mention the perils of
loading in more than needed in general.

If you want to use one or a very few functions, it can be more efficient and
safe to load exactly what is needed. In the case of wanting to use read_csv(), I
think this suffices:

library(readr)

If you instead use:

library(tidyverse)

You load a varying number of packages (it may change) including some like
lubridate or forcats or ggplot2 that you may not be even thinking of using or
never heard of.

The bigger problem is shadowing that happens. For example, you may be getting
warning messages like:

? dplyr::filter() masks stats::filter()
? dplyr::lag()    masks stats::lag()

This can interfere with some other package you had already loaded unless it uses
a notation like mypackage::filter(...) in their code to avoid being easily
replaced but even then, if you yourself called what you though was filter() from
base R or some package, you have a problem unless you invoke it like
base::filter(...)

The order packages like this load can matter as well as when you define a
function of your own. So, it may be worth some effort to zoom in and call
exactly what you need and only when you need it. I have seen code that only
needs a package in rare conditions and only loads the package in one branch of
an IF statement right before using in.
.
Packages can also be unloaded after use.
>From what you describe, none of this is crucially important as you are using
R for your own purposes in your own RMarkDown file that you may not be
distributing. And, when I write programs where I keep adjusting and adding
things from the tidyverse, it is indeed much easier to just get the grouping on
top and forget about it. That is, until I decide to do something with functional
programming that uses reduce/filter/map... and have an odd error!

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Kevin Zembower
via R-help
Sent: Saturday, March 15, 2025 1:29 PM
To: r-help at r-project.org
Subject: Re: [R] What don't I understand about sample()?

Hi, Richard, thanks for replying. I should have mentioned the third
edition, which we're using. The data file didn't change between the
second and third editions, and the data on Body Mass Gain was the same
as in the first edition, although the first edition data file contained
additional variables.

According to my text, the BMGain was measured in grams. Thanks for
pointing out that my statement of the problem lacked crucial
information.

The matrix in my example comes from an example in
https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
created a bootstrap example with a matrix that consisted of one row for
every sample in the bootstrap, and one column for each mean in the
original data. This allowed him to find the mean for each row to create
the bootstrap statistics.

The only need for the tidyverse is to use the read_csv() function. I'm
regrettably lazy in not determining which of the multiple functions in
the tidyverse library loads read_csv(), and just using that one.

Thanks, again, for helping me to further understand R and this problem.

-Kevin

On Sat, 2025-03-15 at 12:00 +0100, r-help-request at r-project.org
wrote:> Not having the book (and which of the three editions are you using?),
> I downloaded the data and played with it for a bit.
> dotchart() showed the Dark and Light conditions looked quite
> different, but also showed that there are not very many cases.
> After trying t.test, it occurred to me that I did not know whether
> "BMGain" means gain in *grams* or gain in *percent*.
> Reflection told me that for a growth experiment, percent made more
> sense, which reminded my of one of my first
> student advising experiences, where I said "never give the computer
> percentages; let IT calculate the percentages
> from the baseline and outcome, because once you've thrown away
> information, the computer can't magically get it back."
> In particular, in the real world I'd be worried about the possibility
> that there was some confounding going on, so I would
> much rather have initial weight and final weight as variables.
> If BMGain is an absolute measure, the p value for a t test is teeny
> tiny.
> If BMGain is a percentage, the p value for a sensible t test is about
> 0.03.
> 
> A permutation test went like this.
> is.light <- d$Group == "Light"
> is.dark <- d$Group == "Dark"
> score <- function (g) mean(g[is.light]) - mean(g[is.dark])
> base.score <- score(d$BMGain)
> perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
> sum(perm.scores >= base.score) / length(perm.scores)
> 
> I don't actually see where matrix() comes into it, still less
> anything
> in the tidyverse.
>
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Rui Barradas

2025-Mar-15 21:27 UTC

head link

[R] What don't I understand about sample()?

Hello,

I have been following this thread and though answers have been given, 
some of them address R coding in general, not necessarily the sample() 
function. Here are some random notes I think the OP could use, prompted 
by the text linked to, chap3.pdf.

1.
Throughout the text, assignments use the equal sign instead of the left 
arrow. The left arrow is generally considered more idiomatic and there 
is an important diference beteewn he wo, see help("assignOps").


time.mean = with(CommuteAtlanta, mean(Time))
B = 1000
n = nrow(CommuteAtlanta)

# This should be used, not the above.
time.mean <- with(CommuteAtlanta, mean(Time))
B <- 1000
n <- nrow(CommuteAtlanta)


2.
Systematic use of apply(., 1, mean).
rowMeans (and colMeans) are much faster.


boot.statistics <- apply(boot.samples, 1, mean)
boot.statistics <- rowMeans(boot.samples)


3.
The first confidence interval computation seems awkward. I had never 
seen this way of computing a CI95.
Moreover, it's plain common sense to keep the results with the returned 
decimals and round for display purposes only.
And the normal intervals are computed in a more usual way later in the 
text, see sections 1.2 and 1.3.


me <- ceiling(10 * 2 * time.se)/10
round(time.mean, 1) + c(-1, 1) * me

# Straightforward.
normal_ci95 <- time.mean + c(-1, 1) * 2 * time.se
normal_ci95
round(normal_ci95, 1)

# section 1.2 , function boot.mean
interval = mean(x) + c(-1,1)*2*se

# section 1.3
with(students, mean(Height) + c(-1, 1) * 2 * sd(result))



4.
In section 1.2 there is a bootstrap function boot.mean().
The function could be improved to let users pass a conf.level of their 
choice.
And why force the function user to always have the plot displayed? Base 
functions hist() and barplot() have na argument 'plot' with default TRUE
allowing the user to choose.

The following seems more user friendly.


boot.mean <- function(x, B, binwidth = NULL, conf.level = 0.95, plot = 
TRUE) {
   require(ggplot2)
   n <- length(x)
   boot.samples <- matrix( sample(x, size = n*B, replace = TRUE), B, n)
   boot.statistics <- rowMeans(boot.samples)
   se <- sd(boot.statistics)
   if ( is.null(binwidth) )
     binwidth <- diff(range(boot.statistics))/30

   p <- ggplot(data.frame(x = boot.statistics), aes(x = x)) +
     geom_histogram(aes(y = ..density..),binwidth = binwidth) +
     geom_density(color = "red")

   alpha <- 1 - (1 - conf.level)/2
   interval <- mean(x) + c(-1, 1) * qnorm(alpha) * se

   if(plot) {
     plot(p)
   }

   list(boot.statistics = boot.statistics, interval = interval, se = se, 
plot = p)
}


Hope this helps,

Rui Barradas



?s 17:28 de 15/03/2025, Kevin Zembower via R-help
escreveu:> Hi, Richard, thanks for replying. I should have mentioned the third
> edition, which we're using. The data file didn't change between the
> second and third editions, and the data on Body Mass Gain was the same
> as in the first edition, although the first edition data file contained
> additional variables.
> 
> According to my text, the BMGain was measured in grams. Thanks for
> pointing out that my statement of the problem lacked crucial
> information.
> 
> The matrix in my example comes from an example in
> https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
> created a bootstrap example with a matrix that consisted of one row for
> every sample in the bootstrap, and one column for each mean in the
> original data. This allowed him to find the mean for each row to create
> the bootstrap statistics.
> 
> The only need for the tidyverse is to use the read_csv() function. I'm
> regrettably lazy in not determining which of the multiple functions in
> the tidyverse library loads read_csv(), and just using that one.
> 
> Thanks, again, for helping me to further understand R and this problem.
> 
> -Kevin
> 
> On Sat, 2025-03-15 at 12:00 +0100, r-help-request at r-project.org wrote:
>> Not having the book (and which of the three editions are you using?),
>> I downloaded the data and played with it for a bit.
>> dotchart() showed the Dark and Light conditions looked quite
>> different, but also showed that there are not very many cases.
>> After trying t.test, it occurred to me that I did not know whether
>> "BMGain" means gain in *grams* or gain in *percent*.
>> Reflection told me that for a growth experiment, percent made more
>> sense, which reminded my of one of my first
>> student advising experiences, where I said "never give the
computer
>> percentages; let IT calculate the percentages
>> from the baseline and outcome, because once you've thrown away
>> information, the computer can't magically get it back."
>> In particular, in the real world I'd be worried about the
possibility
>> that there was some confounding going on, so I would
>> much rather have initial weight and final weight as variables.
>> If BMGain is an absolute measure, the p value for a t test is teeny
>> tiny.
>> If BMGain is a percentage, the p value for a sensible t test is about
>> 0.03.
>>
>> A permutation test went like this.
>> is.light <- d$Group == "Light"
>> is.dark <- d$Group == "Dark"
>> score <- function (g) mean(g[is.light]) - mean(g[is.dark])
>> base.score <- score(d$BMGain)
>> perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
>> sum(perm.scores >= base.score) / length(perm.scores)
>>
>> I don't actually see where matrix() comes into it, still less
>> anything
>> in the tidyverse.
>>
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Este e-mail foi analisado pelo software antiv?rus AVG para verificar a presen?a
de v?rus.
www.avg.com

R help - Mar 2025 - What don't I understand about sample()?

[R] What don't I understand about sample()?

[R] What don't I understand about sample()?

[R] What don't I understand about sample()?