Rgui 4.4.3 on Windows. When I start it up, read.csv is just *there*.
I don't need to load any package to get it.

I have three reasons for being very sparing in the packages I use.

1. It took me long enough to get my head around R. More packages means
more things to learn. I *still* have major trouble grasping the
tidyverse, and as far as I can see it doesn't solve any problem that
*I* have. I install a package only when I have a specific need for
something it does, like spatial statistics. (And yet I have hundreds
of packages installed, because packages depend on other packages.)

2. Everything changes, and they don't all change coherently. A package
I've used for years may not be available in the next release. This is
not a theoretical possibility; it has happened to me often. "If I
don't use it I can't lose it." Sometimes things break because
something else on the system (tcl/tk, or the C or Fortran compiler)
has changed. I'm tired of things breaking because the C or Fortran
compiler is now stricter.

3. The universe of R packages is vast and constantly expanding. This
makes it *impossible* for anyone to test every possible combination.
I used to teach software engineering, and we had a slogan: "if it
isn't tested, it doesn't work". Base R plus package X? Probably
tested. Base R plus package Y? Probably tested. Base R plus X plus Y?
Not unless X requires Y or Y requires X.

There is also the didactic point that the more you work with base R,
the better you will understand it, and you will need that
understanding for other things like the tidyverse. It's like
mastering the alphabet before you learn shorthand.

On Sun, 16 Mar 2025 at 06:55, <avi.e.gross at gmail.com> wrote:
>
> Kevin & Richard, and of course everyone,
>
> As the main topic here is not the tidyverse, I will mention the
> perils of loading more than is needed in general.
>
> If you want to use one or a very few functions, it can be more
> efficient and safer to load exactly what is needed. In the case of
> wanting to use read_csv(), I think this suffices:
>
> library(readr)
>
> If you instead use:
>
> library(tidyverse)
>
> you load a varying number of packages (the set may change),
> including some, like lubridate or forcats or ggplot2, that you may
> not be thinking of using or may never have heard of.
>
> The bigger problem is the shadowing that happens. For example, you
> may get warning messages like:
>
> ✖ dplyr::filter() masks stats::filter()
> ✖ dplyr::lag()    masks stats::lag()
>
> This can interfere with some other package you had already loaded,
> unless it uses a notation like mypackage::filter(...) in its code to
> avoid being easily replaced. Even then, if you yourself called what
> you thought was filter() from base R or some package, you have a
> problem unless you invoke it explicitly, like stats::filter(...).
>
> The order in which such packages load can matter, as can the point
> at which you define a function of your own. So it may be worth some
> effort to zoom in and call exactly what you need, only when you need
> it. I have seen code that only needs a package in rare conditions
> and only loads the package in one branch of an if statement, right
> before using it.
>
> Packages can also be unloaded after use.
>
> From what you describe, none of this is crucially important, as you
> are using R for your own purposes in your own R Markdown file that
> you may not be distributing. And when I write programs where I keep
> adjusting and adding things from the tidyverse, it is indeed much
> easier to just attach the whole grouping at the top and forget about
> it. That is, until I decide to do something with functional
> programming that uses reduce/filter/map... and get an odd error!
>
>
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Kevin Zembower via R-help
> Sent: Saturday, March 15, 2025 1:29 PM
> To: r-help at r-project.org
> Subject: Re: [R] What don't I understand about sample()?
>
> Hi, Richard, thanks for replying. I should have mentioned the third
> edition, which we're using. The data file didn't change between the
> second and third editions, and the data on Body Mass Gain was the same
> as in the first edition, although the first edition data file contained
> additional variables.
>
> According to my text, the BMGain was measured in grams. Thanks for
> pointing out that my statement of the problem lacked crucial
> information.
>
> The matrix in my example comes from an example in
> https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
> created a bootstrap example with a matrix that consisted of one row for
> every sample in the bootstrap, and one column for each value in the
> original data. This allowed him to find the mean of each row to create
> the bootstrap statistics.
>
> The only need for the tidyverse is to use the read_csv() function. I'm
> regrettably lazy in not determining which of the multiple packages in
> the tidyverse provides read_csv(), and just loading the whole thing.
>
> Thanks, again, for helping me to further understand R and this problem.
>
> -Kevin
>
> On Sat, 2025-03-15 at 12:00 +0100, r-help-request at r-project.org wrote:
> > Not having the book (and which of the three editions are you using?),
> > I downloaded the data and played with it for a bit.
> > dotchart() showed the Dark and Light conditions looked quite
> > different, but also showed that there are not very many cases.
> > After trying t.test, it occurred to me that I did not know whether
> > "BMGain" means gain in *grams* or gain in *percent*.
> > Reflection told me that for a growth experiment, percent made more
> > sense, which reminded me of one of my first
> > student advising experiences, where I said "never give the computer
> > percentages; let IT calculate the percentages
> > from the baseline and outcome, because once you've thrown away
> > information, the computer can't magically get it back."
> > In particular, in the real world I'd be worried about the possibility
> > that there was some confounding going on, so I would
> > much rather have initial weight and final weight as variables.
> > If BMGain is an absolute measure, the p value for a t test is teeny
> > tiny.
> > If BMGain is a percentage, the p value for a sensible t test is about
> > 0.03.
> >
> > A permutation test went like this.
> > is.light <- d$Group == "Light"
> > is.dark <- d$Group == "Dark"
> > score <- function (g) mean(g[is.light]) - mean(g[is.dark])
> > base.score <- score(d$BMGain)
> > perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
> > sum(perm.scores >= base.score) / length(perm.scores)
> >
> > I don't actually see where matrix() comes into it, still less
> > anything in the tidyverse.
> >
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
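As a middle ground between attaching the whole tidyverse and staying purely in base R, the points in the quoted message can be sketched as follows. This is only a sketch: it assumes readr is installed, and the tiny CSV is written to a temporary file so the example is self-contained.

```r
# Write a tiny two-row CSV so the example needs no external file.
tf <- tempfile(fileext = ".csv")
writeLines(c("Group,BMGain", "Dark,7.3", "Light,5.9"), tf)

# Base R: no package needed at all.
d_base <- read.csv(tf)

# Alternative: skip library() entirely and qualify the call by
# namespace. This loads readr but attaches nothing to the search
# path, so no base functions get masked.
if (requireNamespace("readr", quietly = TRUE)) {
  d_readr <- readr::read_csv(tf, show_col_types = FALSE)
}

# If a name *has* been masked (e.g. by dplyr), the original is
# always reachable explicitly, e.g. stats::filter(...).
```

The `show_col_types = FALSE` argument (available in recent readr versions) just suppresses the column-specification banner; drop it on older installations.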
avi.e.gross at gmail.com
2025-Mar-16 04:51 UTC
[R] What don't I understand about sample()?
Richard,

The function with a period as a separator that you cite, read.csv, is
part of base R. We have been discussing a function named just a tad
differently, with an underscore as a separator: read_csv. It is
similar but differs in how it works and in the options it supports.
It is considered part of the tidyverse grouping of packages, and can
also be obtained more compactly by loading the package "readr".

The OP, for reasons of their own, wanted to use read_csv and did not
want or need anything else in the related packages. Of course, nobody
is required to use other packages, albeit, as you noted, many
packages you may choose to use have dependencies on others you don't.
Like many good things, added functionality does add complexity and
room for failure.

But when a package is useful enough, it can develop enough momentum
that moving some of its functionality into base R becomes a good
idea. As an example I already mentioned, of the various pipe
implementations, a version has been added to base R, and I suspect
many older packages, including in the tidyverse, can adjust their
code in new releases to use it, but with CARE. Anyone still using
older versions of R will experience failures in such a scenario.
Luckily, many uses within a package are likely to be safe if done
properly.

Can anyone share whether any such methods are in use? I mean, as an
example, could a package check early on whether the R version in use
is later than the introduction of |>, or in some other way check
whether a |> operation is supported? Could it then introduce an
operator bound to either |> or %>% and use that wherever the two work
the same, falling back to the magrittr pipe only where it behaves
differently, such as using a period to specify which argument of a
function receives the piped data?

There are programs people want to keep frozen, so that they use only
the versions of R and packages that existed at some moment, avoiding
some inevitable conflicts. So I despair that older versions of R may
stick around far too long and break with any newer packages. But
languages cannot remain totally static, or chances are people will
move on to newer languages that offer things they want. Then again,
there seem to still be COBOL programs out there.

-----Original Message-----
From: Richard O'Keefe <raoknz at gmail.com>
Sent: Sunday, March 16, 2025 12:32 AM
To: avi.e.gross at gmail.com
Cc: Kevin Zembower <kevin at zembower.org>; r-help at r-project.org
Subject: Re: [R] What don't I understand about sample()?

> [quoted thread trimmed; it appears in full in the message above]
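On the question raised above about detecting native-pipe support: one caveat is that |> is syntax rather than a function, so source code containing it literally will fail to *parse* on pre-4.1.0 R. A package can therefore only choose between pipes at run time if it avoids writing |> in its sources, for instance by defining its own operator. A sketch (the operator name `%|>%` is hypothetical, not part of any package):

```r
# Define a pipe-like operator we control, usable on any R version.
# On R >= 4.1.0 we could rely on the native |> in new code; here the
# fallback is magrittr's %>% if installed, else a plain-function pipe.
`%|>%` <- if (requireNamespace("magrittr", quietly = TRUE) &&
              getRversion() < "4.1.0") {
  magrittr::`%>%`
} else {
  # Minimal pipe: apply the right-hand function to the left-hand value.
  function(lhs, rhs) rhs(lhs)
}

# User-defined %...% operators are left-associative, so this is
# sum(sqrt(c(1, 4, 9))) = 1 + 2 + 3 = 6.
x <- c(1, 4, 9) %|>% sqrt %|>% sum
```

The minimal fallback only covers the simple "pipe into a bare function" case; magrittr's dot placeholder and the native pipe's call syntax differ in ways this sketch deliberately avoids.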
avi.e.gross at gmail.com
2025-Mar-16 05:12 UTC
[R] What don't I understand about sample()?
Richard,

After some thinking, I conclude that founders of things, including
ancient religious figures as well as people who have developed
programming languages, might not recognize what follows later. As an
example, some religions have customs revolving around trees and snow,
or around potato pancakes, when the early version took place in
warmer climes and long before potatoes had been introduced to
Eurasia.

In the same way, many people would NOT build the language the same
way if they were doing it now. They would not be stuck with some
early, and sometimes arbitrary, decisions. I am not defending the
tidyverse nor attacking it. Some parts of it have kept changing, with
others being deprecated, and if built anew it might be a bit
different. But, yes, in one sense it is a language of its own, tacked
on top of base R, which is in turn tacked on top of a sparser
underlying part of R. I happen to like quite a bit of it as having a
certain consistency, approach, and functionality that is better
thought out than early pure R. Not everything, but some things. And
the two largely inter-operate, so it is possible to do tasks
alternating between the two, as well as incorporating other packages.

There are parts of R that I, as a matter of opinion, feel were badly
designed and built, in the sense that they seem complex and easy to
do things wrong in. Others disagree. In a similar vein, there are
parts of the tidyverse that don't feel right to me, and some of what
has been deprecated feels more right. As time went on, some simple
functionality was made so general and flexible that the simpler and
more common cases feel almost hidden and hard to do. A new approach
to building an R-like language might, as one example, move directly
to one major way of being object-oriented rather than the many
available now. Attempts to create a sort of overarching way may work,
but are perhaps more kludge than if they had been built in directly,
with no S3 or S4 or ...

But it is in a sense too late to change, and also too early. A new R
might be so incompatible that many or most programs would stop
working right, or at all. Python is an example where they bit the
bullet: version 3.x is different enough from 2.x that many programs
needed a serious rewrite. Too bad the original could not have started
out that way, but who knew what the future might bring?

Still, I find it a reasonable compromise to allow a language to layer
mini-languages within or above itself. An obvious example is how
regular expressions can extend a language. Some languages, like
Scala, allow fairly serious ways to extend the language almost
seamlessly, as you can create what look like all kinds of new
keywords and operators that just work. When working in a region of
code using objects and associated methods that appear inline between
objects, you might wonder whether you are using some other language
entirely. Move to a different part of the code, and it may look like
yet another universe. R was not designed with all these concepts, but
packages can often extend it nicely, and I suggest that is as good a
way to extend the language as many others, albeit things can break.

-----Original Message-----
From: Richard O'Keefe <raoknz at gmail.com>
Sent: Sunday, March 16, 2025 12:32 AM
To: avi.e.gross at gmail.com
Cc: Kevin Zembower <kevin at zembower.org>; r-help at r-project.org
Subject: Re: [R] What don't I understand about sample()?

> [quoted thread trimmed; it appears in full in the messages above]
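For the sample()/matrix() question that started the thread, the matrix-based bootstrap from the quoted Larget notes can be sketched in base R alone. The data values here are made up for illustration; the real exercise would use the BMGain column of the book's data set. Note that, unlike the permutation test quoted earlier (which uses sample() without replacement), a bootstrap resamples *with* replacement:

```r
set.seed(1)                       # reproducible resampling
bm.gain <- c(7.3, 5.9, 6.1, 8.8, 4.5, 6.6, 7.0, 5.2)  # made-up values

B <- 1000                         # number of bootstrap resamples
n <- length(bm.gain)

# One row per bootstrap resample, one column per observation.
# sample(..., replace = TRUE) draws all B*n values at once.
boot.mat <- matrix(sample(bm.gain, size = B * n, replace = TRUE),
                   nrow = B, ncol = n)

# The bootstrap distribution of the mean is the vector of row means.
boot.means <- rowMeans(boot.mat)

# A 95% percentile interval for the mean:
quantile(boot.means, c(0.025, 0.975))
```

Drawing all B*n values in one call and reshaping with matrix() is equivalent to looping sample() B times, but vectorized; apply(boot.mat, 1, mean) would compute the same row means as rowMeans().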