Henrik Bengtsson
2010-Nov-03 17:54 UTC
[Rd] Using sample() to sample one value from a single value?
Hi, consider this one as an FYI, or a seed for further discussion.

I am aware that many traps in sample() have been reported over the years. I know that these are also documented in help("sample"). Still, I got bitten by this while writing

  sample(units, size=length(units));

where 'units' is an index (positive integer) vector. It works as I expect in all cases except for length(units) == 1. I know, it is well known. However, it made me wonder whether it is possible to use sample() to draw a single value from a set containing only one value. I don't think so, unless you draw from a value that is <= 1.

For instance, you can sample from c(10,10) by doing:

> sample(rep(10, times=2), size=2);
[1] 10 10

but you cannot sample from c(10) by doing:

> sample(rep(10, times=1), size=1);
[1] 9

unless you sample from a value <= 1, e.g.

> sample(rep(0.31, times=1), size=1);
[1] 0.31
> sample(rep(-10, times=1), size=1);
[1] -10

Note also the related issue of sampling from a double vector of length 1, e.g.

> sample(rep(1.2, times=2), size=2);
[1] 1.2 1.2
> sample(rep(1.2, times=1), size=1);
[1] 1

In the latter case 1.2 is coerced to an integer.

All of the above makes sense when one studies the code of sample() (a single numeric x >= 1 is treated as the set 1:x), but sample() is indeed dangerous, e.g. imagine how many bootstrap estimates out there are quietly incorrect.

In order to cover all cases of length(units), including one, a solution is:

sampleFrom <- function(x, size=length(x), ...) {
  n <- length(x);
  if (n == 1L) {
    # A single value: return it as is, bypassing the 1:x behavior of sample()
    res <- x;
  } else {
    res <- sample(x, size=size, ...);
  }
  res;
} # sampleFrom()

> sampleFrom(rep(10, times=2), size=2);
[1] 10 10
> sampleFrom(rep(10, times=1), size=1);
[1] 10
> sampleFrom(rep(0.31, times=1), size=1);
[1] 0.31
> sampleFrom(rep(-10, times=1), size=1);
[1] -10
> sampleFrom(rep(1.2, times=2), size=2);
[1] 1.2 1.2
> sampleFrom(rep(1.2, times=1), size=1);
[1] 1.2

I want to add sampleFrom() to the wishlist of functions to be available in default R. Alternatively, one can add an argument 'sampleFrom=FALSE' to the existing sample() function. Eventually such an argument can be made TRUE by default.

/Henrik
Henrique Dallazuanna
2010-Nov-03 18:02 UTC
[Rd] Using sample() to sample one value from a single value?
Doesn't the resample() function in the Examples section of the sample() help page already do this?

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
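For reference, a minimal sketch of the index-based idea behind the resample() example in help("sample"): draw positions with sample.int() and then subset, so a length-one numeric vector is never reinterpreted as 1:x. The calls below are illustrative only and are not taken from the thread.

```r
# Sample positions 1..length(x), then subset x by those positions.
# A length-one x is therefore returned as is, whatever its value.
resample <- function(x, ...) x[sample.int(length(x), ...)]

resample(10)                            # 10, never a draw from 1:10
resample(1.2, size = 1)                 # 1.2, not coerced to 1
resample(c(10, 10), size = 2)           # 10 10
resample(10, size = 3, replace = TRUE)  # 10 10 10
resample(integer(0))                    # integer(0): empty input, empty result
```

Unlike a special case for length(x) == 1, this form also honors 'size' and 'replace' when x has a single element.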
Tim Hesterberg
2010-Nov-04 14:42 UTC
[Rd] Using sample() to sample one value from a single value?
On Wed, Nov 3, 2010 at 3:54 PM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
> Hi, consider this one as an FYI, or a seed for further discussion.
>
> I am aware that many traps in sample() have been reported over the
> years. I know that these are also documented in help("sample"). Still
> I got bitten by this while writing
>...
> All of the above makes sense when one studies the code of sample(), but
> sample() is indeed dangerous, e.g. imagine how many bootstrap
> estimates out there are quietly incorrect.

Nonparametric bootstrapping from a sample of size 1 is <always> incorrect. If you draw a single observation from a sample of size 1, you get that same observation back. This implies zero sampling variability, which is wrong. If this single sample represents one stratum or sample in a larger problem, this would contribute zero variability to the overall result, again wrong.

In general, the ordinary bootstrap underestimates variability in small samples. For a sample mean, the ordinary bootstrap corresponds to using an estimate of variance equal to (1/n) sum((x - mean(x))^2), i.e. a divisor of n instead of n-1. In stratified and multi-sample applications the downward bias factor is similarly (n-1)/n.

Three remedies are:
* draw bootstrap samples of size n-1
* "bootknife" sampling - omit one observation (a jackknife sample), then
  draw a bootstrap sample of size n from that
* bootstrap from a kernel density estimate, with kernel covariance equal
  to empirical covariance (with divisor n-1) / n.

The latter two are described in
Hesterberg, Tim C. (2004), Unbiasing the Bootstrap-Bootknife Sampling vs. Smoothing, Proceedings of the Section on Statistics and the Environment, American Statistical Association, 2924-2930.
http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf

All three are undefined for samples of size 1. You need to go to some other bootstrap, e.g. a parametric bootstrap with variability estimated from other data.

Tim Hesterberg
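The (n-1)/n factor and the first two remedies can be sketched in R as follows. This is an illustration under assumptions, not code from the cited paper: the data vector, the seed, the replicate count and the helper name bootknife() are all made up here, and the kernel-smoothing remedy is not shown.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x <- c(2.1, 3.4, 1.9, 5.0, 4.2)   # hypothetical data
n <- length(x)

# Variance with divisor n (what the ordinary bootstrap implies)
# versus the usual estimate with divisor n - 1.
v_plugin   <- sum((x - mean(x))^2) / n
v_unbiased <- var(x)
v_plugin / v_unbiased             # equals (n - 1) / n

# Ordinary bootstrap of the mean: n draws with replacement.
boot_means <- replicate(5000, mean(x[sample.int(n, n, replace = TRUE)]))

# Remedy 1: bootstrap samples of size n - 1.
small_means <- replicate(5000, mean(x[sample.int(n, n - 1, replace = TRUE)]))

# Remedy 2 (hypothetical helper): omit one observation at random,
# then draw n values with replacement from the remaining n - 1.
bootknife <- function(x) {
  n <- length(x)
  stopifnot(n >= 2L)              # undefined for a sample of size 1
  kept <- x[-sample.int(n, 1L)]
  kept[sample.int(n - 1L, n, replace = TRUE)]
}
knife_means <- replicate(5000, mean(bootknife(x)))

# Compare the simulated variances of the mean with var(x) / n.
c(ordinary       = var(boot_means),
  size_n_minus_1 = var(small_means),
  bootknife      = var(knife_means),
  target         = v_unbiased / n)
```

With simulation noise aside, the ordinary bootstrap value should fall below the target by roughly the factor (n-1)/n, while the two remedies should land near it.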
Henrik Bengtsson
2010-Nov-04 17:59 UTC
[Rd] Using sample() to sample one value from a single value?
Hi.

On Thu, Nov 4, 2010 at 7:42 AM, Tim Hesterberg <timhesterberg at gmail.com> wrote:
>...
> All three are undefined for samples of size 1. You need to go to some
> other bootstrap, e.g. a parametric bootstrap with variability estimated
> from other data.

I had a feeling that I was going to be bitten by that attention grabber on bootstrapping. Worse it may be misleading to some. But honestly, thank you Tim for pointing this out and so clearly explaining it all.

/Henrik