thr3ads.net - R help - [R] bootstrap: stratified resampling [Jun 2004]

Ramon Diaz-Uriarte

2004-Jun-08 16:48 UTC

[R] bootstrap: stratified resampling

Dear All,

I was writing a small wrapper to bootstrap a classification algorithm, but if 
we generate the indices in the "usual way" as:

bootindex <- sample(index, N, replace = TRUE)

there is a non-zero probability that all the samples belong to only 
one class, thus leading to problems in the fitting (or that some classes will 
end up with only one sample, which will be a problem for quadratic 
discriminant analysis).
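
To get a feel for how large that probability can be, here is a minimal sketch in base R. The class labels (3 cases of class "a", 7 of class "b") are made up for illustration and are not from the original problem. With class proportions p_k, the chance that all N draws land in a single class is the sum of p_k^N over classes, and the simulation draws the indices in the "usual way" as a check.

```r
# Hypothetical class labels: 3 cases of class "a", 7 of class "b".
y <- factor(rep(c("a", "b"), times = c(3, 7)))
N <- length(y)
index <- seq_len(N)

# Exact probability that a bootstrap sample contains one class only:
# all N draws must fall in the same class.
p_k <- as.numeric(table(y)) / N
p_one_class <- sum(p_k^N)

# Monte Carlo check, drawing the indices without any stratification.
set.seed(1)
B <- 20000
one_class <- replicate(B, {
  bootindex <- sample(index, N, replace = TRUE)
  length(unique(y[bootindex])) == 1
})

c(exact = p_one_class, simulated = mean(one_class))
```

The probability grows quickly as the classes become more unbalanced or N gets smaller, and the chance of a class ending up with just one or two cases is larger still.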

I thought this situation should be frequent enough to be mentioned in the 
literature, but I have found almost no mention in the references I have 
available, except for Hirst (see below). If I've reread correctly, this issue 
is not mentioned in Efron & Tibshirani (1997; the .632+ paper), or in Efron 
and Gong (the TAS "leisurely look" paper), or the Efron & Tibshirani 1993 
bootstrap book, or Chernick's "Bootstrap methods" book. I've only seen some 
side mentions in Ripley's Pattern recognition (when talking about stratified 
cross-validation), and Davison & Hinkley's bootstrap book when, on p. 304, 
they refer to some subsets having singular design matrices, and thus 
requiring stratification on covariates. McLachlan (in his discriminant 
analysis book), on p. 347, differentiates between mixture sampling and 
separate sampling, but I cannot find a mention of what to do when, under 
mixture sampling, we end up with all samples in only one group.

Only Hirst (1996, Technometrics, 38 (4): 389--399) says that each bootstrap 
sample should include at least one observation for each group, and at least 
enough different observations from each group to allow estimation of the 
covariance matrix (he is referring to discriminant analysis), and thus he 
uses essentially stratified bootstrap samples.
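
A rough sketch of what such stratified index generation could look like in base R follows. The function names stratified_bootindex and distinct_per_class are hypothetical helpers written for this illustration, not taken from Hirst's paper or from any package. Resampling within each class guarantees that every class keeps its original size, but it does not by itself guarantee enough distinct observations per class, so the second helper only reports that count.

```r
# Draw bootstrap indices separately within each class, so every class
# keeps its original sample size in the bootstrap sample.
stratified_bootindex <- function(y) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    # Indexing with sample.int() avoids the sample() surprise that occurs
    # when a class has exactly one member.
    idx[sample.int(length(idx), replace = TRUE)]
  }), use.names = FALSE)
}

# Number of distinct original observations per class in a bootstrap sample.
distinct_per_class <- function(y, bootindex) {
  tapply(bootindex, y[bootindex], function(i) length(unique(i)))
}

# Hypothetical labels, for illustration only.
y <- factor(rep(c("a", "b"), times = c(3, 7)))
set.seed(1)
bootindex <- stratified_bootindex(y)
table(y[bootindex])               # same class counts as table(y)
distinct_per_class(y, bootindex)  # compare with what the classifier needs
```

If the distinct count for some class is too small for the covariance matrix to be estimated, one option (an assumption here, not something the sources above prescribe) is to discard that bootstrap sample and draw again.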

Interestingly, the "boot" function (boot library) says "For nonparametric 
multi-sample problems stratified resampling is used." As well, the 
predab.resample (Design library) says "group: a grouping variable used to 
stratify the sample upon bootstrapping. This allows one to handle k-sample 
problems, (...)".
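
For reference, this is roughly how the strata argument of boot() is used. The data frame dat and the statistic mean_diff are invented for this sketch, and a difference in class means stands in for a classifier because, like a classifier, it cannot be computed when one class is missing from the resample.

```r
library(boot)

# Hypothetical two-class data: one numeric feature x and a class label y.
set.seed(1)
dat <- data.frame(
  x = c(rnorm(5, mean = 0), rnorm(15, mean = 2)),
  y = factor(rep(c("a", "b"), times = c(5, 15)))
)

# Statistic with the (data, indices) signature that boot() expects.
# It returns NaN if either class is absent from the resample.
mean_diff <- function(data, indices) {
  d <- data[indices, ]
  mean(d$x[d$y == "b"]) - mean(d$x[d$y == "a"])
}

# Passing the class label as strata makes boot() resample within each
# class, so both classes appear in every replicate.
b_strat <- boot(dat, mean_diff, R = 200, strata = dat$y)
summary(as.vector(b_strat$t))
```

Leaving out strata gives the ordinary (mixture-style) resampling, where the statistic can come back as NaN in some replicates.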

That the authors of boot and Design are using stratified resampling indicates 
to me that this might be the obvious, unproblematic way to go, but I 
understood that stratified resampling was OK only when that was the sampling 
scheme that generated the data.

What am I missing?

Thanks,

R.


-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)
