Dear All,
I was writing a small wrapper to bootstrap a classification algorithm, but if
we generate the indices in the "usual way" as:
bootindex <- sample(index, N, replace = TRUE)
there is a non-zero probability that all the samples belong to only
one class, thus leading to problems in the fitting (or that some classes will
end up with only one sample, which will be a problem for quadratic
discriminant analysis).
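To give a feel for how often this happens, here is a rough sketch. The class sizes (15 and 5) are made up for illustration, and the names y, n_k, p_absent and p_sim are mine. It compares the exact probability that a given class is absent from one ordinary bootstrap sample, (1 - n_k/N)^N, with a simulation:

```r
# Hypothetical labels: 15 observations in class "a", 5 in class "b"
y <- factor(rep(c("a", "b"), times = c(15, 5)))
N <- length(y)
n_k <- c(table(y))  # named vector of class sizes

# Exact probability that class k is entirely absent from one
# ordinary bootstrap sample of size N
p_absent <- (1 - n_k / N)^N

# Monte Carlo check: fraction of bootstrap samples missing any class
set.seed(123)
B <- 10000
missing_class <- replicate(B, {
  bootindex <- sample(seq_len(N), N, replace = TRUE)
  any(table(y[bootindex]) == 0)  # factor levels are kept, so zeros show up
})
p_sim <- mean(missing_class)

print(p_absent)
print(p_sim)
```

With smaller or more unbalanced classes the probability grows quickly, and the chance of a class ending up with too few distinct observations for QDA is larger still.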
I thought this situation would be frequent enough to be mentioned in the
literature, but I have found almost no mention of it in the references I have
available, except for Hirst (see below). If I have reread them correctly, this
issue is not mentioned in Efron & Tibshirani (1997; the .632+ paper), in Efron
& Gong (the TAS "leisurely look" paper), in the Efron & Tibshirani 1993
bootstrap book, or in Chernick's "Bootstrap methods" book. I have only seen
some side mentions in Ripley's Pattern recognition (when talking about
stratified cross-validation), and in Davison & Hinkley's bootstrap book when,
on p. 304, they refer to some subsets having singular design matrices, and thus
requiring stratification on covariates. McLachlan (in his discriminant analysis
book), on p. 347, differentiates between mixture sampling and separate
sampling, but I cannot find any mention of what to do when, under mixture
sampling, we end up with all samples in only one group.
Only Hirst (1996, Technometrics, 38 (4): 389--399) says that each bootstrap
sample should include at least one observation for each group, and at least
enough different observations from each group to allow estimation of the
covariance matrix (he is referring to discriminant analysis), and thus he
uses essentially stratified bootstrap samples.
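As I read it, a stratified bootstrap of this kind could be sketched as follows. This is my own illustration, not Hirst's code; stratified_bootindex is a hypothetical helper and the labels are made up. It resamples with replacement within each class, so every class keeps its original size, but it does not by itself guarantee enough distinct observations per class for the covariance matrix:

```r
# Hypothetical helper: resample indices with replacement within each class
stratified_bootindex <- function(y) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    # index via sample.int to avoid sample()'s behaviour on length-1 vectors
    idx[sample.int(length(idx), length(idx), replace = TRUE)]
  }), use.names = FALSE)
}

# Made-up labels: 15 in class "a", 5 in class "b"
y <- factor(rep(c("a", "b"), times = c(15, 5)))
set.seed(1)
bootindex <- stratified_bootindex(y)

# Class counts in the bootstrap sample match the original counts
print(table(y[bootindex]))
```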
Interestingly, the "boot" function (boot library) says "For nonparametric
multi-sample problems stratified resampling is used.". Likewise,
predab.resample (Design library) says "group: a grouping variable used to
stratify the sample upon bootstrapping. This allows one to handle k-sample
problems, (...)".
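For boot, I believe this corresponds to the strata argument. A minimal sketch, with made-up data and a hypothetical statistic (class_counts) that just returns the class counts of each resample, so that the effect of stratifying is visible:

```r
library(boot)

# Made-up data: 9 observations in group "a", 3 in group "b"
set.seed(1)
d <- data.frame(x = rnorm(12),
                g = factor(rep(c("a", "b"), times = c(9, 3))))

# Hypothetical statistic: class counts in the resample given by indices i
class_counts <- function(data, i) as.vector(table(data$g[i]))

# strata = d$g makes boot resample within each group
b <- boot(d, class_counts, R = 200, strata = d$g)

# Every replicate keeps the original group sizes
print(unique(b$t))
```

In a real use the statistic would refit the classifier on data[i, ] and return an error estimate; the counts here are only to show what stratification does.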
That the authors of boot and Design use stratified resampling suggests
to me that this might be the obvious, unproblematic way to go, but I had
understood that stratified resampling was OK only when that was the sampling
scheme that generated the data.
What am I missing?
Thanks,
R.
--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)