Hi all:

I've been looking at the boot package to "bootstrap" sample my data in a particular way. I haven't figured out how to set this up using the boot() command and thus have resorted to trying to write my own script (although I'd prefer if I could get boot() to work for this problem!)

The dataset is set up in the following way:

    ix(factor)  value
    1           5.73
    1           6.99
    1           0.32
    1           4.64
    1           8.39
    2           8.47
    2           1.04
    2           0.73
    2           0.29
    3           6.82
    3           8.81
    3           1.33
    3           9.17
    3           9.84
    4           8.57
    4           5.04
    4           7.18
    4           4.54
    4           4.37
    5           7.36
    5           4.97
    5           2.66

What I would like to do is repeatedly sample the ix (a factor), not the individual rows. For example, say I wanted to repeatedly sample (at a sample size of 3) the ix value, e.g. 1,3,5, then average the "value"s within those factors and then, let's say, take the median across those means.

So for a random sample of (1,3,5) that would be:

    median(c(mean(c(5.73,6.99,0.32,4.64,8.39)),
             mean(c(6.82,8.81,1.33,9.17,9.84)),
             mean(c(7.36,4.97,2.66))))

Then repeat this over combinations of 3 ix factors, e.g. (1,2,3), (1,1,4), etc.

Is it possible to subsample a factor using boot() and then use that sample of factors to access rows, rather than directly sample rows?

Thanks!!!
-Scott
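For reference, the single-sample computation described above can be sketched in base R as follows. The names `dat`, `group_means`, `picked` and `stat` are made up for this illustration and do not come from the original post; the data are the 22 rows listed in the question.

```r
# Data as listed in the question (22 rows, 5 levels of ix).
dat <- data.frame(
  ix = factor(rep(1:5, times = c(5, 4, 5, 5, 3))),
  value = c(5.73, 6.99, 0.32, 4.64, 8.39,
            8.47, 1.04, 0.73, 0.29,
            6.82, 8.81, 1.33, 9.17, 9.84,
            8.57, 5.04, 7.18, 4.54, 4.37,
            7.36, 4.97, 2.66)
)

# Mean of 'value' within each level of ix (a named vector, one entry per level).
group_means <- tapply(dat$value, dat$ix, mean)

# One particular sample of three levels, as in the example (1,3,5).
picked <- c("1", "3", "5")

# Median across the three per-level means.
stat <- median(group_means[picked])
```

A random draw of three levels with repeats allowed, such as (1,1,4), could be obtained with `sample(levels(dat$ix), 3, replace = TRUE)` in place of the fixed `picked` vector.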
Thomas W Blackwell
2003-Nov-11 03:43 UTC
[R] boot package question: sampling on factor, not row
Scott -

The second argument to boot(), called 'statistic', can be any user-written function you want to cook up, with additional arguments being passed to it through the '...' mechanism after all of the named arguments. (See: 'R-intro', 'Writing your own functions', 'The ellipsis argument' for details.)

To carry out your example, I would do something like the following: (not tested ! use at your own risk.)

    my.summary <- function(data, groups, ix, value)  {
        # 'data' is ignored; 'groups' holds the resampled level indices.
        # aggregate() returns a data frame; column 'x' holds the level means.
        means <- aggregate(value, list(ix), mean)$x
        median(means[groups[seq(3)]])
    }

    library("boot")
    result <- boot(seq(along=levels(ix)), my.summary, 10000,
                   ix=ix, value=value)

You will note that what boot() thinks is the "data" in the example here is only a vector of sequential integers the same length as levels(ix). This data is ignored in my.summary() and the two columns which you show as "ix" and "value" are used instead.

Furthermore, unless I misunderstand your example, the mean within each level of "ix" is invariant to which three levels have been chosen for this particular bootstrap replicate. Therefore, you could call aggregate() only once rather than 10000 times, if you rewrite the function my.summary() to use the result of aggregate() rather than call it afresh on every iteration.

I've given you the reference for the '...' mechanism, because that reference is almost impossible to find using help.search(). For the rest of the functions I've used, you're on your own to look up their help pages.

I *will* comment that I can't see why this particular statistic is of interest . . . but, I assume you have your own reasons.

HTH

- tom blackwell - u michigan medical school - ann arbor -

On Mon, 10 Nov 2003, Scott Norton wrote:

> Is it possible to subsample a factor using boot() and then use
> that sample of factors to access rows, rather than directly sample
> rows?
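The suggestion to compute the per-level means only once could look roughly like the sketch below. This is an illustration, not the code from the reply: the names `dat`, `group_means` and `median_of_three` are hypothetical, the number of replicates is reduced to 1000, and the vector of level means itself is handed to boot() as the "data", so no extra arguments need to go through '...'.

```r
library(boot)

# Data as listed in the question.
dat <- data.frame(
  ix = factor(rep(1:5, times = c(5, 4, 5, 5, 3))),
  value = c(5.73, 6.99, 0.32, 4.64, 8.39,
            8.47, 1.04, 0.73, 0.29,
            6.82, 8.81, 1.33, 9.17, 9.84,
            8.57, 5.04, 7.18, 4.54, 4.37,
            7.36, 4.97, 2.66)
)

# The level means do not change between replicates, so compute them once.
group_means <- tapply(dat$value, dat$ix, mean)

# boot() calls statistic(data, indices). Here 'means' is the vector of level
# means and 'indices' is a with-replacement resample of 1..5. Keeping the
# first three indices yields a with-replacement sample of three levels.
median_of_three <- function(means, indices) {
  median(means[indices[1:3]])
}

set.seed(1)  # only so the sketch is reproducible
result <- boot(group_means, median_of_three, R = 1000)
```

The replicates are in `result$t`, and `result$t0` is the statistic evaluated on the unshuffled indices, i.e. the median of the means of levels 1, 2 and 3.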
Thomas W Blackwell
2003-Nov-11 13:46 UTC
[R] boot package question: sampling on factor, not row
On Tue, 11 Nov 2003, Prof Brian Ripley wrote:

> On Mon, 10 Nov 2003, Thomas W Blackwell wrote:
>
> > The second argument to boot(), called 'statistic', can be
> > any user-written function you want to cook up, with additional
> > arguments being passed to it through the '...' mechanism after
> > all of the named arguments. (See: 'R-intro', 'Writing your own
> > functions', 'The ellipsis argument' for details.)
>
> > I've given you the reference for the '...' mechanism, because
> > that reference is almost impossible to find using help.search().
>
> Right, as help.search 'allows for searching the help system'. It does not
> search the manuals, nor the FAQs, so it would be impossible to find things
> not in the help system.
>
> --
> Brian D. Ripley, ripley at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595

Precisely my point, Brian. The usage and meaning of '...' are almost impossible to find in the help system. Could there be a help page for it ? Questions about '...' are reasonably frequently asked on this list.

While we're at it, what could be done so that help.search("logistic") returns a reference to glm() and help.search("regression") returns references to both lm() and glm() ?

- tom blackwell - u michigan medical school - ann arbor -
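Since the '...' mechanism comes up repeatedly in this thread, here is a minimal sketch of how it forwards extra arguments to a user-supplied function, which is the same pattern boot() uses for its 'statistic' argument. The wrapper `run_stat` is a hypothetical name invented for this illustration; it is not part of boot or base R.

```r
# A wrapper that accepts any extra arguments via '...' and hands them on
# unchanged to the function it was given.
run_stat <- function(data, statistic, ...) {
  statistic(data, ...)
}

# 'na.rm = TRUE' is not an argument of run_stat(); it travels through '...'
# and ends up as an argument to mean().
res <- run_stat(c(1, NA, 3), mean, na.rm = TRUE)

# Without the extra argument, mean() sees the NA and returns NA.
res_na <- run_stat(c(1, NA, 3), mean)
```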