On Dec 19, 2010, at 5:31 AM, Tom Wilding wrote:
> Dear Mailing List
>
> I have a data set (data4) consisting of a number of factors and a
> response variable. I wish to randomly sample from a combination of
> two of those factors (GIS_station and Distance_code2) and return a
> new dataframe containing the original data structure (i.e. all the
> columns) but only containing the randomly selected rows. The number
> of rows in each combination of GIS_station and Distance_code2 vary
> (widely) and some combinations are absent.
>
> This is getting there:
> with (data4,{
> sub_sample10=by(data4,list(GIS_station,Distance_code2), function(x)
> {sample(1:nrow(x),10,replace=T)})
> })
>
> ....but just generates two random numbers from the range 1:nrow(x).
Only 2? Your argument to sample is 10.
> It doesn't return the selected rows, which is what I want.
And those row numbers would not refer to positions in the original
data frame either; they would refer to rows within the subset passed
to the function for each combination. You have not yet
done a very good job of specifying what sampling strategy is needed.
At the moment you seem to be working toward a strategy that would
potentially be very uneven in terms of the probabilities that members
of different combinations would get into the sample, since the number
being chosen is fixed and the number to be chosen from "varies
widely". Is that really what you want?
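To make that concern concrete, here is a minimal sketch on made-up data
(the data frame toy and its columns g and y are hypothetical, not taken
from the original post). With a fixed draw per group and no replacement,
the chance that a given row is selected is the draw size divided by the
group size, so rows in small groups are far more likely to be picked.
Drawing a fixed fraction per group is one possible alternative, shown
only as an illustration:

```r
# Hypothetical toy data: three groups of very different sizes
toy <- data.frame(g = rep(c("a", "b", "c"), times = c(2, 10, 50)),
                  y = seq_len(62))

n_fixed <- 2                     # fixed number of rows drawn per group
sizes <- table(toy$g)            # rows available in each group
incl_prob <- n_fixed / sizes     # per-row inclusion probability, no replacement
print(round(incl_prob, 2))

# One alternative (an assumption, not necessarily what is wanted here):
# draw a fixed fraction of each group, but at least one row
frac <- 0.2
prop_sample <- do.call(rbind, lapply(split(toy, toy$g), function(x) {
  k <- max(1, round(frac * nrow(x)))
  x[sample(seq_len(nrow(x)), k), , drop = FALSE]
}))
```

Whether equal counts or equal fractions per combination is appropriate
depends on what the sample is going to be used for.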
>
> I'm sure this could be done in an elegant manner, using a
> subscript e.g.
>
> sub_sample10 = data4 [sample (1:nrow (data4), size=10), ]
(You also have not provided a reproducible data example. Next time
bring data.)
This works to sample 3 rows from each of the distinct categories in
the warpbreaks data object:
# returns a list with 6 members, each of which is a three-row dataframe
by(warpbreaks, list(warpbreaks$wool, warpbreaks$tension),
   FUN=function(x) x[sample(1:nrow(x), 3), ] )
And this would stick them back together in one dataframe:
do.call(rbind, by(warpbreaks,
                  list(warpbreaks$wool, warpbreaks$tension),
                  FUN=function(x) x[sample(1:nrow(x), 3), ] ) )
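One caveat: sample() without replacement stops with an error when a
combination has fewer rows than the number requested, and the original
post says the group sizes vary widely. Below is a minimal sketch of one
way to guard against that by capping the draw at the group size. The
helper name sample_rows and its argument n are made up for illustration:

```r
# Hypothetical helper: draw up to n rows from data frame x, no replacement
sample_rows <- function(x, n) {
  k <- min(n, nrow(x))           # never ask for more rows than exist
  x[sample(seq_len(nrow(x)), k), , drop = FALSE]
}

# warpbreaks ships with R; each wool-by-tension cell has 9 rows
pieces <- by(warpbreaks, list(warpbreaks$wool, warpbreaks$tension),
             FUN = function(x) sample_rows(x, 3))
sub_sample <- do.call(rbind, pieces)   # one data frame, all original columns
```

Combinations that are absent from the data come back from by() as NULL,
and rbind() drops them, so they should not need special handling.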
--
David.
>
> only somehow combining it with the 'by' statement (e.g. by (data4,
> list (GIS_station, Distance_code2).......)) but I cannot get this to
> work.
>
> Any guidance on this much appreciated.
>
> Thankyou.
David Winsemius, MD
West Hartford, CT