thr3ads.net - R help - [R] Question About Repeat Random Sampling from a Data Frame [Dec 2009]

Adam Carr

2009-Dec-21 15:12 UTC

[R] Question About Repeat Random Sampling from a Data Frame

Good Morning:

I've read many, many posts on the r-help system and I feel compelled to
quickly admit that I am relatively new to R. I do have several reference books
around me, but I cannot count myself among the fortunate who seem to have
strong programming intuition.

I have a data set consisting of 1637 observations of five variables: tensile
strength, yield strength, elongation, hardness and a character indicator with
three levels: (Y)es, (N)o, and (F)ail.

My objective is to randomly sample various subsets from this data set and then
evaluate these subsets using simple parameters, among them tests for normality,
shape and skewness. The data set is ordered by the character variable prior to
sampling, and the samples are weighted to mirror representation in an overall,
physical process.

I am sampling the data set using this code:

sample <- dataset[sample(1:1637, 500,
    prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
    replace = TRUE), ]

What I would like to do is iterate this process to create many (say 500 or more)
sampled sets of n=500 and then evaluate each set for the parameters of interest.
I would actually be evaluating each variable within each subset for my
characteristic of interest. I am familiar with sampling and saving single
columns of data to do this sort of thing, but I am not sure how to accomplish
this with a multiple-variable data set.

For example, I am currently iterating this using a clunky process:

mysamples <- list()
for (i in 1:10) {
    mysamples[[i]] <- dataset[
        sample(1:1637, 100,
            prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
            replace = TRUE), ]
}

But this leaves me with the additional task of defining each mysample[i]
iteration and converting it to a form on which I can apply a standard
statistical test like mean() or skewness() to the variable columns within each
subset. I have attempted to iteratively convert these lists using this code:

mat<-matrix(nrow=100,ncol=5)
for (i in 1:length(mysamples))
{mat[i]<-do.call('rbind',mysamples[i])}

but running the code generates the error message: number of items to replace is
not a multiple of replacement length. I have tried unsuccessfully, by reading
many, many helpful r-help emails on this error, to understand my probably
obvious mistake.

Based on the small amount that I think I know about R it seems to me that
sampling the data frame and containing the samples in a list is likely a pretty
inefficient way to do this task. Any help that any of you could provide to
assist me in iteratively sampling the data frame, and storing the samples in a
form on which I can apply other statistical tests would be greatly appreciated.

Thank you very much for taking the time to consider my questions.

Adam 


      

Gustaf Rydevik

2009-Dec-21 15:20 UTC

On Mon, Dec 21, 2009 at 4:12 PM, Adam Carr <adamlcarr at yahoo.com>
wrote:
> [...]

That's pretty much how I tend to do those things. What you seem to be
missing is the ?apply family:

mysamples.means<-lapply(mysamples,function(x)mean(x[,1]))
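
As a rough sketch of how the same idea extends to every numeric column at once: the data frame below is made-up stand-in data, and the column names (tensile, yield, elong, hardness, indicator) are assumptions rather than names from the original post.

```r
set.seed(1)
# Hypothetical stand-in for the real 1637-row data set
dataset <- data.frame(
  tensile   = rnorm(60, 500, 20),
  yield     = rnorm(60, 400, 15),
  elong     = rnorm(60, 25, 3),
  hardness  = rnorm(60, 80, 5),
  indicator = rep(c("F", "N", "Y"), times = c(10, 15, 35))
)

# Ten samples of 20 rows each (drawn with replacement), kept in a list
mysamples <- lapply(1:10, function(i) {
  dataset[sample(nrow(dataset), 20, replace = TRUE), ]
})

# Column means of the four numeric variables; sapply() binds the
# per-sample vectors into a matrix with one column per sample
num_cols <- c("tensile", "yield", "elong", "hardness")
sample_means <- sapply(mysamples, function(x) colMeans(x[, num_cols]))
dim(sample_means)
```

Swapping colMeans() for another function of a single data frame gives other per-sample summaries in the same shape.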


Hope that gets you on your way. If you want more help, I'd suggest
including an example data set in your follow-up messages.

/Gustaf

-- 
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address:Essingetorget 40,112 66 Stockholm, SE
skype:gustaf_rydevik

David Winsemius

2009-Dec-21 16:23 UTC

On Dec 21, 2009, at 10:12 AM, Adam Carr wrote:
> Good Morning:
>
> I've read many, many posts on the r-help system and I feel compelled  
> to quickly admit that I am relatively new to R, I do have several  
> reference books around me, but I cannot count myself among the  
> fortunate who seem to have strong programming intuition.
>
> I have a data set consisting of 1637 observations of five variables:  
> tensile strength, yield strength, elongation, hardness and a  
> character indicator with three levels: (Y)es, (N)o, and (F)ail.
>
> My objective is to randomly sample various subsets from this data  
> set and then evaluate these subsets using simple parameters among  
> them tests for normality, shape and skewness. The data set is  
> ordered by the character variable prior to sampling, and the samples  
> are weighted to mirror representation in an overall, physical process.
>
> I am sampling the data set using this code:
>
> sample <- dataset[sample(1:1637, 500,
>     prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
>     replace = TRUE), ]
>
> What I would like to do is iterate this process to create many (say  
> 500 or more) sampled sets of n=500 and then evaluate each set for  
> the parameters of interest. I would actually be evaluating each  
> variable within each subset for my characteristic of interest. I am  
> familiar with sampling and saving single columns of data to do this  
> sort of thing, but I am not sure how to accomplish this with a  
> multiple-variable data set.
>
> For example, I am currently iterating this using a clunky process:
>
> mysamples<-list()
> for (i in 1:10){
> mysamples[[i]] <- dataset[
>     sample(1:1637, 100,
>         prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
>         replace = TRUE), ]
> }
>
Using lists to store intermediate results is not considered clunky in  
R. (You might want to provide statistical justification for the  
otherwise puzzling sampling strategy.)
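
A side note on the weights, offered only as a sketch: sample() treats prob as per-row weights and rescales them to sum to 1, so the share of each group in a sample depends on the group size as well as the weight. The arithmetic below uses only the numbers in the posted call. The target shares in the second half are a hypothetical reading of the intent, not something stated in the post.

```r
# Group sizes and per-row weights exactly as written in the original call
n_rows <- c(513, 197, 927)
row_wt <- c(163.7, 245.5, 1227.8) / 1637

# Expected share of each group in a sample drawn with replacement:
# (rows in group * per-row weight) / total weight
group_share <- n_rows * row_wt / sum(n_rows * row_wt)
round(group_share, 3)

# Hypothetical alternative: if the intent were group shares of 10%, 15%, 75%,
# dividing each target share by the group size gives the per-row weight
target   <- c(0.10, 0.15, 0.75)
prob_alt <- rep(target / n_rows, times = n_rows)
tapply(prob_alt, rep(1:3, times = n_rows), sum)
```

Under the posted weights the third group ends up with a larger share than its per-row weight of roughly 0.75 suggests, which may or may not be what was intended.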
> But this leaves me with the additional task of defining each  
> mysample[i] iteration and converting it to a form on which I can  
> apply a standard statistical test like mean() or skewness() to the  
> variable columns within each subset. I have attempted to iteratively  
> convert these lists using this code:
>
> mat<-matrix(nrow=100,ncol=5)
> for (i in 1:length(mysamples))
> {mat[i]<-do.call('rbind',mysamples[i])}
It would help if you explained what you are attempting here in  
ordinary English. There are 10 elements in mysamples, each of which is  
a 100 x 5 dataframe, and mat is just one 100 x 5 matrix, which you  
seem to be referencing incorrectly, given the fact that it has two,  
rather than one, dimension. Furthermore, those dataframes may not be  
of a uniform class, since you said you had a character variable. Do you  
really want these all in a character type matrix, which would be what  
is likely to happen given R's requirement that matrix elements be of  
only one class? What you say above suggests not.
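
A small sketch of the coercion being described, on made-up data. As far as I know, as.matrix() is the call that yields a character matrix when a character column is present, while data.matrix() converts non-numeric columns to numbers instead, so it is worth checking which one is being called.

```r
# Made-up data frame with two numeric columns and one character column
df <- data.frame(x = c(1.5, 2.5, 3.5),
                 y = c(10, 20, 30),
                 flag = c("Y", "N", "F"),
                 stringsAsFactors = FALSE)

m_all <- as.matrix(df)                 # every cell becomes a character string
m_num <- as.matrix(df[, c("x", "y")])  # numeric columns only: stays numeric
m_dm  <- data.matrix(df)               # non-numeric column turned into numbers

typeof(m_all)
typeof(m_num)
typeof(m_dm)
```

Dropping the character column before converting is usually the simplest way to keep the numbers numeric.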
>
> but running the code generates the error message: number of items to  
> replace is not a multiple of replacement length.
Because of the way you are referencing the matrix, probably. If you  
wanted a 10 x 100 x 5 array, then create an array. In R, as far as I  
can tell anyway, matrices are necessarily of 2 dimensions. Tables and  
arrays can be of higher dimension.
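
A minimal sketch of the array route, assuming only the numeric columns are kept. The list below is simulated, and the dimension order (rows x variables x samples) is a choice made for this example rather than the 10 x 100 x 5 layout mentioned above.

```r
set.seed(2)
# Hypothetical list of 10 samples, each a 100 x 4 all-numeric data frame
mysamples <- lapply(1:10, function(i) {
  data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100), d = rnorm(100))
})

# mat[i] addresses a single cell, so a whole data frame cannot be assigned
# to it; a 3-d array indexed as arr[, , i] holds one 100 x 4 slice per sample
arr <- array(NA_real_, dim = c(100, 4, 10))
for (i in seq_along(mysamples)) {
  arr[, , i] <- as.matrix(mysamples[[i]])
}

# Mean of each variable within each sample: a 4 x 10 matrix
slice_means <- apply(arr, c(2, 3), mean)
dim(slice_means)
```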
> I have tried unsuccessfully, by reading many, many helpful r-help  
> emails on this error, to understand my probably obvious mistake.
Sorting out such problems is best done with smaller test objects. I  
was surprised to see that you thought it was necessary to convert  
dataframes to matrices in order to calculate descriptive statistics.  
Nothing could be farther from the truth. Furthermore, if for some  
other more valid reason you wanted a list of matrices, there is a  
perfectly good function that will convert a dataframe to a matrix,  
as.matrix(), remembering of course that if there is a single  
character variable in the dataframe, the entire matrix will be of  
type character.
>
> Based on the small amount that I think I know about R it seems to me  
> that sampling the data frame and containing the samples in a list is  
> likely a pretty inefficient way to do this task. Any help that any  
> of you could provide to assist me in iteratively sampling the data  
> frame, and storing the samples in a form on which I can apply other  
> statistical tests would be greatly appreciated.
>
> Thank you very much for taking the time to consider my questions.

-- 
David Winsemius, MD
Heritage Laboratories
West Hartford, CT

Adam Carr

2009-Dec-22 11:48 UTC

Thanks to both of you for the comments and suggestions. Over the next couple of
days I plan to work through my simple problem using the help offered in this
forum.




________________________________
From: David Winsemius <dwinsemius@comcast.net>
To: Bert Gunter <gunter.berton@gene.com>

Sent: Mon, December 21, 2009 2:31:26 PM
Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame


On Dec 21, 2009, at 1:01 PM, Bert Gunter wrote:
> Didn't read this thread in detail, so the following suggestion may just be
> nonsense... (caveat emptor), but:
> 
> To sample from a data frame or matrix, sample from the row indices and then
> extract what you want from the sampled rows. Or sample directly from
> individual columns if that suffices. In general,
> 
> ?sample
> 
> on appropriate indices of the object in question.
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
> 
> 
> -----Original Message-----
> From: r-help-bounces@r-project.org [mailto:r-help-bounces@r-project.org] On
> Behalf Of Adam Carr
> Sent: Monday, December 21, 2009 9:53 AM
> To: David Winsemius
> Cc: r-help@r-project.org
> Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame
> 
> Good Afternoon Dr. Winsemius:
> 
> You ask some very good questions and make excellent points; my responses are
> below. I've tried to extract your questions and provide answers just to
> reduce the clutter.
> 
> 1. You might want to provide statistical justification for the otherwise
> puzzling sampling strategy.
> 
> I assume you mean my overall process of random sampling from a large data
> set. The data set is comprised of observations collected over four years.
> Although the basis for sampling would make a good four-frame Dilbert cartoon
> if it could be condensed enough, my answer begins with the unfortunate truth
> that there is a great divide between the technical and marketing groups at
> the business where I am employed. Many powerful marketing executives, some
> with technical backgrounds, feel that there is something fundamentally wrong
> with the manufacturing process because the data generated over the long term
> is not approximately normally distributed. My task was to examine this set
> of data, trying to keep the representation of Y, N and F approximately equal
> in the sample when compared to the large set, to determine if any subset
> exhibits the holy grail-like normal distribution characteristics. I don't
> feel that this is statistical justification, but it is the
> reason why I am doing this.
> 
> 2. It would help if you explained what you are attempting here in ordinary
> English. There are 10 elements in mysamples, each of which is a 100 x 5
> dataframe, and mat is just one 100 x 5 matrix, which you seem to be
> referencing incorrectly, given the fact that it has two, rather than one,
> dimension. Furthermore, those dataframes may not be of a uniform class,
> since you said you had a character variable. Do you really want these all in a
> character type matrix, which would be what is likely to happen given R's
> requirement that matrix elements be of only one class? What you say above
> suggests not.
> 
> It seems from your response that I incorrectly assumed that a list is not
> the same as a data frame. I started down this path after reading the
> questions and answers to a similar problem where the r-help responder
> suggested a two step process and said that the list must be converted to
> another form in order to be available for analysis.
A data.frame is a special type of list. You can also make lists of dataframes
(just as you can make lists of lists), which I thought the first portion of your
code would have done:

mysamples<-list()
for (i in 1:10){
mysamples[[i]] <- dataset[ sample(1:1637,100, prob=c(rep(163.7/1637,513),
rep(245.5/1637,197), rep(1227.8/1637,927)), replace = TRUE), ]
}

Each element in that list would have been a subset of your larger data.frame and
would itself have been a data.frame.
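
A tiny illustration of both points, using a made-up two-column data frame (the names strength and flag are placeholders):

```r
# Small made-up data frame
df <- data.frame(strength = c(510, 495, 502), flag = c("Y", "N", "F"))

is.list(df)   # a data frame is a list of equal-length columns
length(df)    # counts columns, not rows
nrow(df)      # counts rows

# A plain list whose elements are data frames
two_samples <- list(df[c(1, 1, 3), ], df[c(2, 3, 3), ])
class(two_samples[[1]])    # each element is still a data frame
sapply(two_samples, nrow)  # rows in each element
```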

> 
> And you are absolutely correct that I do not want each sample in a character
> type matrix.
> 
> In plain English, I hope, I am simply trying to iterate the process of
> removing random samples from the large data set, and then saving these
> samples in a format that is available for simple analysis. For example, if I
> remove five hundred mysample sets, each of which is composed of a 100 x 5
> sample of the large data set I am interested in determining the skewness,
> kurtosis, mean and standard deviation of each of the four numeric variables
> in each of the five hundred mysample sets.
So make a small dataframe with variables (columns) of the same type as in your
real data, maybe 25-30 rows in "extent" (not "length", since for a dataframe,
the length() function returns the number of columns).
> 
> 3. Sorting out such problems is best done with smaller test objects. I was
> surprised to see...type character.
> 
> I agree. I began to do this with a small test data set but it was late last
> evening and I realized that I should ask for help before proceeding on what
> I thought might be incorrect assumptions. I clearly misunderstood that a
> list needed to be converted to a data frame in order to be available for
> analysis.
Well, if each list element is already a data.frame then no conversions are
needed. The lapply function can be used to "loop" over a list, and you
can define a function that will only look at particular components of those
elements. There are also functions in packages that automate the process. The
describe function in Hmisc looks at each column and decides what type it is.
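
One way the lapply approach might look, sketched on simulated data. The column names are assumptions, and skew() is a hand-written moment-based estimate used here only to avoid depending on a package; it is not the skewness() function mentioned earlier in the thread.

```r
set.seed(3)
# Hypothetical stand-in data; column names are assumptions
dataset <- data.frame(tensile   = rnorm(200, 500, 20),
                      hardness  = rexp(200, 1 / 80),
                      indicator = sample(c("Y", "N", "F"), 200, replace = TRUE))

# replicate() with simplify = FALSE returns a list of sampled data frames
mysamples <- replicate(50,
                       dataset[sample(nrow(dataset), 100, replace = TRUE), ],
                       simplify = FALSE)

# Simple moment-based skewness (third central moment over sd cubed)
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

# Summarise only the numeric columns of one data frame
summarise_one <- function(d) {
  num <- d[sapply(d, is.numeric)]
  sapply(num, function(v) c(mean = mean(v), sd = sd(v), skew = skew(v)))
}

# One small matrix per sample: rows are statistics, columns are variables
results <- lapply(mysamples, summarise_one)
results[[1]]
```

Because the character column is filtered out inside summarise_one(), the samples themselves never need to be converted to matrices.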
> 
> Thank you for taking the time to respond. The discussion and suggestions are
> very helpful.
> 
> Adam
> 
David Winsemius, MD
Heritage Laboratories
West Hartford, CT


      
