Adam Carr
2009-Dec-21 15:12 UTC
[R] Question About Repeat Random Sampling from a Data Frame
Good Morning:

I've read many, many posts on the r-help system and I feel compelled to quickly admit that I am relatively new to R. I do have several reference books around me, but I cannot count myself among the fortunate who seem to have strong programming intuition.

I have a data set consisting of 1637 observations of five variables: tensile strength, yield strength, elongation, hardness and a character indicator with three levels: (Y)es, (N)o, and (F)ail.

My objective is to randomly sample various subsets from this data set and then evaluate these subsets using simple parameters, among them tests for normality, shape and skewness. The data set is ordered by the character variable prior to sampling, and the samples are weighted to mirror representation in an overall, physical process.

I am sampling the data set using this code:

  sample <- dataset[sample(1:1637, 500, prob=c(rep(163.7/1637,513),rep(245.5/1637,197),rep(1227.8/1637,927)),replace = TRUE),]

What I would like to do is iterate this process to create many (say 500 or more) sampled sets of n=500 and then evaluate each set for the parameters of interest. I would actually be evaluating each variable within each subset for my characteristic of interest. I am familiar with sampling and saving single columns of data to do this sort of thing, but I am not sure how to accomplish this with a multiple-variable data set.

For example, I am currently iterating this using a clunky process:

  mysamples<-list()
  for (i in 1:10){
  mysamples[[i]] <- dataset[ sample(1:1637,100,prob=c(rep(163.7/1637,513),rep(245.5/1637,197),rep(1227.8/1637,927)),replace = TRUE), ]
  }

But this leaves me with the additional task of defining each mysample[i] iteration and converting it to a form on which I can apply a standard statistical test like mean() or skewness() to the variable columns within each subset. I have attempted to iteratively convert these lists using this code:

  mat<-matrix(nrow=100,ncol=5)
  for (i in 1:length(mysamples))
  {mat[i]<-do.call('rbind',mysamples[i])}

but running the code generates the error message: number of items to replace is not a multiple of replacement length. I have tried unsuccessfully, by reading many, many helpful r-help emails on this error, to understand my probably obvious mistake.

Based on the small amount that I think I know about R it seems to me that sampling the data frame and containing the samples in a list is likely a pretty inefficient way to do this task. Any help that any of you could provide to assist me in iteratively sampling the data frame, and storing the samples in a form on which I can apply other statistical tests would be greatly appreciated.

Thank you very much for taking the time to consider my questions.

Adam
Gustaf Rydevik
2009-Dec-21 15:20 UTC
[R] Question About Repeat Random Sampling from a Data Frame
On Mon, Dec 21, 2009 at 4:12 PM, Adam Carr <adamlcarr at yahoo.com> wrote:
> For example, I am currently iterating this using a clunky process:
>
> mysamples<-list()
> for (i in 1:10){
> mysamples[[i]] <- dataset[ sample(1:1637,100,prob=c(rep(163.7/1637,513),rep(245.5/1637,197),rep(1227.8/1637,927)),replace = TRUE), ]
> }
>
> But this leaves me with the additional task of defining each mysample[i] iteration and converting it to a form on which I can apply a standard statistical test like mean() or skewness() to the variable columns within each subset.

That's pretty much how I tend to do those things. What you seem to be missing is the ?apply family:

  mysamples.means<-lapply(mysamples,function(x)mean(x[,1]))

Hope that gets you on your way. If you want more help, I'd suggest including an example data set in your follow-up messages.

/Gustaf

--
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address: Essingetorget 40, 112 66 Stockholm, SE
skype: gustaf_rydevik
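A minimal sketch of the apply-family idea suggested above, run on a small made-up data frame rather than the real 1637-row data. The column names, the sample sizes and the unweighted sampling are assumptions chosen only to keep the example short.

```r
# Hypothetical stand-in for the real data: 30 rows, four numeric columns
# and one character indicator (names are assumptions, not from the thread).
set.seed(1)
dataset <- data.frame(
  tensile    = rnorm(30, 60, 5),
  yield      = rnorm(30, 40, 4),
  elongation = rnorm(30, 25, 3),
  hardness   = rnorm(30, 80, 6),
  status     = rep(c("F", "N", "Y"), each = 10),
  stringsAsFactors = FALSE
)

# Ten samples of 20 rows each, stored as a list of data frames.
mysamples <- lapply(1:10, function(i) {
  dataset[sample(nrow(dataset), 20, replace = TRUE), ]
})

# Mean of the first column in every sample: one number per sample.
tensile.means <- sapply(mysamples, function(x) mean(x[, 1]))

# Means of all four numeric columns: a 4 x 10 matrix, one column per sample.
all.means <- sapply(mysamples, function(x) colMeans(x[, 1:4]))
```

lapply() always returns a list, while sapply() simplifies the result to a vector or matrix when every element has the same length, which is usually more convenient for summary statistics.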
David Winsemius
2009-Dec-21 16:23 UTC
[R] Question About Repeat Random Sampling from a Data Frame
On Dec 21, 2009, at 10:12 AM, Adam Carr wrote:

> For example, I am currently iterating this using a clunky process:
>
> mysamples<-list()
> for (i in 1:10){
> mysamples[[i]] <- dataset[ sample(1:1637,100,prob=c(rep(163.7/1637,513),rep(245.5/1637,197),rep(1227.8/1637,927)),replace = TRUE), ]
> }

Using lists to store intermediate results is not considered clunky in R. (You might want to provide statistical justification for the otherwise puzzling sampling strategy.)

> But this leaves me with the additional task of defining each mysample[i] iteration and converting it to a form on which I can apply a standard statistical test like mean() or skewness() to the variable columns within each subset. I have attempted to iteratively convert these lists using this code:
>
> mat<-matrix(nrow=100,ncol=5)
> for (i in 1:length(mysamples))
> {mat[i]<-do.call('rbind',mysamples[i])}

It would help if you explained what you are attempting here in ordinary English. There are 10 elements in mysamples, each of which is a 100 x 5 dataframe, and mat is just one 100 x 5 matrix, which you seem to be referencing incorrectly, given the fact that it has two, rather than one, dimension. Furthermore, those dataframes may not be of a uniform class, since you said you had a character variable. Do you really want these all in a character type matrix, which would be what is likely to happen given R's requirement that matrix elements be of only one class? What you say above suggests not.

> but running the code generates the error message: number of items to replace is not a multiple of replacement length.

Because of the way you are referencing the matrix, probably. If you wanted a 10 x 100 x 5 array, then create an array. In R, as far as I can tell anyway, matrices are necessarily of 2 dimensions. Tables and arrays can be of higher dimension.

> I have tried unsuccessfully, by reading many, many helpful r-help emails on this error, to understand my probably obvious mistake.

Sorting out such problems is best done with smaller test objects. I was surprised to see that you thought it was necessary to convert dataframes to matrices in order to calculate descriptive statistics. Nothing could be farther from the truth. Furthermore, if for some other more valid reason you wanted a list of matrices, there are perfectly good functions that will convert a dataframe to a matrix: as.matrix(), remembering of course that if there is a single character variable in the dataframe the entire matrix will be of type character, or data.matrix(), which always returns a numeric matrix.

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
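A short sketch of the two points made in this reply: a character column forces as.matrix() to produce a character matrix, and a set of equally sized numeric samples fits in a 3-d array rather than a single 2-d matrix. The data frame below is hypothetical and deliberately tiny.

```r
# Small hypothetical data frame: two numeric columns and one character column.
df <- data.frame(x = c(1.5, 2.5, 3.5),
                 y = c(10, 20, 30),
                 flag = c("Y", "N", "F"),
                 stringsAsFactors = FALSE)

# as.matrix() on mixed column types gives a character matrix.
m.chr <- as.matrix(df)
typeof(m.chr)   # "character"

# Dropping the character column first keeps the matrix numeric.
m.num <- as.matrix(df[, c("x", "y")])
typeof(m.num)   # "double"

# A 3-d array can hold several numeric samples: sample x row x column.
samples <- list(df, df[c(3, 2, 1), ])
arr <- array(NA_real_, dim = c(length(samples), nrow(df), 2))
for (i in seq_along(samples)) {
  # Fill the whole i-th slice, using all three indices.
  arr[i, , ] <- as.matrix(samples[[i]][, c("x", "y")])
}
```

The array is only needed if a later step really wants one rectangular object; for column-wise descriptive statistics the list of data frames can be used directly.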
Adam Carr
2009-Dec-22 11:48 UTC
[R] Question About Repeat Random Sampling from a Data Frame
Thanks to both of you for the comments and suggestions. Over the next couple of days I plan to work through my simple problem using the help offered in this forum.

________________________________
From: David Winsemius <dwinsemius@comcast.net>
To: Bert Gunter <gunter.berton@gene.com>
Sent: Mon, December 21, 2009 2:31:26 PM
Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame

On Dec 21, 2009, at 1:01 PM, Bert Gunter wrote:

> Didn't read this thread in detail, so the following suggestion may just be nonsense... (caveat emptor), but:
>
> To sample from a data frame or matrix, sample from the row indices and then extract what you want from the sampled rows. Or sample directly from individual columns if that suffices. In general,
>
> ?sample
>
> on appropriate indices of object in question.
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> -----Original Message-----
> From: r-help-bounces@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of Adam Carr
> Sent: Monday, December 21, 2009 9:53 AM
> To: David Winsemius
> Cc: r-help@r-project.org
> Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame
>
> Good Afternoon Dr. Winsemius:
>
> You ask some very good questions and make excellent points; my responses are below. I've tried to extract your questions and provide answers just to reduce the clutter.
>
> 1. You might want to provide statistical justification for the otherwise puzzling sampling strategy.
>
> I assume you mean my overall process of random sampling from a large data set. The data set is comprised of observations collected over four years. Although the basis for sampling would make a good four-frame Dilbert cartoon if it could be condensed enough, my answer begins with the unfortunate truth that there is a great divide between the technical and marketing groups at the business where I am employed. Many powerful marketing executives, some with technical backgrounds, feel that there is something fundamentally wrong with the manufacturing process because the data generated over the long term is not approximately normally distributed. My task was to examine this set of data, trying to keep the representation of Y, N and F approximately equal in the sample when compared to the large set, to determine if any subset exhibits the holy grail-like normal distribution characteristics. I don't feel that this is statistical justification, but it is the reason why I am doing this.
>
> 2. It would help if you explained what you are attempting here in ordinary English. There are 10 elements in mysamples, each of which is a 100 x 5 dataframe, and mat is just one 100 x 5 matrix, which you seem to be referencing incorrectly, given the fact that it has two, rather than one, dimension. Furthermore, those dataframes may not be of a uniform class, since you said you had character variable. Do you really want these all in a character type matrix, which would be what is likely to happen given R's requirement that matrix element be of only one class? What you say above suggests not.
>
> It seems from your response that I incorrectly assumed that a list is not the same as a data frame. I started down this path after reading the questions and answers to a similar problem where the r-help responder suggested a two step process and said that the list must be converted to another form in order to be available for analysis.

A data.frame is a special type of list. You can also make lists of dataframes (just as you can make lists of lists), which I thought the first portion of your code would have done:

  mysamples<-list()
  for (i in 1:10){
  mysamples[[i]] <- dataset[ sample(1:1637,100,
       prob=c(rep(163.7/1637,513),
              rep(245.5/1637,197),
              rep(1227.8/1637,927)),
       replace = TRUE), ]
  }

Each element in that list would have been a subset of your larger data.frame and would itself have been a data.frame.

> And you are absolutely correct that I do not want each sample in a character type matrix.
>
> In plain English, I hope, I am simply trying to iterate the process of removing random samples from the large data set, and then saving these samples in a format that is available for simple analysis. For example, if I remove five hundred mysample sets, each of which is composed of a 100 x 5 sample of the large data set I am interested in determining the skewness, kurtosis, mean and standard deviation of each of the four numeric variables in each of the five hundred mysample sets.

So make a small dataframe with variables (columns) of the same type as in your real data, maybe 25-30 rows in "extent" (not "length", since for a dataframe, the length() function returns the number of columns).

> 3. Sorting out such problems is best done with smaller test objects. I was surprised to see...type character.
>
> I agree. I began to do this with a small test data set but it was late last evening and I realized that I should ask for help before proceeding on what I thought might be incorrect assumptions. I clearly misunderstood that a list needed to be converted to a data frame in order to be available for analysis.

Well, if each list element is already a data.frame then no conversions are needed. The lapply function can be used to "loop" over a list, and you can define a function that will only look at particular components of those elements. There are also functions in packages that automate the process.
The describe function in Hmisc looks at each column and decides what type it is.

> Thank you for taking the time to respond. The discussion and suggestions are very helpful.
>
> Adam

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
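Putting the advice in this thread together, a rough sketch of the whole workflow on a small test data frame might look like the following. Everything specific here is an assumption: the column names, the group sizes, the reading of the thread's weights as target group shares of roughly 10%, 15% and 75%, and the mapping of those shares to F, N and Y. The moment-based skew() and kurt() helpers are hypothetical stand-ins for the skewness() and kurtosis() functions that add-on packages provide.

```r
set.seed(42)

# Hypothetical small test data, ordered by the indicator column.
dataset <- data.frame(
  tensile    = rnorm(30, 60, 5),
  yield      = rnorm(30, 40, 4),
  elongation = rnorm(30, 25, 3),
  hardness   = rnorm(30, 80, 6),
  status     = rep(c("F", "N", "Y"), times = c(10, 8, 12)),
  stringsAsFactors = FALSE
)

# Assumed target share of each group in a sample (mapping is hypothetical).
shares <- c(F = 0.10, N = 0.15, Y = 0.75)

# Per-row weight = group share / group size, so that the weights of each
# group add up to that group's share. sample() rescales prob internally,
# so equal per-row weights within unequal groups would give other shares.
group.size <- ave(rep(1, nrow(dataset)), dataset$status, FUN = sum)
w <- unname(shares[dataset$status]) / group.size

# Draw one weighted sample of n rows, with replacement.
draw <- function(n) {
  dataset[sample(nrow(dataset), n, replace = TRUE, prob = w), ]
}

# 50 samples of 20 rows each, kept as a list of data frames.
mysamples <- replicate(50, draw(20), simplify = FALSE)

# Simple moment-based helpers (hypothetical, not a package API).
skew <- function(x) { d <- x - mean(x); mean(d^3) / mean(d^2)^1.5 }
kurt <- function(x) { d <- x - mean(x); mean(d^4) / mean(d^2)^2 }

# Statistics for every numeric column of one sample: a 4 x 4 matrix.
sample.stats <- function(s) {
  num <- s[sapply(s, is.numeric)]
  sapply(num, function(x) {
    c(mean = mean(x), sd = sd(x), skew = skew(x), kurt = kurt(x))
  })
}

results <- lapply(mysamples, sample.stats)

# Skewness of each variable across all samples: 50 rows, one per sample.
skews <- t(sapply(results, function(r) r["skew", ]))
```

Keeping the per-sample results in a list and only flattening the statistic of interest at the end avoids any conversion of the samples themselves, and the character column never has to enter a matrix.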