Eric Vander Wal
2009-Jul-09 01:19 UTC
[R] Sampling a dataframe based on the length of a subset of observations within
Thank you in advance for your consideration.

I have a dataframe of 2000+ observations with repeated measures across approximately 300 unique individuals. An event either does or does not happen (1, 0), and there is a suite of independent variables associated with the event. A simplified representation follows:

    my.df <- data.frame(id = c("A","A","A","B","B","C","C","C","C","C"),
                        event = c(0,0,1,0,1,0,0,1,1,0))

    id   event
    A    0
    A    0
    A    1
    B    0
    B    1
    C    0
    C    0
    C    1
    C    1
    C    0

I need to sample my.df to select the same number of observations with event = 0 as event = 1 for each unique id. I can reshape or tapply my.df to group by id and determine what sample size I need.

    library(reshape)
    my.df.melt <- melt(my.df, id = "id")
    my.df.cast <- cast(my.df.melt, id ~ value, length, fill = 0)
    my.df.cast

         Event
    id   0    1
    A    2   *1*
    B    1   *1*
    C    3   *2*

Given the above dataframe, I need to randomly select (sample) from my.df *one* observation from my.df[my.df$id == "A" & my.df$event == 0, ], *one* from my.df[my.df$id == "B" & my.df$event == 0, ], and *two* from my.df[my.df$id == "C" & my.df$event == 0, ], and then rbind them to my.df[my.df$event == 1, ]. However, it is impractical to code each case individually.

Alternatively, if A in my.df matches A in my.df.cast, then:

    sample(my.df[my.df$id == "A" & my.df$event == 0, ], size = my.df.cast[1, 3], replace = FALSE)

I think I am close to a solution, but I'm not sure how to code it to run through the entire dataframe. This is how my.new.df would look:

    id   event
    A    0
    A    1
    B    0
    B    1
    C    0
    C    0
    C    1
    C    1

Thank you kindly for your help,

Eric

--
Eric Vander Wal
Ph.D. Candidate
University of Saskatchewan, Department of Biology,
112 Science Place, Saskatoon, SK., S7N 5E2

"Pluralitas non est ponenda sine necessitate"
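
For illustration, the per-id sampling described above could be sketched in base R roughly as follows. This is only one possible approach, not a confirmed solution from the thread. The helper name balance_one_id is hypothetical, and the sketch assumes each id has at least as many event = 0 rows as event = 1 rows, as in the example data; where that does not hold, it simply keeps every available zero.

```r
# Example data from the post
my.df <- data.frame(
  id    = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
  event = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 0)
)

# Hypothetical helper: for the rows of a single id, keep every event == 1 row
# plus a random sample of equally many event == 0 rows.
balance_one_id <- function(d) {
  ones  <- which(d$event == 1)
  zeros <- which(d$event == 0)
  # Assumption: if there are fewer zeros than ones, keep all zeros
  n <- min(length(ones), length(zeros))
  # Sample positions with sample.int() so a single zero row is handled safely
  keep <- c(zeros[sample.int(length(zeros), n)], ones)
  # Restore the original row order within the id
  d[sort(keep), , drop = FALSE]
}

set.seed(1)  # only to make the random draw reproducible
my.new.df <- do.call(rbind, lapply(split(my.df, my.df$id), balance_one_id))
rownames(my.new.df) <- NULL

# Check the balance: each id should have equal counts of 0 and 1
table(my.new.df$id, my.new.df$event)
```

Sampling row positions rather than passing the data frame itself to sample() matters here, because sample() applied to a data frame draws from its columns, not its rows.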