thr3ads.net - R help - [R] Sampling a dataframe based on the length of a subset of observations within [Jul 2009]

Eric Vander Wal

2009-Jul-09 01:19 UTC

[R] Sampling a dataframe based on the length of a subset of observations within

Thank you in advance for your consideration.

I have a dataframe of 2000+ observations with repeated measures across 
approximately 300 unique individuals. An event either does or does not 
happen (1, 0), and there is a suite of independent variables associated 
with the event. A simplified representation follows:

my.df <- data.frame(id = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
                    event = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 0))

_id_  _event_
A     0
A     0
A     1
B     0
B     1
C     0
C     0
C     1
C     1
C     0

I need to sample my.df to select the same number of observations with 
event = 0 as event = 1 for each unique id.
I can reshape or tapply my.df to group by id and determine what sample 
size I need, for example with my.df.cast:
library(reshape)
my.df.melt<-melt(my.df, id="id")
my.df.cast<-cast(my.df.melt, id~value, length, fill=0)
my.df.cast

       Event
_id_      _0_   _1_
A     2     *1*
B     1     *1*
C     3     *2*
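
The same per-id counts can probably be obtained without the reshape package. The following is only a minimal sketch in base R, assuming the example my.df above; the names counts and n.needed are illustrative and not part of the original post.

```r
# Example data from the post
my.df <- data.frame(id = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
                    event = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 0))

# Cross-tabulate id against event; rows are ids, columns are "0" and "1"
counts <- table(id = my.df$id, event = my.df$event)
counts

# Number of event == 0 rows to draw per id (= number of event == 1 rows);
# 'n.needed' is a hypothetical name used only for this sketch
n.needed <- counts[, "1"]
n.needed
```

Indexing the table by the column name "1" rather than by position avoids depending on the column order.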

Given the above dataframe, I need to randomly select (sample) from my.df 
*one* observation from my.df[my.df$id == "A" & my.df$event == 0, ], *one* 
from my.df[my.df$id == "B" & my.df$event == 0, ], and *two* from 
my.df[my.df$id == "C" & my.df$event == 0, ], and then rbind them to 
my.df[my.df$event == 1, ]. However, it is impractical to code each case 
individually.

Alternatively, if A in my.df matches A in my.df.cast, then 
sample(my.df[my.df$id == "A" & my.df$event == 0, ], size = my.df.cast[1, 3], 
replace = FALSE). I think I am close to a solution, but I'm not sure how 
to code it to run through the entire dataframe.

This is how my.new.df would look:

_id_  _event_
A     0
A     1
B     0
B     1
C     0
C     0
C     1
C     1
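
One way this might be approached is to split the dataframe by id, sample within each piece, and rbind the pieces back together. The following is a minimal sketch in base R, not a definitive answer: balance_one is a hypothetical helper name, set.seed is there only to make the draw reproducible, and the sketch assumes that when an id has fewer event = 0 rows than event = 1 rows, all of its event = 0 rows are simply kept.

```r
# Example data from the post
my.df <- data.frame(id = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
                    event = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 0))

set.seed(1)  # only so the random draw is reproducible

# Hypothetical helper: for the rows of one id, keep every event == 1 row
# plus a random subset of the event == 0 rows of (at most) the same size.
balance_one <- function(d) {
  ones  <- which(d$event == 1)
  zeros <- which(d$event == 0)
  n     <- min(length(ones), length(zeros))
  # Sample positions with sample.int() rather than sample(zeros, n):
  # sample() treats a single number x as 1:x, which would be wrong here
  keep0 <- zeros[sample.int(length(zeros), n)]
  # Return the kept rows in their original order
  d[sort(c(keep0, ones)), , drop = FALSE]
}

# Apply the helper to each id and stack the results
my.new.df <- do.call(rbind, lapply(split(my.df, my.df$id), balance_one))
rownames(my.new.df) <- NULL
my.new.df
```

Because the event = 0 rows are chosen at random, the rows within an id may not appear in the same order as in the table above, but each id should end up with equal numbers of 0 and 1 events.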

Thank you kindly for your help,

Eric

-- 
Eric Vander Wal
Ph.D. Candidate
University of Saskatchewan, 
Department of Biology,
112 Science Place, 
Saskatoon, SK., S7N 5E2

"Pluralitas non est ponenda sine necessitate"
