I have data in a "long" format where each student occupies multiple rows, one per observation. I need to subset these data based on a condition which I am having difficulty defining.

The dataset I am working with is large, but here is a simple data structure to illustrate the issue:

    tmp <- data.frame(id = 1:3, matrix(rnorm(30), ncol = 10))
    long <- reshape(tmp, idvar = 'id', varying = list(names(tmp)[2:11]),
                    v.names = 'item', timevar = 'position', direction = 'long')
    long <- long[order(long$id), ]
    long <- long[c(-2, -13), ]

What I need to do is subset these data so I have the first 6 rows for each unique ID. The problem is that the data are unbalanced in that each ID has a different number of observations (which is why I removed obs 2 and 13).

If the data were balanced, the subset would be trivial and I could just do

    long <- subset(long, position < 7)

However, the data are not balanced. Consequently, if I were to do this for the unbalanced data I would not have the first 6 obs for the first ID. I would only have the first 5. Theoretically, what I want for id 1 (and for each unique id) is this:

    ID1 <- subset(long, id == 1)
    ID1[1:6, ]

However, the goal is to subset the entire data frame at once, such that the subset returns a new data frame with the first 6 rows for each unique id. Is there a feasible method for doing this subset that anyone can suggest? My actual dataset has more than 24,000 unique ids, so I am hoping to avoid looping through this if possible.

Thanks,
Harold
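To make the imbalance concrete, the sketch below rebuilds the sample data and counts rows per id before and after the position-based subset. The `set.seed(1)` call and the names `by_position`, `rows_per_id`, and `rows_kept` are additions for illustration only; they are not part of the original post.

```r
# Rebuild the sample data from the post (set.seed is added here only so
# the example is reproducible; the counts below do not depend on it).
set.seed(1)
tmp <- data.frame(id = 1:3, matrix(rnorm(30), ncol = 10))
long <- reshape(tmp, idvar = "id", varying = list(names(tmp)[2:11]),
                v.names = "item", timevar = "position", direction = "long")
long <- long[order(long$id), ]
long <- long[c(-2, -13), ]   # drops position 2 of id 1 and position 3 of id 2

# Rows available per id after the two deletions
rows_per_id <- table(long$id)

# The position-based subset keeps positions 1..6, not the first 6 rows
by_position <- subset(long, position < 7)
rows_kept <- table(by_position$id)

print(rows_per_id)
print(rows_kept)
```

Ids 1 and 2 each end up with one row fewer than id 3 after the position filter, which is the shortfall described above.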
Apologies, but it seems there were some word-wrap issues in my prior email. To avoid confusion, here is the code for the sample data again:

    tmp <- data.frame(id = 1:3, matrix(rnorm(30), ncol = 10))
    long <- reshape(tmp, idvar = 'id', varying = list(names(tmp)[2:11]),
                    v.names = 'item', timevar = 'position', direction = 'long')
    long <- long[order(long$id), ]
    long <- long[c(-2, -13), ]

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Doran, Harold
> Sent: Tuesday, June 06, 2006 5:08 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Subset data in long format
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
Try this:

    subset(long, seq(id) - match(id, id) < 6)

Here seq(id) is the overall row number and match(id, id) is the row at which each id first appears, so the difference is a 0-based row counter within each id. This relies on the rows for each id being contiguous, as they are after the sort by id in your example.

On 6/6/06, Doran, Harold <HDoran at air.org> wrote:
> What I need to do is subset these data so I have the first 6 rows for
> each unique ID.
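As a check on the suggestion above, the following sketch applies it to the sample data and compares it with an `ave()`-based counter that does not depend on row order. The `ave()` variant, the `set.seed(1)` call, and the names `within_id`, `first6`, `rank_in_id`, and `first6_ave` are assumptions added for illustration, not part of the original reply.

```r
# Sample data as in the original post (set.seed added for reproducibility)
set.seed(1)
tmp <- data.frame(id = 1:3, matrix(rnorm(30), ncol = 10))
long <- reshape(tmp, idvar = "id", varying = list(names(tmp)[2:11]),
                v.names = "item", timevar = "position", direction = "long")
long <- long[order(long$id), ]
long <- long[c(-2, -13), ]

# Suggested idiom: overall row number minus the row where each id first
# appears gives a 0-based counter within id (needs contiguous ids).
within_id <- seq(long$id) - match(long$id, long$id)
first6 <- long[within_id < 6, ]

# Illustrative alternative: ave() numbers the rows within each id
# directly (1-based), so it also works when ids are interleaved.
rank_in_id <- ave(seq_along(long$id), long$id, FUN = seq_along)
first6_ave <- long[rank_in_id <= 6, ]

print(table(first6$id))
print(first6$position[first6$id == 1])
```

Both versions are vectorised, so neither loops over ids explicitly. In current versions of R, `seq_along(id)` is a safer spelling than `seq(id)`, because `seq()` on a length-one vector expands to `1:id` instead of returning 1.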