Stodola, Kirk
2012-Nov-29 16:32 UTC
[R] Deleting certain observations (and their imprint?)
I'm manipulating a large dataset and need to eliminate some observations based on specific identifiers. This isn't a problem in and of itself (using which() or subset()), but an imprint of the deleted observations seems to remain, even though they have 0 observations. This is causing me problems later on. I'll use the dataset warpbreaks to illustrate; I apologize if this isn't in the best format.

## Summary of warpbreaks suggests three tension levels (H, M, L)
> summary(warpbreaks)
     breaks      wool   tension
 Min.   :10.00   A:27   L:18
 1st Qu.:18.25   B:27   M:18
 Median :26.00          H:18
 Mean   :28.15
 3rd Qu.:34.00
 Max.   :70.00

## Subset the dataset and keep only those observations with "L"
> wb.subset <- warpbreaks[which(warpbreaks$tension=="L"),]

## Summary of the subsetted data shows: L=18, M=0, H=0. Why are M and H still included?
> summary(wb.subset)
     breaks      wool   tension
 Min.   :14.00   A:9    L:18
 1st Qu.:26.00   B:9    M: 0
 Median :29.50          H: 0
 Mean   :36.39
 3rd Qu.:49.25
 Max.   :70.00

## The subsetted dataset does not show M or H
> wb.subset

Is there a way that M & H can be completely eliminated (i.e. they don't show up in summary)? The only way I found was to export the dataset and then reimport it, which seems pretty cumbersome.

Thanks in advance for any help.

-Kirk
Hi Kirk,

It's because tension is a factor with three levels, as you could see with str(warpbreaks). Factors are one of the mysteries of R that distinguish a novice from an initiate. Reading ?subset directs you to ?droplevels. Here's an example:

> summary(warpbreaks)
     breaks      wool   tension
 Min.   :10.00   A:27   L:18
 1st Qu.:18.25   B:27   M:18
 Median :26.00          H:18
 Mean   :28.15
 3rd Qu.:34.00
 Max.   :70.00
> str(warpbreaks)
'data.frame':   54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
> ?subset
> wb.subset <- warpbreaks[which(warpbreaks$tension=="L"),]
> summary(wb.subset)
     breaks      wool   tension
 Min.   :14.00   A:9    L:18
 1st Qu.:26.00   B:9    M: 0
 Median :29.50          H: 0
 Mean   :36.39
 3rd Qu.:49.25
 Max.   :70.00
> wb.subset <- droplevels(wb.subset)
> summary(wb.subset)
     breaks      wool   tension
 Min.   :14.00   A:9    L:18
 1st Qu.:26.00   B:9
 Median :29.50
 Mean   :36.39
 3rd Qu.:49.25
 Max.   :70.00

Sarah

--
Sarah Goslee
http://www.functionaldiversity.org
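For readers following along, here is a minimal sketch of the droplevels() approach from the reply above, checking the factor levels directly instead of reading them off summary(). It uses only the built-in warpbreaks data; the name wb.clean is made up for illustration and is not from the thread.

```r
# Sketch: unused factor levels survive row subsetting until they are dropped.
wb.subset <- warpbreaks[warpbreaks$tension == "L", ]

# The M and H rows are gone, but the levels are still attached to the factor.
levels(wb.subset$tension)   # "L" "M" "H"
table(wb.subset$tension)    # counts: L = 18, M = 0, H = 0

# droplevels() on a data frame drops unused levels from every factor column.
# wb.clean is a hypothetical name used only in this sketch.
wb.clean <- droplevels(wb.subset)
levels(wb.clean$tension)    # "L"
nlevels(wb.clean$wool)      # 2, since both A and B still occur in the subset
```

Assigning to a new name keeps the original subset around for comparison; overwriting wb.subset in place, as in the reply, works just as well.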
Hi

> ## Subset the dataset and keep only those observations with "L"
> > wb.subset <- warpbreaks[which(warpbreaks$tension=="L"),]

wb.subset <- warpbreaks[which(warpbreaks$tension=="L"), , drop=TRUE]

or

wb.subset$tension <- factor(wb.subset$tension)

or change tension from factor to character vector.

Regards
Petr
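The alternatives in this reply can be sketched roughly as follows. This is only an illustration under some assumptions: the names wb.refactored, wb.dropped and wb.char are invented here, and the drop = TRUE variant is applied to the factor column itself, because as far as I can tell drop = TRUE in data frame indexing only affects the dimensions of the result and leaves factor levels alone.

```r
# Sketch of three ways to get rid of unused levels in a single factor column.
wb.subset <- warpbreaks[warpbreaks$tension == "L", ]

# 1. Re-create the factor on the subset: factor() keeps only levels that occur.
wb.refactored <- wb.subset                      # hypothetical name
wb.refactored$tension <- factor(wb.refactored$tension)
levels(wb.refactored$tension)                   # "L"

# 2. Use drop = TRUE when indexing the factor itself (see the examples in ?factor).
wb.dropped <- wb.subset                         # hypothetical name
wb.dropped$tension <- wb.dropped$tension[, drop = TRUE]
levels(wb.dropped$tension)                      # "L"

# 3. Convert the column to a character vector, which has no levels at all.
wb.char <- wb.subset                            # hypothetical name
wb.char$tension <- as.character(wb.char$tension)
class(wb.char$tension)                          # "character"
```

Options 1 and 2 handle one column at a time, so droplevels() on the whole data frame is probably more convenient when several factor columns are affected. Option 3 changes how summary() and modelling functions treat the column, so it is only a fit when the factor structure is no longer needed.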