Jeff Johnson
2014-Jan-14 19:38 UTC
[R] Subsetting on multiple criteria (AND condition) in R
I'm running the following to get what I would expect is a subset of countries that are not equal to "US" AND COUNTRY is not in one of my validcountries values.

non_us <- subset(mydf, (COUNTRY %in% validcountries) & COUNTRY != "US",
                 select = COUNTRY, na.rm=TRUE)

however, when I then do table(non_us) I get:

> table(non_us)
non_us
   AE AN AR AT AU BB BD BE BH BM BN BO BR BS CA CH CM CN CO CR CY DE DK DO EC ES
 0  3  0  2  1 31  4  1  1  1 45  1  1  4  5 86  3  1  8  1  2  1  8  2  1  2  4
FI FR GB GR GU HK ID IE IL IN IO IT JM JP KH KR KY LU LV MO MX MY NG NL NO NZ PA
 2  4 35  3  3 14  3  5  2  5  1  2  1 15  1 11  2  2  1  1 23  7  1  6  1  3  1
PE PG PH PR PT RO RU SA SE SG TC TH TT TW TZ US ZA
 2  1  1  8  1  1  1  1  1 18  1  1  2 11  1  0  3

Notice US appears as the second to last. I expected it to NOT appear.

Do you know if I'm using incorrect syntax? Is the & symbol equivalent to AND (notice I have 2 criteria for subsetting)? Also, is COUNTRY != "US" valid syntax? I don't get errors, but then again I don't get what I expect back.

Thanks in advance!

--
Jeff
Hi,

Try:

table(as.character(non_us[,"COUNTRY"]))

A.K.
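The as.character() suggestion works because a character vector carries no levels, so table() only counts values that are actually present. A minimal sketch follows; the contents of mydf and validcountries were never posted, so the values below are made-up stand-ins.

```r
# Hypothetical stand-in for mydf; the real data were not posted.
mydf <- data.frame(COUNTRY = factor(c("US", "CA", "CA", "GB", "US", "ZZ")))
validcountries <- c("US", "CA", "GB")   # assumed lookup vector

non_us <- subset(mydf, (COUNTRY %in% validcountries) & COUNTRY != "US",
                 select = COUNTRY)

# The factor still carries all four original levels, so table() reports
# zero counts for the levels that were filtered out.
tab_factor <- table(non_us$COUNTRY)

# A character vector has no levels; only values actually present are counted.
tab_char <- table(as.character(non_us[, "COUNTRY"]))

print(tab_factor)
print(tab_char)
```

Note that the character version orders its entries alphabetically, which may or may not be the order you want.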
Marc Schwartz
2014-Jan-14 21:05 UTC
[R] Subsetting on multiple criteria (AND condition) in R
On Jan 14, 2014, at 1:38 PM, Jeff Johnson <mrjefftoyou at gmail.com> wrote:

> non_us <- subset(mydf, (COUNTRY %in% validcountries) & COUNTRY != "US",
> select = COUNTRY, na.rm=TRUE)
>
> [...]
>
> Notice US appears as the second to last. I expected it to NOT appear.
>
> Do you know if I'm using incorrect syntax? Is the & symbol equivalent to
> AND (notice I have 2 criteria for subsetting)? Also, is COUNTRY != "US"
> valid syntax?

Review the Details section of ?subset, where you will find the following:

"Factors may have empty levels after subsetting; unused levels are not automatically removed. See droplevels for a way to drop all unused levels from a data frame."

Your syntax is fine and the behavior is as expected.

Regards,

Marc Schwartz
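To illustrate the two points in this reply: `&` is the element-wise AND, and droplevels() removes the levels left unused after subsetting. This is only a sketch; mydf, its N column, and validcountries are hypothetical stand-ins for the poster's objects.

```r
# Hypothetical data; the original poster's mydf was not shared.
mydf <- data.frame(COUNTRY = factor(c("US", "CA", "CA", "GB", "ZZ")),
                   N = 1:5)
validcountries <- c("US", "CA", "GB")   # assumed lookup vector

# `&` is the element-wise AND: a row is kept only if both tests are TRUE.
keep <- (mydf$COUNTRY %in% validcountries) & mydf$COUNTRY != "US"

non_us <- mydf[keep, ]
levels_before <- levels(non_us$COUNTRY)  # unused levels are still attached

# droplevels() removes unused levels from every factor in the data frame.
non_us_clean <- droplevels(non_us)
levels_after <- levels(non_us_clean$COUNTRY)

print(levels_before)
print(levels_after)
print(table(non_us_clean$COUNTRY))
```

Calling droplevels() on the whole data frame is convenient when several factor columns are affected by the same subset.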
William Dunlap
2014-Jan-14 21:29 UTC
[R] Subsetting on multiple criteria (AND condition) in R
Here is a reproducible example of your problem where you do not want to see a table entry for "Medium".

> tmp_df <- data.frame(Size=factor(rep(c("Small","Medium","Large"),1:3), levels=c("Small","Medium","Large")))
> non_medium <- subset(tmp_df, Size != "Medium", select=Size)
> table(non_medium)
non_medium
 Small Medium  Large
     1      0      3

The problem arises because, by default, when you take a subset of a factor all the levels of the factor are retained and table(factor) makes an entry for every level.

If you want to drop the unused levels in a factor (and retain the order of the remaining levels) you can pass it through the factor function:

> table(Size=factor(non_medium$Size))
Size
Small Large
    1     3

You can also subset the factor with the drop=TRUE argument to drop the unused levels when you make the subset:

> table(Size=tmp_df$Size[tmp_df$Size != "Medium", drop=TRUE])
Size
Small Large
    1     3

Some will say to use as.character on the factor or not to use factors at all. That works if you are OK with the entries in the table being in alphabetic order and not a semantic order of your choosing.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
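The closing point about ordering can be seen by tabulating the same subset both ways. The sketch below reuses the tmp_df example from this reply; the names by_factor and by_char are introduced here only for illustration.

```r
tmp_df <- data.frame(Size = factor(rep(c("Small", "Medium", "Large"), 1:3),
                                   levels = c("Small", "Medium", "Large")))
non_medium <- subset(tmp_df, Size != "Medium", select = Size)

# factor() drops the unused "Medium" level but keeps the declared order.
by_factor <- table(Size = factor(non_medium$Size))

# as.character() discards the levels, so table() sorts the values itself.
by_char <- table(Size = as.character(non_medium$Size))

print(names(by_factor))  # "Small" "Large"
print(names(by_char))    # "Large" "Small"
```

With only two remaining sizes the difference is cosmetic, but for ordered categories with many values the factor route keeps the table readable.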