Michael Friendly
2015-Mar-05 18:45 UTC
[R] subset a data frame by largest frequencies of factors
A consulting client has a large data set with a binary response (negative) and two factors (ctry and member) which have many levels, but many occur with very small frequencies. It is far too sparse with a model like

    glm(negative ~ ctry+member, family=binomial)

    > str(Dataset)
    'data.frame': 10672 obs. of 5 variables:
     $ ctry    : Factor w/ 31 levels "Barbados","Belize",..: 21 21 5 22 18 18 18 18 26 18 ...
     $ member  : Factor w/ 163 levels "","ADHOPIA, PREETI ",..: 150 19 19 111 120 1 1 4 55 18 ...
     $ negative: int 0 1 0 1 1 1 1 0 0 0 ...
    >

For analysis, we'd like to subset the data to include only those that occur with frequency greater than a given value, or the top 10 (say) in frequency, or the highest frequency categories accounting for 80% (say) of the total.

I'm not sure how to do any of these in R. Can anyone help?

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street    Web: http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA
David L Carlson
2015-Mar-05 20:15 UTC
[R] subset a data frame by largest frequencies of factors
These two commands will compute the cell frequencies and then sort them:

    e <- as.data.frame(xtabs(~ctry+member, Dataset))
    f <- e[order(e$Freq, decreasing=TRUE),]

Then draw your subset, either the 10 most frequent cells

    g <- head(f, 10)

or the most frequent cells that together account for up to 80% of the total

    g <- f[cumsum(f$Freq)/sum(f$Freq) <= .8,]

Finally merge the sample with the original data and delete the unused factor levels:

    sample <- merge(Dataset, g[,-3])
    sample$ctry <- factor(sample$ctry)
    sample$member <- factor(sample$member)

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352
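The xtabs/order/merge approach in this reply can be tried end to end on a small made-up data frame. The sketch below is only an illustration: the values in Dataset are invented stand-ins (not the client's data), and the names cells, above10, top2, top80, at_least80, keep and sub are hypothetical. It covers the three criteria from the question: a frequency threshold, the top N cells, and the cells making up 80% of the total.

```r
# Toy stand-in for Dataset: five ctry:member cells with 40, 10, 30, 15, 5 rows.
# (Invented values, used only to make the steps checkable.)
n_per_cell <- c(40, 10, 30, 15, 5)
Dataset <- data.frame(
  ctry     = factor(rep(c("A", "A", "B", "C", "D"), times = n_per_cell)),
  member   = factor(rep(c("m1", "m2", "m1", "m3", "m4"), times = n_per_cell)),
  negative = rep(c(0L, 1L), length.out = sum(n_per_cell))
)

# Cell frequencies, empty cells dropped, sorted from most to least frequent.
cells <- as.data.frame(xtabs(~ ctry + member, Dataset))
cells <- cells[cells$Freq > 0, ]
cells <- cells[order(cells$Freq, decreasing = TRUE), ]

# Criterion 1: cells with frequency greater than a given value.
above10 <- cells[cells$Freq > 10, ]

# Criterion 2: the top N cells (N = 2 here).
top2 <- head(cells, 2)

# Criterion 3: cumulative share of the total, largest cells first.
cum_share <- cumsum(cells$Freq) / sum(cells$Freq)
top80 <- cells[cum_share <= 0.8, ]                           # stays at or below 80%
at_least80 <- cells[c(0, head(cum_share, -1)) < 0.8, ]       # first set reaching 80%

# Keep only the rows of Dataset that fall in the chosen cells,
# then drop factor levels that no longer occur.
keep <- top80[, c("ctry", "member")]
sub <- droplevels(merge(Dataset, keep))
```

Whether the cell that crosses the 80% line should be included is a judgment call, so both variants are shown. droplevels() is used here in place of the two factor() calls; it should have the same effect on every factor column at once.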
> -----Original Message-----
> A consulting client has a large data set with a binary response
> (negative) and two factors (ctry and member) which have many levels, but
> many occur with very small frequencies. It is far too sparse with a model like
> glm(negative ~ ctry+member, family=binomial).
>
> For analysis, we'd like to subset the data to include only those that occur with
> frequency greater than a given value

ave() helps with this kind of thing. Something like

    freq <- ave(1:length(ctry), factor(ctry:member), FUN=length)

gives the count for each ctry:member cell. Then you can subset a data frame using, for example,

    dfr.subset <- dfr[freq>10, ]

The 1:length(ctry) in the ave call is there simply because ave wants a numeric vector as its first argument. Since all we're doing with it is counting, it just has to be a numeric of the same length as your data; in a data frame it can be 1:nrow(dfr) etc.

S Ellison
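A minimal sketch of the ave() idea on made-up data follows. The data frame dfr and its values are invented for illustration, and passing the two factors as separate grouping arguments (rather than factor(ctry:member)) is a small variation chosen here so the columns can be referenced directly from the data frame; it should group rows by the same ctry:member cells.

```r
# Toy data: five ctry:member cells with 40, 10, 30, 15, 5 rows (invented values).
n_per_cell <- c(40, 10, 30, 15, 5)
dfr <- data.frame(
  ctry     = factor(rep(c("A", "A", "B", "C", "D"), times = n_per_cell)),
  member   = factor(rep(c("m1", "m2", "m1", "m3", "m4"), times = n_per_cell)),
  negative = rep(c(0L, 1L), length.out = sum(n_per_cell))
)

# For every row, the number of rows sharing its ctry:member combination.
# The first argument only needs to be a numeric vector with one entry per row.
freq <- ave(seq_len(nrow(dfr)), dfr$ctry, dfr$member, FUN = length)

# Keep rows whose cell occurs more than 10 times, then drop unused levels.
dfr.subset <- droplevels(dfr[freq > 10, ])
```

Because freq is aligned row by row with dfr, no merge step is needed: the logical index freq > 10 selects the rows directly.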