Mike Jasper
2007-Mar-13 14:38 UTC
[R] selecting rows with more than x occurrences in a given column (data type is names)
Despite a long search on the archives, I couldn't find how to do this. Thanks in advance for what is likely a simple issue.

I have a data set where the first column is name (i.e., 'Joe Smith', 'Jane Doe', etc). The following columns are data associated with that person. I have many people with multiple rows. What I want is to get a new data frame out with only the people who have more than x occurrences in the first column.

Here's what I've done, that's not working:

Let's call my old data.frame "all.data"

    table(all.data$names)>10

I get a list of names and TRUE/FALSE values. I then want to make a list of the TRUEs and pass that to some subset type command like

    dup.names=table(all.data$names)>10
    new.data=(all.data[all.data$names==dup.names,])

That's not working because the dimensions are wrong (I think). But even when I tried to do part of it manually (to troubleshoot) like this

    dup.names=c('Joe Smith','Jane Doe','etc')

I got warnings and it didn't work correctly. There must be a simple way to do this that I'm just not seeing. Thanks.
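A minimal sketch of why the attempt above misbehaves, using a small made-up data frame (the rows and the `wanted` vector are assumptions for illustration, not the poster's data). `table(...) > x` returns one TRUE/FALSE per distinct name rather than one per row, and `==` compares position by position while recycling the shorter vector, so it is not a membership test.

```r
# Hypothetical toy data; the name column is called `names` as in the post
all.data <- data.frame(
  names = c("Jane Doe", "Joe Smith", "Joe Smith", "Al Jones", "Jane Doe", "Joe Smith"),
  value = 1:6,
  stringsAsFactors = FALSE
)

# One logical per distinct name (3 here), not one per row (6 here)
dup.flags <- table(all.data$names) > 1

# `==` recycles `wanted` along the column and compares element by element
wanted   <- c("Joe Smith", "Jane Doe")
eq.match <- all.data$names == wanted    # only row 3 happens to line up
in.match <- all.data$names %in% wanted  # TRUE for every Joe Smith / Jane Doe row

print(eq.match)
print(in.match)
```

When the column length is not a multiple of the length of the comparison vector, R also warns about the recycling, which would be consistent with the warnings described in the question.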
Stephen Tucker
2007-Mar-13 14:59 UTC
[R] selecting rows with more than x occurrences in a given column (data type is names)
This isn't pretty, but should work:

    x <- 10 # number of occurrences
    y <- split(all.data, f = all.data$names)
    z <- y[unlist(lapply(y, nrow)) > x]
    newdata <- vector()
    for (k in z) {
      newdata <- rbind(newdata, k)
    }

Basically I split your data frame into groups by name (into a list), then selected the elements of the list for which the number of rows (number of occurrences) was > x, then concatenated the rows from the selected elements onto an initially empty vector. Probably there is a more elegant way to do this but I can't think of it at the moment...

You are correct that the '==' comparison is the problem: it compares element by element, recycling the shorter vector, so it cannot test membership against a vector of a different length.
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
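The split-and-recombine idea in this reply can also be written without the explicit loop. The following is a sketch only, on made-up data (the rows and the threshold `x <- 1` are assumptions); it swaps the `for` loop for `do.call(rbind, ...)`, which is a different way of stacking the pieces rather than the poster's own code.

```r
# Hypothetical toy data
all.data <- data.frame(
  names = c("Jane Doe", "Joe Smith", "Joe Smith", "Al Jones", "Jane Doe", "Joe Smith"),
  value = 1:6,
  stringsAsFactors = FALSE
)
x <- 1  # keep names with more than x rows

groups  <- split(all.data, all.data$names)   # one data frame per name
keep    <- groups[sapply(groups, nrow) > x]  # groups with more than x rows
newdata <- do.call(rbind, keep)              # stack the kept groups
rownames(newdata) <- NULL                    # drop the "Jane Doe.1"-style row names

print(newdata)
```

Note that the result comes back grouped by name (in the order `split()` produces), not in the original row order.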
Dimitris Rizopoulos
2007-Mar-13 15:02 UTC
[R] selecting rows with more than x occurrences in a given column(data type is names)
try this:

    set.seed(123)
    all.data <- data.frame(name = sample(c("Joe", "Elen", "Jane", "Mike"), 8, TRUE),
                           x = rnorm(8), y = runif(8))
    ##########
    tab.nams <- table(all.data$name)
    nams <- names(tab.nams[tab.nams >= 2])
    all.data[all.data$name %in% nams, ]

I hope it helps.

Best,
Dimitris

----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
     http://www.student.kuleuven.be/~m0390867/dimitris.htm
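The `table()` plus `%in%` approach in this reply can be wrapped so the threshold is a parameter. This is a sketch under assumptions: `keep_frequent` is a hypothetical helper name, the toy data is made up, and the comparison is a strict `>` to match the "more than x occurrences" wording of the question (the reply's example uses `>= 2`).

```r
# Hypothetical helper: keep rows whose value in column `col` occurs more than x times
keep_frequent <- function(data, col, x) {
  counts   <- table(data[[col]])         # occurrences per distinct value
  frequent <- names(counts[counts > x])  # values seen more than x times
  data[data[[col]] %in% frequent, , drop = FALSE]
}

# Hypothetical toy data
all.data <- data.frame(
  names = c("Jane Doe", "Joe Smith", "Joe Smith", "Al Jones", "Jane Doe", "Joe Smith"),
  value = 1:6,
  stringsAsFactors = FALSE
)

res <- keep_frequent(all.data, "names", 2)  # only Joe Smith has more than 2 rows
print(res)
```

Unlike the split-based approach, this keeps the surviving rows in their original order.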
Chuck Cleland
2007-Mar-13 15:10 UTC
[R] selecting rows with more than x occurrences in a given column (data type is names)
Does this help?

    df <- data.frame(PERSON = rep(c("John","Tom","Sara","Mary"), c(5,4,5,4)),
                     Y = runif(18))
    subset(df, PERSON %in% names(which(table(PERSON) >= 5)))

--
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 512-0171 (M, W, F)
fax: (917) 438-0894
Marc Schwartz
2007-Mar-13 15:32 UTC
[R] selecting rows with more than x occurrences in a given column (data type is names)
Something like this should work:

    NewDF <- subset(all.data, names %in% unique(names[duplicated(names)]))

See ?duplicated, ?unique and ?"%in%" for more information.

HTH,

Marc Schwartz
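One point worth checking with the `duplicated()` approach: it keeps every name that appears at least twice, so it answers the question only for x = 1. The sketch below shows that on made-up data, then adds a per-row count with `ave()` for an arbitrary threshold. The `ave()` variant is an alternative technique not proposed in the thread, and the data and `x <- 2` are assumptions.

```r
# Hypothetical toy data
all.data <- data.frame(
  names = c("Jane Doe", "Joe Smith", "Joe Smith", "Al Jones", "Jane Doe", "Joe Smith"),
  value = 1:6,
  stringsAsFactors = FALSE
)

# duplicated() flags second and later occurrences, so this keeps names seen >= 2 times
twice <- subset(all.data, names %in% unique(names[duplicated(names)]))

# Alternative: attach to each row the number of rows sharing its name
x <- 2
n.rows <- ave(seq_along(all.data$names), all.data$names, FUN = length)
more.than.x <- all.data[n.rows > x, ]

print(twice)
print(more.than.x)
```

Here `twice` drops only Al Jones, while `more.than.x` keeps only the Joe Smith rows.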