thr3ads.net - R help - [R] selecting rows with more than x occurrences in a given column (data type is names) [Mar 2007]

Mike Jasper

2007-Mar-13 14:38 UTC

[R] selecting rows with more than x occurrences in a given column (data type is names)

Despite a long search on the archives, I couldn't find how to do this.
Thanks in advance for what is likely a simple issue.

I have a data set where the first column is name (e.g., 'Joe Smith',
'Jane Doe', etc). The following columns are data associated with that
person. I have many people with multiple rows. What I want is to get a
new data frame out with only the people who have more than x
occurrences in the first column.

Here's what I've done, that's not working:

Let's call my old data.frame "all.data"

table(all.data$names)>10

I get a list of names and TRUE/FALSE values. I then want to make a
list of the TRUEs and pass that to some subset type command like

dup.names=table(all.data$names)>10

new.data=(all.data[all.data$names==dup.names,])

That's not working because the dimensions are wrong (I think). But
even when I tried to do part of it manually (to troubleshoot) like
this

dup.names=c('Joe Smith','Jane Doe','etc')

I got warnings and it didn't work correctly. There must be a simple
way to do this that I'm just not seeing. Thanks.

Stephen Tucker

2007-Mar-13 14:59 UTC

This isn't pretty, but should work:

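x <- 10                                  # threshold: keep names with more than x rows
y <- split(all.data, f = all.data$names) # list of data frames, one per name
z <- y[unlist(lapply(y, nrow)) > x]      # keep only the groups with more than x rows
newdata <- vector()                      # start from an empty object
for (k in z) {
  newdata <- rbind(newdata, k)           # append the rows of each selected group
}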

Basically I split your data frame into groups by name (into a list), then
selected elements in the list for which the number of rows (number of
occurrences) was > x, then concatenated rows from the selected elements to an
initially empty vector. Probably there is a more elegant way to do this but I
can't think of it at the moment...
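
The split, filter, and recombine steps described above can also be written without the explicit loop. This is only a sketch: the example data and the threshold are made up for illustration, and the column name "names" is taken from the question.

```r
# Made-up example data; the column name "names" follows the question.
all.data <- data.frame(names = c("Joe", "Joe", "Joe", "Jane", "Jane", "Ann"),
                       value = 1:6)
x <- 1  # keep names that occur more than x times

groups  <- split(all.data, all.data$names)        # one data frame per name
keep    <- vapply(groups, nrow, integer(1)) > x   # TRUE for groups with more than x rows
newdata <- do.call(rbind, groups[keep])           # stack the selected groups
rownames(newdata) <- NULL                         # drop the "Joe.1"-style row names
```

Note that the rows come back grouped by name rather than in their original order.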

You are correct that '==' is the wrong test here: it compares the two vectors
element by element (recycling the shorter one), so it cannot check each name
against a set of names of a different length.
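
A small illustration of the difference, using made-up names. With '==' the shorter vector is recycled and compared position by position, whereas %in% tests membership. When the longer length is not a multiple of the shorter one, R also issues a recycling warning, which is probably the warning mentioned in the question.

```r
nm   <- c("Jane", "Joe", "Joe", "Ann", "Jane", "Joe")  # made-up column of names
keep <- c("Joe", "Jane")                               # names we want to retain

by.position   <- nm == keep    # keep is recycled: Joe, Jane, Joe, Jane, Joe, Jane
by.membership <- nm %in% keep  # TRUE wherever nm is any of the names in keep

by.position    # FALSE FALSE  TRUE FALSE FALSE FALSE
by.membership  #  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
```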






Dimitris Rizopoulos

2007-Mar-13 15:02 UTC

try this:

set.seed(123)
all.data <- data.frame(name = sample(c("Joe", "Elen", "Jane", "Mike"),
                                     8, TRUE),
                       x = rnorm(8), y = runif(8))
##########
tab.nams <- table(all.data$name)
nams <- names(tab.nams[tab.nams >= 2])
all.data[all.data$name %in% nams, ]
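
For comparison, here is a sketch of a different technique from the one above: instead of building a table of counts and looking names up with %in%, ave() can attach the group size to every row directly. The data below are made up and fixed (no sampling) so the result is easy to check by eye.

```r
# Made-up example data using the column name from the message above.
all.data <- data.frame(name = c("Joe", "Elen", "Joe", "Jane",
                                "Mike", "Jane", "Joe", "Elen"),
                       x = 1:8)

# ave() returns, for every row, the size of that row's name group.
n.per.row <- ave(seq_len(nrow(all.data)), all.data$name, FUN = length)
result <- all.data[n.per.row >= 2, ]  # drops only the single "Mike" row
```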


I hope it helps.

Best,
Dimitris

----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
     http://www.student.kuleuven.be/~m0390867/dimitris.htm


Chuck Cleland

2007-Mar-13 15:10 UTC

Does this help?

df <- data.frame(PERSON = rep(c("John", "Tom", "Sara", "Mary"),
                              c(5, 4, 5, 4)),
                 Y = runif(18))

subset(df, PERSON %in% names(which(table(PERSON) >= 5)))
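
The one-liner above can be unpacked into its steps, which may make it easier to see what each piece returns. This is just a sketch that repeats the example data; the intermediate variable names are made up.

```r
df <- data.frame(PERSON = rep(c("John", "Tom", "Sara", "Mary"),
                              c(5, 4, 5, 4)),
                 Y = runif(18))

counts   <- table(df$PERSON)           # occurrences of each name
frequent <- names(which(counts >= 5))  # names that pass the threshold
result   <- df[df$PERSON %in% frequent, ]  # rows belonging to those names
```

Here frequent holds "John" and "Sara", so result has 10 rows.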
-- 
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 512-0171 (M, W, F)
fax: (917) 438-0894

Marc Schwartz

2007-Mar-13 15:32 UTC

Something like this should work to keep every name that occurs more than once:

  NewDF <- subset(all.data, names %in% unique(names[duplicated(names)]))

See ?duplicated, ?unique and ?"%in%" for more information.
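
Since duplicated() only distinguishes "once" from "more than once", a general threshold x needs actual counts. A minimal sketch, assuming the column is called "names" as in the question; keep_frequent() is a hypothetical helper written for this illustration, not a base R function.

```r
# keep_frequent() is a hypothetical helper, not part of base R.
keep_frequent <- function(data, column, x) {
  counts   <- table(data[[column]])
  frequent <- names(counts)[as.vector(counts) > x]  # names seen more than x times
  data[data[[column]] %in% frequent, , drop = FALSE]
}

# Made-up example data.
all.data <- data.frame(names = c("Joe", "Jane", "Joe", "Ann", "Joe", "Jane"),
                       value = 1:6)

more.than.once  <- keep_frequent(all.data, "names", 1)  # "Joe" and "Jane" rows
more.than.twice <- keep_frequent(all.data, "names", 2)  # only the "Joe" rows
```

With x = 1 this selects the same rows as the duplicated() approach.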

HTH,

Marc Schwartz
