Hi,
Maybe you should try data.table().
Please use dput() when posting example data.
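As a rough illustration of why dput() helps: it prints R code that recreates an object, so others can paste it straight into their session. The object names `dat_small` and `dat_copy` and the values below are made up for this sketch.

```r
# A tiny data frame standing in for the real data (hypothetical values)
dat_small <- data.frame(Area = c("Bob", "Fred"),
                        Sex  = c("F", "M"),
                        Year = c(2011L, 2012L),
                        y    = c(1L, 3L),
                        stringsAsFactors = FALSE)

# dput() prints R code that rebuilds the object; paste that into the email
dput(dat_small)

# Round trip: capture the printed code and evaluate it again
txt <- paste(capture.output(dput(dat_small)), collapse = "\n")
dat_copy <- eval(parse(text = txt))

# Check that the rebuilt object matches the original
identical(dat_small, dat_copy)
```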
dat1<- read.table(text="
Area Sex Year y
Bob F 2011 1
Bob F 2011 2
Bob F 2012 3
Bob M 2012 3
Bob M 2012 2
Fred F 2011 1
Fred F 2011 1
Fred F 2012 2
Fred M 2012 3
Fred M 2012 1
",sep="",header=TRUE,stringsAsFactors=FALSE)
library(data.table)
dt1<- data.table(dat1)
dt2<- dt1[,sum(y),by=list(Area,Sex,Year)]
dt2
#   Area Sex Year V1
#1:  Bob   F 2011  3
#2:  Bob   F 2012  3
#3:  Bob   M 2012  5
#4: Fred   F 2011  2
#5: Fred   F 2012  2
#6: Fred   M 2012  4
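The summed column can also be named directly inside the grouping call, which avoids the default name V1. This is only a sketch on the same ten rows; the names `dt_small` and `dt_named` are chosen here for illustration.

```r
library(data.table)

# Same ten rows as the example data, built directly as a data.table
dt_small <- data.table(
  Area = rep(c("Bob", "Fred"), each = 5),
  Sex  = rep(c("F", "F", "F", "M", "M"), times = 2),
  Year = rep(c(2011L, 2011L, 2012L, 2012L, 2012L), times = 2),
  y    = c(1L, 2L, 3L, 3L, 2L, 1L, 1L, 2L, 3L, 1L)
)

# Wrapping the expression in list(name = ...) names the result column "y"
dt_named <- dt_small[, list(y = sum(y)), by = list(Area, Sex, Year)]
dt_named
```

With the column named this way, a later setnames() call is not needed.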
#Speed
set.seed(28)
dat2<- data.frame(Area=sample(LETTERS,1e7,replace=TRUE),
                  Sex=sample(c("F","M"),1e7,replace=TRUE),
                  Year=sample(2005:2012,1e7,replace=TRUE),
                  y=sample(1:10,1e7,replace=TRUE))
system.time(datTest<- aggregate(y~.,data=dat2,sum))
#   user  system elapsed
# 18.056   1.336  19.424
datTest2<- datTest[order(datTest$Area,datTest$Sex,datTest$Year),]
row.names(datTest2)<- 1:nrow(datTest2)
dtTest<- data.table(dat2)
system.time({
setkey(dtTest,Area,Sex,Year)
dtTest2<- dtTest[,sum(y),by=list(Area,Sex,Year)]})
#   user  system elapsed
#  1.232   0.184   1.418
setnames(dtTest2,"V1","y")
identical(datTest2,as.data.frame(dtTest2))
#[1] TRUE
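For the next step mentioned in the question (applying population figures and fitting a Poisson model), one common approach is to keep the summed counts as the response and use log(population) as an offset, rather than modelling the rate directly. The sketch below assumes a population table keyed by the same three columns. The `pop` values are invented purely for illustration, and the model formula is just one possible choice.

```r
# Aggregated counts from the small example
agg <- data.frame(
  Area = rep(c("Bob", "Fred"), each = 3),
  Sex  = rep(c("F", "F", "M"), times = 2),
  Year = rep(c(2011, 2012, 2012), times = 2),
  y    = c(3, 3, 5, 2, 2, 4)
)

# HYPOTHETICAL population sizes, made up for this sketch only
pop <- data.frame(
  Area = rep(c("Bob", "Fred"), each = 3),
  Sex  = rep(c("F", "F", "M"), times = 2),
  Year = rep(c(2011, 2012, 2012), times = 2),
  pop  = c(100, 110, 120, 90, 95, 105)
)

# Attach the population to each group and compute the crude rate
agg2 <- merge(agg, pop, by = c("Area", "Sex", "Year"))
agg2$rate <- agg2$y / agg2$pop

# Poisson regression on the counts, with log(pop) as the offset
fit <- glm(y ~ Sex + offset(log(pop)), family = poisson, data = agg2)
summary(fit)
```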
A.K.
----- Original Message -----
From: Michael Liaw <michael.liaw at hotmail.com>
To: r-help at r-project.org
Cc:
Sent: Saturday, August 3, 2013 8:11 PM
Subject: [R] Group by a data frame with multiple columns
Hi
I'm trying to manipulate a data frame (that has about 10 million rows)
by "grouping" it with multiple columns. For example, say the data set
looks like:
Area  Sex  Year  y
Bob   F    2011  1
Bob   F    2011  2
Bob   F    2012  3
Bob   M    2012  3
Bob   M    2012  2
Fred  F    2011  1
Fred  F    2011  1
Fred  F    2012  2
Fred  M    2012  3
Fred  M    2012  1
And I want it to look like
Area  Sex  Year  Sum of y
Bob   F    2011  3
Bob   F    2012  3
Bob   M    2012  5
Fred  F    2011  2
Fred  F    2012  2
Fred  M    2012  4
I think I can use something like:
tmp <- aggregate(y ~ ., data = dat, FUN = sum)  # dat: placeholder name for the data frame
But due to the size it's really taking a strain on the computer (even with
64-bit R on a, yes unfortunately Windows, machine with 16GB RAM :( ). The
reason I want the data set in this form is that I then want to apply the
population information, get the "rate" from the "sum of y" column, and
fit a Poisson regression model.
I'm wondering (and would appreciate comments) whether there is a more
efficient way to do the process I described?
Cheers
Michael