Hi,
I am trying to compute aggregate sums over my test data.frame as follows:
testDF <- data.frame(v1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
                            "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
                     v2 = c("a1", "a1", "a1", "a2", "a3", "b1", "b1", "b2", "b2", "b2",
                            "c1", "c2", "c2", "c2", "c3", "d1", "d2", "d3", "d4", "d4"),
                     n1 = 1:20,
                     n2 = 21:40)
library(doBy)   # orderBy() below comes from the doBy package
testDF <- orderBy(~ v1 + v2, data = testDF)
rownames(testDF) <- NULL
> testDF
v1 v2 n1 n2
1 a a1 1 21
2 a a1 2 22
3 a a1 3 23
4 a a2 4 24
5 a a3 5 25
6 b b1 6 26
7 b b1 7 27
8 b b2 8 28
9 b b2 9 29
10 b b2 10 30
11 c c1 11 31
12 c c2 12 32
13 c c2 13 33
14 c c2 14 34
15 c c3 15 35
16 d d1 16 36
17 d d2 17 37
18 d d3 18 38
19 d d4 19 39
20  d d4 20 40
testDF.result <- aggregate(list(testDF$n1, testDF$n2),
                           by = list(testDF$v1, testDF$v2), FUN = sum)
> testDF.result
Group.1 Group.2 X1.20 X21.40
1 a a1 6 66
2 a a2 4 24
3 a a3 5 25
4 b b1 13 53
5 b b2 27 87
6 c c1 11 31
7 c c2 39 99
8 c c3 15 35
9 d d1 16 36
10 d d2 17 37
11 d d3 18 38
12       d      d4    39     79
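As a side note, the X1.20 and X21.40 column names appear because the lists passed to aggregate() were unnamed; naming the list elements keeps the output readable. A minimal sketch of the same call with names added (the names are an editing suggestion, not part of the original post):

# naming the list elements gives n1/n2 and v1/v2 column names in the result
testDF.result <- aggregate(list(n1 = testDF$n1, n2 = testDF$n2),
                           by = list(v1 = testDF$v1, v2 = testDF$v2),
                           FUN = sum)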
However, when I applied this to my real data, it failed: aggregate() seems to
require more memory than I currently have (I am using WinXP, R 2.8.0, 2GB RAM).
Basically I want to compute group-wise sums of my numeric fields (n1 and n2 in
the example above), grouped on the values of v1 and v2.
The problem is that my real data has many more numeric fields and grouping
fields than just two of each.
In SQL, I would do:
select v1, v2, sum(n1), sum(n2) from myData
group by v1, v2;
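For the many-column case, the same group-by can be written in base R by passing whole column subsets to aggregate() instead of listing each vector by hand. A minimal sketch, assuming v1 and v2 are the only grouping fields and every remaining field is numeric (this sketch is an editing addition, not from the original post):

# select grouping and numeric columns programmatically
grp.cols <- c("v1", "v2")                              # assumed grouping fields
num.cols <- names(testDF)[sapply(testDF, is.numeric)]  # all numeric fields
sums <- aggregate(testDF[num.cols], by = testDF[grp.cols], FUN = sum)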
Am I using the wrong function / library (or even the wrong approach)? If so,
can you suggest an alternative?
Any pointers are really appreciated.
Thanks,
Ferry
Hi,

I think I found the solution. Using the doBy library, I got:

testDF.result2 <- summaryBy(n1 + n2 ~ v1 + v2, data = testDF, FUN = sum)

> testDF.result2
   v1 v2 n1.sum n2.sum
1   a a1      6     66
2   a a2      4     24
3   a a3      5     25
4   b b1     13     53
5   b b2     27     87
6   c c1     11     31
7   c c2     39     99
8   c c3     15     35
9   d d1     16     36
10  d d2     17     37
11  d d3     18     38
12  d d4     39     79

In any case, did I do something wrong with the aggregate function?

Thanks,

Ferry

On Thu, Mar 19, 2009 at 6:09 PM, Ferry <fmi.mlist@gmail.com> wrote:
> [original message quoted in full -- trimmed]
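If the real data has many numeric columns, then, if I recall the doBy documentation correctly (treat this as an assumption), a dot on the left-hand side of the summaryBy formula means "all remaining numeric variables", which avoids listing the columns one by one:

# "." on the LHS: summarise every numeric variable not used for grouping
testDF.allsums <- summaryBy(. ~ v1 + v2, data = testDF, FUN = sum)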
Here are two solutions:

> aggregate(testDF[c("n1", "n2")], testDF[c("v1", "v2")], sum)
   v1 v2 n1 n2
1   a a1  6 66
2   a a2  4 24
3   a a3  5 25
4   b b1 13 53
5   b b2 27 87
6   c c1 11 31
7   c c2 39 99
8   c c3 15 35
9   d d1 16 36
10  d d2 17 37
11  d d3 18 38
12  d d4 39 79

> library(sqldf)
> sqldf("select v1, v2, sum(n1), sum(n2) from testDF group by v1, v2")
   v1 v2 sum(n1) sum(n2)
1   a a1       6      66
2   a a2       4      24
3   a a3       5      25
4   b b1      13      53
5   b b2      27      87
6   c c1      11      31
7   c c2      39      99
8   c c3      15      35
9   d d1      16      36
10  d d2      17      37
11  d d3      18      38
12  d d4      39      79

On Thu, Mar 19, 2009 at 9:18 PM, Ferry <fmi.mlist at gmail.com> wrote:
> [earlier messages quoted in full -- trimmed]
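Since the original complaint was memory, it may also help that sqldf accepts a dbname argument to build its temporary SQLite database on disk rather than in memory; a minimal sketch under that assumption:

# dbname = tempfile() asks sqldf to use an on-disk SQLite file,
# keeping the grouping work outside R's limited address space
library(sqldf)
res <- sqldf("select v1, v2, sum(n1), sum(n2) from testDF group by v1, v2",
             dbname = tempfile())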
Try this technique. I use it with large data objects since working through
indices is sometimes faster and uses less memory:
x <- read.table(textConnection(" v1 v2 n1 n2
1 a a1 1 21
2 a a1 2 22
3 a a1 3 23
4 a a2 4 24
5 a a3 5 25
6 b b1 6 26
7 b b1 7 27
8 b b2 8 28
9 b b2 9 29
10 b b2 10 30
11 c c1 11 31
12 c c2 12 32
13 c c2 13 33
14 c c2 14 34
15 c c3 15 35
16 d d1 16 36
17 d d2 17 37
18 d d3 18 38
19 d d4 19 39
20 d d4 20 40"), header=TRUE)
closeAllConnections()
# use indices to reduce memory
x.ind <- split(seq(nrow(x)), list(x$v1, x$v2), drop=TRUE)
# now aggregate using the indices
x.agg <- do.call(rbind, lapply(x.ind, function(.seg) {
    data.frame(v1 = x$v1[.seg[1]], v2 = x$v2[.seg[1]],
               n1 = sum(x$n1[.seg]), n2 = sum(x$n2[.seg]))
}))
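The same index-based idea should extend to any number of numeric columns; a sketch under the assumption that every non-grouping column is numeric (the grp.cols / num.cols helpers are an editing addition, not part of the original reply):

# reuse the x.ind indices, but sum every numeric column at once
grp.cols <- c("v1", "v2")
num.cols <- setdiff(names(x), grp.cols)
x.agg.all <- do.call(rbind, lapply(x.ind, function(.seg) {
    cbind(x[.seg[1], grp.cols, drop = FALSE],
          as.data.frame(as.list(colSums(x[.seg, num.cols, drop = FALSE]))))
}))
rownames(x.agg.all) <- NULL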
On Thu, Mar 19, 2009 at 9:09 PM, Ferry <fmi.mlist at gmail.com> wrote:
> [original message quoted in full -- trimmed]
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
On Thu, Mar 19, 2009 at 8:40 PM, jim holtman <jholtman at gmail.com> wrote:
> [index/split technique quoted in full -- trimmed]

This is basically the approach that the plyr package,
http://had.co.nz/plyr, uses behind a user-friendly interface.

Hadley

--
http://had.co.nz/
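For reference, a minimal plyr sketch of the same aggregation (an editing addition, not Hadley's code; numcolwise() is how I recall plyr summarising every numeric column per group):

# sum every numeric column within each v1/v2 group
library(plyr)
testDF.plyr <- ddply(testDF, .(v1, v2), numcolwise(sum))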