thr3ads.net - R help - [R] summarizing a data frame i.e. count -> group by [Oct 2011]

If this information is useful, please help other people find it:
Share via:

Giovanni Azua

2011-Oct-23 17:29 UTC

[R] summarizing a data frame i.e. count -> group by

Hello,

This is one problem at the time :)

I have a data frame df that looks like this:

  time partitioning_mode workload runtime
1     1          sharding    query     607
2     1          sharding    query      85
3     1          sharding    query      52
4     1          sharding    query      79
5     1          sharding    query      77
6     1          sharding    query      67
7     1          sharding    query      98
8     1          sharding  refresh    2932
9     1          sharding  refresh    2870
10    1          sharding  refresh    2877
11    1          sharding  refresh    2868
12    1       replication    query    2891
13    1       replication    query    2907
14    1       replication    query    2922
15    1       replication    query    2937

and if I could use SQL ... omg! I really wish I could! I would do exactly this:

insert into throughput
  select time, partitioning_mode, count(*)
  from data.frame 
  group by time, partitioning_mode

My attempted R versions are wrong and produce very cryptic error message:
> throughput <- aggregate(x=df[,c("time",
"partitioning_mode")], by=list(df$time,df$partitioning_mode), count)Error in `[.default`(df2, u_id, , drop = FALSE) : 
  incorrect number of dimensions
> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode),
count)Error in `[.default`(df2, u_id, , drop = FALSE) : 
  incorrect number of dimensions
>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning),
FUN=count)I cant comprehend what comes out from this one ... :(

and I thought C++ template errors were the most cryptic ;P

Many many thanks in advance,
Best regards,
Giovanni

David Winsemius

2011-Oct-23 18:00 UTC

head link

[R] summarizing a data frame i.e. count -> group by

On Oct 23, 2011, at 1:29 PM, Giovanni Azua wrote:
> Hello,
>
> This is one problem at the time :)
>
> I have a data frame df that looks like this:
> > df <-read.table(textConnection(" time partitioning_mode workload  
runtime
+ 1     1          sharding    query     607
+ 2     1          sharding    query      85
+ 3     1          sharding    query      52
+ 4     1          sharding    query      79
+ 5     1          sharding    query      77
+ 6     1          sharding    query      67
+ 7     1          sharding    query      98
+ 8     1          sharding  refresh    2932
+ 9     1          sharding  refresh    2870
+ 10    1          sharding  refresh    2877
+ 11    1          sharding  refresh    2868
+ 12    1       replication    query    2891
+ 13    1       replication    query    2907
+ 14    1       replication    query    2922
+ 15    1       replication    query    2937"))
 >
 > df$throughput <- ave(df$time, list(df$time, df$partitioning_mode),  
FUN=length)
 > df
    time partitioning_mode workload runtime throughput
1     1          sharding    query     607         11
2     1          sharding    query      85         11
3     1          sharding    query      52         11
4     1          sharding    query      79         11
5     1          sharding    query      77         11
6     1          sharding    query      67         11
7     1          sharding    query      98         11
8     1          sharding  refresh    2932         11
9     1          sharding  refresh    2870         11
10    1          sharding  refresh    2877         11
11    1          sharding  refresh    2868         11
12    1       replication    query    2891          4
13    1       replication    query    2907          4
14    1       replication    query    2922          4
15    1       replication    query    2937          4

>
> and if I could use SQL ... omg! I really wish I could! I would do  
> exactly this:
You can of, course use package sqldf, which would undoubtedly be good  
practice for me, but this seemed like a typical situation for using  
'ave'. You do need to use the FUN= construction in 'ave' because
that
argument appears after the triple dots in the argument list.
>
> insert into throughput
>  select time, partitioning_mode, count(*)
>  from data.frame
>  group by time, partitioning_mode
>
> My attempted R versions are wrong and produce very cryptic error  
> message:
>
>> throughput <- aggregate(x=df[,c("time",
"partitioning_mode")],
>> by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
>  incorrect number of dimensions
>
>> throughput <- aggregate(x=df, by=list(df$time,df 
>> $partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
>  incorrect number of dimensions
>
>> throughput <- tapply(X=df$time, INDEX=list(df$time,df 
>> $partitioning), FUN=count)
> I cant comprehend what comes out from this one ... :(
>
> and I thought C++ template errors were the most cryptic ;P
>
> Many many thanks in advance,
> Best regards,
> Giovanni
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT

Tyler Rinker

2011-Oct-23 18:05 UTC

head link

[R] summarizing a data frame i.e. count -> group by

This could be done with aggregate but I am unfamiliar with it so I'll give
what I think you want from your message using the library 'reshape' that
you'll have to doneload.  If you're problem is large the data.table
library would be much faster.
 
You haven't really said what you'd like to get from the output so
I'm going by what your code looks like you want. There is no count in R, the
function is called length (you may want sum but it does not appear that way). 
Also giving the list a bit of what you'd expect for an out put is often
helpful.
 
Here is the code(one of these three options is what you want I think:
 
library(reshape)
throughput1 <- cast(df, time~partitioning_mode, value="runtime", 
length)
throughput2 <- cast(df, partitioning_mode~time, value="runtime", 
length)
throughput3 <- cast(df, partitioning_mode + workload~time,
value="runtime", length)

----------------------------------------> From: bravegag at gmail.com
> Date: Sun, 23 Oct 2011 19:29:40 +0200
> To: r-help at r-project.org
> Subject: [R] summarizing a data frame i.e. count -> group by
>
> Hello,
>
> This is one problem at the time :)
>
> I have a data frame df that looks like this:
>
> time partitioning_mode workload runtime
> 1 1 sharding query 607
> 2 1 sharding query 85
> 3 1 sharding query 52
> 4 1 sharding query 79
> 5 1 sharding query 77
> 6 1 sharding query 67
> 7 1 sharding query 98
> 8 1 sharding refresh 2932
> 9 1 sharding refresh 2870
> 10 1 sharding refresh 2877
> 11 1 sharding refresh 2868
> 12 1 replication query 2891
> 13 1 replication query 2907
> 14 1 replication query 2922
> 15 1 replication query 2937
>
> and if I could use SQL ... omg! I really wish I could! I would do exactly
this:
>
> insert into throughput
> select time, partitioning_mode, count(*)
> from data.frame
> group by time, partitioning_mode
>
> My attempted R versions are wrong and produce very cryptic error message:
>
> > throughput <- aggregate(x=df[,c("time",
"partitioning_mode")], by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> incorrect number of dimensions
>
> > throughput <- aggregate(x=df,
by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> incorrect number of dimensions
>
> >throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning),
FUN=count)
> I cant comprehend what comes out from this one ... :(
>
> and I thought C++ template errors were the most cryptic ;P
>
> Many many thanks in advance,
> Best regards,
> Giovanni
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

jim holtman

2011-Oct-23 23:31 UTC

head link

[R] summarizing a data frame i.e. count -> group by

Another package to consider, especially if your dataframe is large, is
'data.table':
> tp <- read.table(textConnection(" time partitioning_mode workload
runtime+ 1     1          sharding    query     607
+ 2     1          sharding    query      85
+ 3     1          sharding    query      52
+ 4     1          sharding    query      79
+ 5     1          sharding    query      77
+ 6     1          sharding    query      67
+ 7     1          sharding    query      98
+ 8     1          sharding  refresh    2932
+ 9     1          sharding  refresh    2870
+ 10    1          sharding  refresh    2877
+ 11    1          sharding  refresh    2868
+ 12    1       replication    query    2891
+ 13    1       replication    query    2907
+ 14    1       replication    query    2922
+ 15    1       replication    query    2937"), as.is = TRUE, header =
TRUE)> closeAllConnections()
>
> require(data.table)Loading required package: data.table
data.table 1.7.1  For help type:
help("data.table")> tp <- data.table(tp)
> tp[+     , list(workload = workload
+         , runtime = runtime
+         , thruput = length(runtime)
+         )
+     , by = list(time, partitioning_mode)
+     ]
      time partitioning_mode workload runtime thruput
 [1,]    1          sharding    query     607      11
 [2,]    1          sharding    query      85      11
 [3,]    1          sharding    query      52      11
 [4,]    1          sharding    query      79      11
 [5,]    1          sharding    query      77      11
 [6,]    1          sharding    query      67      11
 [7,]    1          sharding    query      98      11
 [8,]    1          sharding  refresh    2932      11
 [9,]    1          sharding  refresh    2870      11
[10,]    1          sharding  refresh    2877      11
[11,]    1          sharding  refresh    2868      11
[12,]    1       replication    query    2891       4
[13,]    1       replication    query    2907       4
[14,]    1       replication    query    2922       4
[15,]    1       replication    query    2937       4


On Sun, Oct 23, 2011 at 1:29 PM, Giovanni Azua <bravegag at gmail.com>
wrote:> Hello,
>
> This is one problem at the time :)
>
> I have a data frame df that looks like this:
>
> ?time partitioning_mode workload runtime
> 1 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? 607
> 2 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?85
> 3 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?52
> 4 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?79
> 5 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?77
> 6 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?67
> 7 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?98
> 8 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2932
> 9 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2870
> 10 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2877
> 11 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2868
> 12 ? ?1 ? ? ? replication ? ?query ? ?2891
> 13 ? ?1 ? ? ? replication ? ?query ? ?2907
> 14 ? ?1 ? ? ? replication ? ?query ? ?2922
> 15 ? ?1 ? ? ? replication ? ?query ? ?2937
>
> and if I could use SQL ... omg! I really wish I could! I would do exactly
this:
>
> insert into throughput
> ?select time, partitioning_mode, count(*)
> ?from data.frame
> ?group by time, partitioning_mode
>
> My attempted R versions are wrong and produce very cryptic error message:
>
>> throughput <- aggregate(x=df[,c("time",
"partitioning_mode")], by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> ?incorrect number of dimensions
>
>> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode),
count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> ?incorrect number of dimensions
>
>>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning),
FUN=count)
> I cant comprehend what comes out from this one ... :(
>
> and I thought C++ template errors were the most cryptic ;P
>
> Many many thanks in advance,
> Best regards,
> Giovanni
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

Dennis Murphy

2011-Oct-24 04:42 UTC

head link

[R] summarizing a data frame i.e. count -> group by

And the plyr version of this would be (using DF as the data frame name)

## transform method, mapping length(runtime) to all observations
## similar to David's results:
library('plyr')
ddply(DF, .(time, partitioning_mode), transform, n = length(runtime))
# or equivalently, the newer and somewhat faster
ddply(DF, .(time, partitioning_mode), mutate, n = length(runtime))

# If you just want the counts, then use

ddply(DF, .(time, partitioning_mode), summarise, n = length(runtime))

##---------
# Just for fun, here's the equivalent SQL call using sqldf():

library('sqldf')
sqldf('select time partitioning_mode count(*) from DF group by time
partitioning_mode')

# which you can distribute over multiple lines for readability, e.g.

sqldf('select time, partitioning_mode, count(*) as n
      from DF
      group by time, partitioning_mode')

# Result:
  time partitioning_mode  n
1    1       replication  4
2    1          sharding 11

##---------
# To do the same type of summary in data.table (to follow up on Jim
Holtman's post), here's one way:

library(data.table)
dt <- data.table(DF, key = 'time, partitioning_mode')
dt[, list(n = length(runtime)), by = key(dt)]
     time partitioning_mode  n
[1,]    1       replication  4
[2,]    1          sharding 11


###------
HTH,
Dennis


On Sun, Oct 23, 2011 at 10:29 AM, Giovanni Azua <bravegag at gmail.com>
wrote:> Hello,
>
> This is one problem at the time :)
>
> I have a data frame df that looks like this:
>
> ?time partitioning_mode workload runtime
> 1 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? 607
> 2 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?85
> 3 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?52
> 4 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?79
> 5 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?77
> 6 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?67
> 7 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?98
> 8 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2932
> 9 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2870
> 10 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2877
> 11 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2868
> 12 ? ?1 ? ? ? replication ? ?query ? ?2891
> 13 ? ?1 ? ? ? replication ? ?query ? ?2907
> 14 ? ?1 ? ? ? replication ? ?query ? ?2922
> 15 ? ?1 ? ? ? replication ? ?query ? ?2937
>
> and if I could use SQL ... omg! I really wish I could! I would do exactly
this:
>
> insert into throughput
> ?select time, partitioning_mode, count(*)
> ?from data.frame
> ?group by time, partitioning_mode
>
> My attempted R versions are wrong and produce very cryptic error message:
>
>> throughput <- aggregate(x=df[,c("time",
"partitioning_mode")], by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> ?incorrect number of dimensions
>
>> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode),
count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> ?incorrect number of dimensions
>
>>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning),
FUN=count)
> I cant comprehend what comes out from this one ... :(
>
> and I thought C++ template errors were the most cryptic ;P
>
> Many many thanks in advance,
> Best regards,
> Giovanni
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Maybe Matching Threads

Search for more maybe matching threads

R help - Oct 2011 - summarizing a data frame i.e. count -> group by

[R] summarizing a data frame i.e. count -> group by

[R] summarizing a data frame i.e. count -> group by

[R] summarizing a data frame i.e. count -> group by

[R] summarizing a data frame i.e. count -> group by

[R] summarizing a data frame i.e. count -> group by

Maybe Matching Threads