Hello, This is one problem at the time :) I have a data frame df that looks like this: time partitioning_mode workload runtime 1 1 sharding query 607 2 1 sharding query 85 3 1 sharding query 52 4 1 sharding query 79 5 1 sharding query 77 6 1 sharding query 67 7 1 sharding query 98 8 1 sharding refresh 2932 9 1 sharding refresh 2870 10 1 sharding refresh 2877 11 1 sharding refresh 2868 12 1 replication query 2891 13 1 replication query 2907 14 1 replication query 2922 15 1 replication query 2937 and if I could use SQL ... omg! I really wish I could! I would do exactly this: insert into throughput select time, partitioning_mode, count(*) from data.frame group by time, partitioning_mode My attempted R versions are wrong and produce very cryptic error message:> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count)Error in `[.default`(df2, u_id, , drop = FALSE) : incorrect number of dimensions> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count)Error in `[.default`(df2, u_id, , drop = FALSE) : incorrect number of dimensions>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count)I cant comprehend what comes out from this one ... :( and I thought C++ template errors were the most cryptic ;P Many many thanks in advance, Best regards, Giovanni
On Oct 23, 2011, at 1:29 PM, Giovanni Azua wrote:> Hello, > > This is one problem at the time :) > > I have a data frame df that looks like this: >> df <-read.table(textConnection(" time partitioning_mode workload runtime + 1 1 sharding query 607 + 2 1 sharding query 85 + 3 1 sharding query 52 + 4 1 sharding query 79 + 5 1 sharding query 77 + 6 1 sharding query 67 + 7 1 sharding query 98 + 8 1 sharding refresh 2932 + 9 1 sharding refresh 2870 + 10 1 sharding refresh 2877 + 11 1 sharding refresh 2868 + 12 1 replication query 2891 + 13 1 replication query 2907 + 14 1 replication query 2922 + 15 1 replication query 2937")) > > df$throughput <- ave(df$time, list(df$time, df$partitioning_mode), FUN=length) > df time partitioning_mode workload runtime throughput 1 1 sharding query 607 11 2 1 sharding query 85 11 3 1 sharding query 52 11 4 1 sharding query 79 11 5 1 sharding query 77 11 6 1 sharding query 67 11 7 1 sharding query 98 11 8 1 sharding refresh 2932 11 9 1 sharding refresh 2870 11 10 1 sharding refresh 2877 11 11 1 sharding refresh 2868 11 12 1 replication query 2891 4 13 1 replication query 2907 4 14 1 replication query 2922 4 15 1 replication query 2937 4> > and if I could use SQL ... omg! I really wish I could! I would do > exactly this:You can of, course use package sqldf, which would undoubtedly be good practice for me, but this seemed like a typical situation for using 'ave'. You do need to use the FUN= construction in 'ave' because that argument appears after the triple dots in the argument list.> > insert into throughput > select time, partitioning_mode, count(*) > from data.frame > group by time, partitioning_mode > > My attempted R versions are wrong and produce very cryptic error > message: > >> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], >> by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > incorrect number of dimensions > >> throughput <- aggregate(x=df, by=list(df$time,df >> $partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > incorrect number of dimensions > >> throughput <- tapply(X=df$time, INDEX=list(df$time,df >> $partitioning), FUN=count) > I cant comprehend what comes out from this one ... :( > > and I thought C++ template errors were the most cryptic ;P > > Many many thanks in advance, > Best regards, > Giovanni > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius, MD West Hartford, CT
This could be done with aggregate but I am unfamiliar with it so I'll give what I think you want from your message using the library 'reshape' that you'll have to doneload. If you're problem is large the data.table library would be much faster. You haven't really said what you'd like to get from the output so I'm going by what your code looks like you want. There is no count in R, the function is called length (you may want sum but it does not appear that way). Also giving the list a bit of what you'd expect for an out put is often helpful. Here is the code(one of these three options is what you want I think: library(reshape) throughput1 <- cast(df, time~partitioning_mode, value="runtime", length) throughput2 <- cast(df, partitioning_mode~time, value="runtime", length) throughput3 <- cast(df, partitioning_mode + workload~time, value="runtime", length) ----------------------------------------> From: bravegag at gmail.com > Date: Sun, 23 Oct 2011 19:29:40 +0200 > To: r-help at r-project.org > Subject: [R] summarizing a data frame i.e. count -> group by > > Hello, > > This is one problem at the time :) > > I have a data frame df that looks like this: > > time partitioning_mode workload runtime > 1 1 sharding query 607 > 2 1 sharding query 85 > 3 1 sharding query 52 > 4 1 sharding query 79 > 5 1 sharding query 77 > 6 1 sharding query 67 > 7 1 sharding query 98 > 8 1 sharding refresh 2932 > 9 1 sharding refresh 2870 > 10 1 sharding refresh 2877 > 11 1 sharding refresh 2868 > 12 1 replication query 2891 > 13 1 replication query 2907 > 14 1 replication query 2922 > 15 1 replication query 2937 > > and if I could use SQL ... omg! I really wish I could! I would do exactly this: > > insert into throughput > select time, partitioning_mode, count(*) > from data.frame > group by time, partitioning_mode > > My attempted R versions are wrong and produce very cryptic error message: > > > throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > incorrect number of dimensions > > > throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > incorrect number of dimensions > > >throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count) > I cant comprehend what comes out from this one ... :( > > and I thought C++ template errors were the most cryptic ;P > > Many many thanks in advance, > Best regards, > Giovanni > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Another package to consider, especially if your dataframe is large, is 'data.table':> tp <- read.table(textConnection(" time partitioning_mode workload runtime+ 1 1 sharding query 607 + 2 1 sharding query 85 + 3 1 sharding query 52 + 4 1 sharding query 79 + 5 1 sharding query 77 + 6 1 sharding query 67 + 7 1 sharding query 98 + 8 1 sharding refresh 2932 + 9 1 sharding refresh 2870 + 10 1 sharding refresh 2877 + 11 1 sharding refresh 2868 + 12 1 replication query 2891 + 13 1 replication query 2907 + 14 1 replication query 2922 + 15 1 replication query 2937"), as.is = TRUE, header = TRUE)> closeAllConnections() > > require(data.table)Loading required package: data.table data.table 1.7.1 For help type: help("data.table")> tp <- data.table(tp) > tp[+ , list(workload = workload + , runtime = runtime + , thruput = length(runtime) + ) + , by = list(time, partitioning_mode) + ] time partitioning_mode workload runtime thruput [1,] 1 sharding query 607 11 [2,] 1 sharding query 85 11 [3,] 1 sharding query 52 11 [4,] 1 sharding query 79 11 [5,] 1 sharding query 77 11 [6,] 1 sharding query 67 11 [7,] 1 sharding query 98 11 [8,] 1 sharding refresh 2932 11 [9,] 1 sharding refresh 2870 11 [10,] 1 sharding refresh 2877 11 [11,] 1 sharding refresh 2868 11 [12,] 1 replication query 2891 4 [13,] 1 replication query 2907 4 [14,] 1 replication query 2922 4 [15,] 1 replication query 2937 4 On Sun, Oct 23, 2011 at 1:29 PM, Giovanni Azua <bravegag at gmail.com> wrote:> Hello, > > This is one problem at the time :) > > I have a data frame df that looks like this: > > ?time partitioning_mode workload runtime > 1 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? 607 > 2 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?85 > 3 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?52 > 4 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?79 > 5 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?77 > 6 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?67 > 7 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?98 > 8 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2932 > 9 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2870 > 10 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2877 > 11 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2868 > 12 ? ?1 ? ? ? replication ? ?query ? ?2891 > 13 ? ?1 ? ? ? replication ? ?query ? ?2907 > 14 ? ?1 ? ? ? replication ? ?query ? ?2922 > 15 ? ?1 ? ? ? replication ? ?query ? ?2937 > > and if I could use SQL ... omg! I really wish I could! I would do exactly this: > > insert into throughput > ?select time, partitioning_mode, count(*) > ?from data.frame > ?group by time, partitioning_mode > > My attempted R versions are wrong and produce very cryptic error message: > >> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > ?incorrect number of dimensions > >> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > ?incorrect number of dimensions > >>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count) > I cant comprehend what comes out from this one ... :( > > and I thought C++ template errors were the most cryptic ;P > > Many many thanks in advance, > Best regards, > Giovanni > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Data Munger Guru What is the problem that you are trying to solve?
And the plyr version of this would be (using DF as the data frame name) ## transform method, mapping length(runtime) to all observations ## similar to David's results: library('plyr') ddply(DF, .(time, partitioning_mode), transform, n = length(runtime)) # or equivalently, the newer and somewhat faster ddply(DF, .(time, partitioning_mode), mutate, n = length(runtime)) # If you just want the counts, then use ddply(DF, .(time, partitioning_mode), summarise, n = length(runtime)) ##--------- # Just for fun, here's the equivalent SQL call using sqldf(): library('sqldf') sqldf('select time partitioning_mode count(*) from DF group by time partitioning_mode') # which you can distribute over multiple lines for readability, e.g. sqldf('select time, partitioning_mode, count(*) as n from DF group by time, partitioning_mode') # Result: time partitioning_mode n 1 1 replication 4 2 1 sharding 11 ##--------- # To do the same type of summary in data.table (to follow up on Jim Holtman's post), here's one way: library(data.table) dt <- data.table(DF, key = 'time, partitioning_mode') dt[, list(n = length(runtime)), by = key(dt)] time partitioning_mode n [1,] 1 replication 4 [2,] 1 sharding 11 ###------ HTH, Dennis On Sun, Oct 23, 2011 at 10:29 AM, Giovanni Azua <bravegag at gmail.com> wrote:> Hello, > > This is one problem at the time :) > > I have a data frame df that looks like this: > > ?time partitioning_mode workload runtime > 1 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? 607 > 2 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?85 > 3 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?52 > 4 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?79 > 5 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?77 > 6 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?67 > 7 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?98 > 8 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2932 > 9 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2870 > 10 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2877 > 11 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2868 > 12 ? ?1 ? ? ? replication ? ?query ? ?2891 > 13 ? ?1 ? ? ? replication ? ?query ? ?2907 > 14 ? ?1 ? ? ? replication ? ?query ? ?2922 > 15 ? ?1 ? ? ? replication ? ?query ? ?2937 > > and if I could use SQL ... omg! I really wish I could! I would do exactly this: > > insert into throughput > ?select time, partitioning_mode, count(*) > ?from data.frame > ?group by time, partitioning_mode > > My attempted R versions are wrong and produce very cryptic error message: > >> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > ?incorrect number of dimensions > >> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > ?incorrect number of dimensions > >>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count) > I cant comprehend what comes out from this one ... :( > > and I thought C++ template errors were the most cryptic ;P > > Many many thanks in advance, > Best regards, > Giovanni > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >