Sivakumaran Raman
2005-Mar-29 01:15 UTC
[R] Aggregating data (with more than one function)
I have the data similar to the following in a data frame: LastName Department Salary 1 Johnson IT 56000 2 James HR 54223 3 Howe Finance 80000 4 Jones Finance 82000 5 Norwood IT 67000 6 Benson Sales 76000 7 Smith Sales 65778 8 Baker HR 56778 9 Dempsey HR 78999 10 Nolan Sales 45667 11 Garth Finance 89777 12 Jameson IT 56786 I want to calculate both the mean salary broken down by Department and also the total amount paid out per department i.e. I want both sum(Salary) and mean(Salary) for each Department. Right now, I am using aggregate.data.frame twice, creating two data frames, and then combining them using data.frame. However, this seems to be very memory and processor intensive and is taking a very long time on my data set. Is there a quicker way to do this? Thanks in advance, Siv Raman
Here's one possible way, using the data you supplied:> dat <- read.table("clipboard", header=T, row=1) > do.call("rbind",by(dat$Salary, dat$Department, function(x) c(mean=mean(x),total=sum(x)))) mean total Finance 83925.67 251777 HR 63333.33 190000 IT 59928.67 179786 Sales 62481.67 187445 If you need the department names as a variable, you can add that easily. HTH, Andy> From: Sivakumaran Raman > > I have the data similar to the following in a data frame: > LastName Department Salary > 1 Johnson IT 56000 > 2 James HR 54223 > 3 Howe Finance 80000 > 4 Jones Finance 82000 > 5 Norwood IT 67000 > 6 Benson Sales 76000 > 7 Smith Sales 65778 > 8 Baker HR 56778 > 9 Dempsey HR 78999 > 10 Nolan Sales 45667 > 11 Garth Finance 89777 > 12 Jameson IT 56786 > > I want to calculate both the mean salary broken down by > Department and > also the > total amount paid out per department i.e. I want both sum(Salary) and > mean(Salary) for each Department. Right now, I am using > aggregate.data.frame > twice, creating two data frames, and then combining them > using data.frame. > However, this seems to be very memory and processor intensive and is > taking a > very long time on my data set. Is there a quicker way to do this? > > Thanks in advance, > Siv Raman > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > > >
On Mon, 2005-03-28 at 19:15 -0600, Sivakumaran Raman wrote:> I have the data similar to the following in a data frame: > LastName Department Salary > 1 Johnson IT 56000 > 2 James HR 54223 > 3 Howe Finance 80000 > 4 Jones Finance 82000 > 5 Norwood IT 67000 > 6 Benson Sales 76000 > 7 Smith Sales 65778 > 8 Baker HR 56778 > 9 Dempsey HR 78999 > 10 Nolan Sales 45667 > 11 Garth Finance 89777 > 12 Jameson IT 56786 > > I want to calculate both the mean salary broken down by Department and > also the > total amount paid out per department i.e. I want both sum(Salary) and > mean(Salary) for each Department. Right now, I am using aggregate.data.frame > twice, creating two data frames, and then combining them using data.frame. > However, this seems to be very memory and processor intensive and is > taking a > very long time on my data set. Is there a quicker way to do this? > > Thanks in advance, > Siv RamanHere is one approach: Presuming that 'df' is your data frame: # Create a function that returns both values my.summ <- function(x) { c(mean = mean(x), sum = sum(x)) } # Now split() 'df' by Department df.s <- split(df$Salary, df$Department) # Now run the summary, using sapply()> sapply(df.s, my.summ)Finance HR IT Sales mean 83925.67 63333.33 59928.67 62481.67 sum 251777 190000 179786 187445 HTH, Marc Schwartz
Adaikalavan Ramasamy
2005-Mar-29 01:59 UTC
[R] Aggregating data (with more than one function)
In the Arguments section of help(aggregate), you will find : FUN: a scalar function to compute the summary statistics which can be applied to all data subsets. a) So you can try the 'by' function :> by( df[ , 3], df$Department, function(x) c(mean(x), sum(x)) )INDICES: Finance [1] 83925.67 251777.00 ------------------------------------------------------------ INDICES: HR [1] 63333.33 190000.00 ------------------------------------------------------------ INDICES: IT [1] 59928.67 179786.00 ------------------------------------------------------------ INDICES: Sales [1] 62481.67 187445.00 b) or use tapply more directly :> tmp <- tapply(df$Salary, df$Department, function(x)c( mean(x), sum(x) ) ) $Finance [1] 83925.67 251777.00 $HR [1] 63333.33 190000.00 $IT [1] 59928.67 179786.00 $Sales [1] 62481.67 187445.00 And using the 'sapply( tmp, c )' gives you a slightly more compact output as Finance HR IT Sales [1,] 83925.67 63333.33 59928.67 62481.67 [2,] 251777.00 190000.00 179786.00 187445.00 Regards, Adai On Mon, 2005-03-28 at 19:15 -0600, Sivakumaran Raman wrote:> I have the data similar to the following in a data frame: > LastName Department Salary > 1 Johnson IT 56000 > 2 James HR 54223 > 3 Howe Finance 80000 > 4 Jones Finance 82000 > 5 Norwood IT 67000 > 6 Benson Sales 76000 > 7 Smith Sales 65778 > 8 Baker HR 56778 > 9 Dempsey HR 78999 > 10 Nolan Sales 45667 > 11 Garth Finance 89777 > 12 Jameson IT 56786 > > I want to calculate both the mean salary broken down by Department and > also the > total amount paid out per department i.e. I want both sum(Salary) and > mean(Salary) for each Department. Right now, I am using aggregate.data.frame > twice, creating two data frames, and then combining them using data.frame. > However, this seems to be very memory and processor intensive and is > taking a > very long time on my data set. Is there a quicker way to do this? > > Thanks in advance, > Siv Raman > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >
Dear list & Andy, I am hopelessly stumped, how would one add the department names as a variable? Robin> Robin Tori Schroeder > International Institute for Sustainability > P.O. Box 873211 > Arizona State University > Tempe, Arizona 85287-3211 > Phone: (480) 727-7290 > >-----Original Message----- From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch]On Behalf Of Liaw, Andy Sent: Monday, March 28, 2005 6:45 PM To: 'Sivakumaran Raman'; r-help at stat.math.ethz.ch Subject: RE: [R] Aggregating data (with more than one function) Here's one possible way, using the data you supplied:> dat <- read.table("clipboard", header=T, row=1) > do.call("rbind",by(dat$Salary, dat$Department, function(x) c(mean=mean(x),total=sum(x)))) mean total Finance 83925.67 251777 HR 63333.33 190000 IT 59928.67 179786 Sales 62481.67 187445 If you need the department names as a variable, you can add that easily. HTH, Andy> From: Sivakumaran Raman > > I have the data similar to the following in a data frame: > LastName Department Salary > 1 Johnson IT 56000 > 2 James HR 54223 > 3 Howe Finance 80000 > 4 Jones Finance 82000 > 5 Norwood IT 67000 > 6 Benson Sales 76000 > 7 Smith Sales 65778 > 8 Baker HR 56778 > 9 Dempsey HR 78999 > 10 Nolan Sales 45667 > 11 Garth Finance 89777 > 12 Jameson IT 56786 > > I want to calculate both the mean salary broken down by > Department and > also the > total amount paid out per department i.e. I want both sum(Salary) and > mean(Salary) for each Department. Right now, I am using > aggregate.data.frame > twice, creating two data frames, and then combining them > using data.frame. > However, this seems to be very memory and processor intensive and is > taking a > very long time on my data set. Is there a quicker way to do this? > > Thanks in advance, > Siv Raman > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > > >______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> agg.dat <- do.call("rbind", by(dat$Salary, dat$Department,+ function(x) c(mean=mean(x), total=sum(x))))> agg.dat <- data.frame(dept=rownames(agg.dat), agg.dat) > agg.datdept mean total Finance Finance 83925.67 251777 HR HR 63333.33 190000 IT IT 59928.67 179786 Sales Sales 62481.67 187445 Andy> From: Robin Schroeder > > Dear list & Andy, > > I am hopelessly stumped, how would one add the department > names as a variable? > > Robin > > > Robin Tori Schroeder > > International Institute for Sustainability > > P.O. Box 873211 > > Arizona State University > > Tempe, Arizona 85287-3211 > > Phone: (480) 727-7290 > > > > > > > -----Original Message----- > From: r-help-bounces at stat.math.ethz.ch > [mailto:r-help-bounces at stat.math.ethz.ch]On Behalf Of Liaw, Andy > Sent: Monday, March 28, 2005 6:45 PM > To: 'Sivakumaran Raman'; r-help at stat.math.ethz.ch > Subject: RE: [R] Aggregating data (with more than one function) > > > Here's one possible way, using the data you supplied: > > > dat <- read.table("clipboard", header=T, row=1) > > do.call("rbind",by(dat$Salary, dat$Department, function(x) > c(mean=mean(x), > total=sum(x)))) > mean total > Finance 83925.67 251777 > HR 63333.33 190000 > IT 59928.67 179786 > Sales 62481.67 187445 > > If you need the department names as a variable, you can add > that easily. > > HTH, > Andy > > > From: Sivakumaran Raman > > > > I have the data similar to the following in a data frame: > > LastName Department Salary > > 1 Johnson IT 56000 > > 2 James HR 54223 > > 3 Howe Finance 80000 > > 4 Jones Finance 82000 > > 5 Norwood IT 67000 > > 6 Benson Sales 76000 > > 7 Smith Sales 65778 > > 8 Baker HR 56778 > > 9 Dempsey HR 78999 > > 10 Nolan Sales 45667 > > 11 Garth Finance 89777 > > 12 Jameson IT 56786 > > > > I want to calculate both the mean salary broken down by > > Department and > > also the > > total amount paid out per department i.e. I want both > sum(Salary) and > > mean(Salary) for each Department. Right now, I am using > > aggregate.data.frame > > twice, creating two data frames, and then combining them > > using data.frame. > > However, this seems to be very memory and processor > intensive and is > > taking a > > very long time on my data set. Is there a quicker way to do this? > > > > Thanks in advance, > > Siv Raman > > > > ______________________________________________ > > R-help at stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide! > > http://www.R-project.org/posting-guide.html > > > > > > > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > > >
I am looking for an answer to a similar question - a generalized solution that would be able to apply (1) any number of functions (2) to any number of vectors (3) by any number of factors (just like SQL's group by). The output data frame must contain the values of the by factors, to be used for joins. Aggregate() does (2) and (3). The solutions posted to this thread (split+sapply, by, tapply) do (1) and (3) (or so it seems to me). What would be the best way to get to (1)+(2)+(3)? I am inclined to use aggregate() in a loop with eval(parse(text="aggregate expression here")). Running groupby <- do.call("rbind", by(var_i, list(a,b,c,d,e,f), function(x) c(fct1(x),fct2(x),fct3(x),fct4(x)))) in a loop (var_1, var_2 etc) would be very nice but I don't know how to add a-f as columns in the output data frame. Thank you, b. On Mon, 2005-03-28 at 19:15 -0600, Sivakumaran Raman wrote:> I have the data similar to the following in a data frame: > LastName Department Salary > 1 Johnson IT 56000 > 2 James HR 54223 > 3 Howe Finance 80000 > 4 Jones Finance 82000 > 5 Norwood IT 67000 > 6 Benson Sales 76000 > 7 Smith Sales 65778 > 8 Baker HR 56778 > 9 Dempsey HR 78999 > 10 Nolan Sales 45667 > 11 Garth Finance 89777 > 12 Jameson IT 56786 > > I want to calculate both the mean salary broken down by Department and > also the > total amount paid out per department i.e. I want both sum(Salary) and > mean(Salary) for each Department. Right now, I am using aggregate.data.frame > twice, creating two data frames, and then combining them using data.frame. > However, this seems to be very memory and processor intensive and is > taking a > very long time on my data set. Is there a quicker way to do this? > > Thanks in advance, > Siv Raman > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html