Eva Powers
2011-Oct-05 23:45 UTC
[R] aggregate function with a dataframe for both "x" and "by"
I have 2 dataframes. "mydata" contains numerical data. "mybys" contains information on the "group" each row of the data is in. I wish to aggregate each column in mydata using the corresponding column in mybys. Please see the example below. What is a more elegant or "better" way to accomplish this task? Thanks! mydata = data.frame(testvar1=c(1,3,5,7,8,3,5,NA,4,5,7,9), testvar2=c(11,33,55,77,88,33,55,NA,44,55,77,99) ) mybys=data.frame(mbn1=c('red','blue',1,2,NA,'big',1,2,'red',1,NA,12),mbn2=c('wet','dry',99,95,NA,'damp',95,99,'red',99,NA,NA) , stringsAsFactors =F) myaggs <- data.frame(matrix(data=NA, nrow=nrow(mydata), ncol=ncol(mydata) ) ) for(i in 1: ncol(mydata) ) { temp <- aggregate(mydata[i], by = as.list(mybys[i]), FUN=sum, na.rm=T) rownums <- match(mybys[,i],temp[,1]) myaggs[,i] <- temp[rownums,2] } myaggs Finally, how do I convert and use "mybys" to factors, so that I can tell R that the NA values form a group? I tried substituting this line above: temp <- aggregate(mydata[,i], by = as.list(mybys[,i]), FUN=sum, na.rm=T) ... but get the error message: "Error in aggregate.data.frame(as.data.frame(x), ...) : arguments must have same length" [[alternative HTML version deleted]]
Dennis Murphy
2011-Oct-06 02:37 UTC
[R] aggregate function with a dataframe for both "x" and "by"
Hi: It's a little tricky to read in a data frame 'by hand' without making NA a default missing value; you've got to trick it a bit. I'm doing this inefficiently, but if you have the two 'real' data sets stored in separate files, read.table() is the way to go since it provides an option for defining the form of the missing values. data.frame() doesn't have that option. To that end, mydata <- data.frame(testvar1=c(1,3,5,7,8,3,5,NA,4,5,7,9), testvar2=c(11,33,55,77,88,33,55,NA,44,55,77,99)) mybys <- data.frame(mbn1=c('red','blue',1,2,NA,'big',1,2,'red',1,NA,12), mbn2=c('wet','dry',99,95,NA,'damp',95,99,'red',99,NA,NA), stringsAsFactors = FALSE) mybys # You can tell that NA is a missing value since NA is printed as <NA> (character NA). # This is not the method you want, but the following 'works': lines <- " mbn1 mbn2 red wet blue dry 1 99 2 95 NA NA big damp 1 95 2 99 red red 1 99 NA NA 12 NA" mybys <- read.table(textConnection(lines), stringsAsFactors = FALSE, na.strings = "") closeAllConnections() # Now mybys treats NA as a character string. # Moral: if NA is a legitimate value, keep the data in an external file # and read it in with read.table(), using the na.strings = argument # to specify an alternative missing value string. You don't need to # use textConnection() as I did if you have the data in an external file. Next problem: mydata has 12 rows, mybys has 13. I combined the two data frames with cbind() using only the first 12 rows of mybys and then used the ddply() function from the plyr package to do the groupwise summation in addition to aggregate(). [Several other packages would also work here, including doBy and data.table.] The results are slightly different. # Combined data: myd <- cbind(mydata, mybys[1:12, ]) # Summation function with na.rm = TRUE: # Will return 0 if all values of x are NA sfun <- function(x) sum(x, na.rm = TRUE) # aggregate() version: aggregate(cbind(testvar1, testvar2) ~ V1 + V2, data = myd, FUN = sfun) V1 V2 testvar1 testvar2 1 2 95 8 88 2 1 99 14 154 3 2 99 4 44 4 big damp 5 55 5 blue dry 5 55 6 mbn1 mbn2 1 11 7 NA NA 12 132 8 red red 5 55 9 red wet 3 33 # ddply() version: library('plyr') ddply(myd, .(V1, V2), colwise(sfun, c('testvar1', 'testvar2'))) V1 V2 testvar1 testvar2 1 1 95 0 0 2 1 99 14 154 3 2 95 8 88 4 2 99 4 44 5 big damp 5 55 6 blue dry 5 55 7 mbn1 mbn2 1 11 8 NA NA 12 132 9 red red 5 55 10 red wet 3 33 Hope this is what you were after. Dennis On Wed, Oct 5, 2011 at 4:45 PM, Eva Powers <evap4442 at gmail.com> wrote:> I have 2 dataframes. ?"mydata" contains numerical data. "mybys" contains > information on the "group" each row of the data is in. ?I wish to aggregate > each column in mydata using the corresponding column in mybys. > > > > Please see the example below. ?What is a more elegant or "better" way to > accomplish this task? > > > > Thanks! > > > > > > mydata = data.frame(testvar1=c(1,3,5,7,8,3,5,NA,4,5,7,9), > testvar2=c(11,33,55,77,88,33,55,NA,44,55,77,99) > ) > > > > mybys=data.frame(mbn1=c('red','blue',1,2,NA,'big',1,2,'red',1,NA,12),mbn2=c('wet','dry',99,95,NA,'damp',95,99,'red',99,NA,NA) > , stringsAsFactors =F) > > > > myaggs <- data.frame(matrix(data=NA, nrow=nrow(mydata), ncol=ncol(mydata) ) > ) > > > > for(i in 1: ncol(mydata) ) { > > temp <- aggregate(mydata[i], by = as.list(mybys[i]), FUN=sum, na.rm=T) > > rownums <- match(mybys[,i],temp[,1]) > > myaggs[,i] <- temp[rownums,2] > > } > > myaggs > > > > > > Finally, how do I convert and use "mybys" to factors, so that I can tell R > that the NA values form a group? > > > > I tried substituting this line above: > > temp <- aggregate(mydata[,i], by = as.list(mybys[,i]), FUN=sum, na.rm=T) > > > > ... but get the error message: "Error in > aggregate.data.frame(as.data.frame(x), ...) : > ?arguments must have same length" > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
David Winsemius
2011-Oct-06 03:47 UTC
[R] aggregate function with a dataframe for both "x" and "by"
On Oct 5, 2011, at 7:45 PM, Eva Powers wrote:> I have 2 dataframes. "mydata" contains numerical data. "mybys" > contains > information on the "group" each row of the data is in. I wish to > aggregate > each column in mydata using the corresponding column in mybys.corresponding?> > Please see the example below. What is a more elegant or "better" > way to > accomplish this task? > > mydata = data.frame(testvar1=c(1,3,5,7,8,3,5,NA,4,5,7,9), > testvar2=c(11,33,55,77,88,33,55,NA,44,55,77,99) ) > > mybys=data.frame(mbn1=c('red','blue',1,2,NA,'big',1,2,'red',1,NA, > 12),mbn2=c('wet','dry',99,95,NA,'damp',95,99,'red',99,NA,NA) , > stringsAsFactors =F) > > myaggs <- data.frame(matrix(data=NA, nrow=nrow(mydata), > ncol=ncol(mydata) ) ) > > for(i in 1: ncol(mydata) ) {temp <- aggregate(mydata[i], by = > as.list(mybys[i]), FUN=sum, na.rm=T) > rownums <- match(mybys[,i],temp[,1]) > myaggs[,i] <- temp[rownums,2] } > myaggs > > Finally, how do I convert and use "mybys" to factors, so that I can > tell R > that the NA values form a group? > > I tried substituting this line above: > > temp <- aggregate(mydata[,i], by = as.list(mybys[,i]), FUN=sum, > na.rm=T) > > ... but get the error message: "Error in > aggregate.data.frame(as.data.frame(x), ...) : > arguments must have same length" >David Winsemius, MD West Hartford, CT
Possibly Parallel Threads
- aggregate.data.frame - prevent conversion to factors? show statistics for NA values of "by" variable?
- Dealing with null values Aggregate function
- incomplete final line found on <name of my sourced function file>
- Tip/Question about encoding temporary data for storage in Asterisk variable to use in AGI
- Help with aggregation