Thomas Pujol
2007-Jul-31 20:18 UTC
[R] aggregate.data.frame - prevent conversion to factors? show statistics for NA values of "by" variable?
I have two questions regarding the "aggregate.data.frame" method of the "aggregate" function.

My situation:

a. My "x" variable is a data.frame ("mydf") with two columns, both columns of type/format "numeric".

b. My "by" variable is a data.frame ("mybys") with two columns, both columns of type/format "character".

c. Some of the values contained in "mybys" are originally "NA". Prior to submitting the by variables to the aggregate function, I convert the NA values to the text-string "is_na". (I do this because I want to understand the statistics of variables where their "by" value is NA, and want this information in the results of the aggregate function.)

My questions:

1. Is there a "better" way (other than converting NAs to some text-string) to see the "statistics" ("mean", etc.) of the variables where the by is "NA"? (i.e. to have them included within the results of the aggregate function)

2. When I run the aggregate function, the two columns that contain the "by" variables are always formatted as "factors". Is there a way to prevent this, and to instead have them retain the format in the original "mybys" data.frame (i.e. to have them come back formatted as "character")? Or do I just need to re-format them once I have my results?

mydf=data.frame(testvar1=c(1,3,5,7,8,3,5,NA,4,5,7,9), testvar2=c(11,33,55,77,88,33,55,NA,44,55,77,99) )
str(mydf)
#

myby1=c('red','blue',1,2,NA,'big',1,2,'red',1,NA,12)
myby2=c('wet','dry',99,95,NA,'damp',95,99,'red',99,NA,NA)

myby1.new = ifelse(is.na(myby1)==T,"is_na",myby1)
myby2.new = ifelse(is.na(myby2)==T,"is_na",myby2)
str(myby1.new)
str(myby2.new)

mybys=data.frame(mbn1=myby1.new,mbn2=myby2.new , stringsAsFactors =F)
str(mybys)

#
myagg1 = aggregate(x=mydf, by=mybys, FUN='mean')
str(myagg1)

myagg2 = myagg1
myagg2[1:ncol(mybys)] = as.character(unlist(myagg1[1:ncol(mybys)]))
str(myagg2)

myagg1
myagg2
Prof Brian Ripley
2007-Aug-01 03:24 UTC
[R] aggregate.data.frame - prevent conversion to factors? show statistics for NA values of "by" variable?
The behaviour has been changed in the R-devel version of R, so the 'by' columns are not converted to factors.

On Tue, 31 Jul 2007, Thomas Pujol wrote:

> I have two questions regarding the "aggregate.data.frame" method of the
> "aggregate" function.
>
> My situation:
>
> a. My "x" variable is a data.frame ("mydf") with two columns, both
> columns of type/format "numeric".
>
> b. My "by" variable is a data.frame ("mybys") with two columns, both
> columns of type/format "character".
>
> c. Some of the values contained in "mybys" are originally "NA".

I think you mean NA_character_, not the same thing.

> Prior to submitting the by variables to the aggregate function, I
> convert the NA values to the text-string "is_na". (I do this because I
> want to understand the statistics of variables where their "by" value is
> NA, and want this information in the results of the aggregate function.)
>
> My questions:
>
> 1. Is there a "better" way (other than converting NAs to some
> text-string) to see the "statistics" ("mean", etc.) of the variables
> where the by is "NA"? (i.e. to have them included within the results of
> the aggregate function)

You need to tell R that the NA (not "NA") values form a group, which is not obvious as they are unknown. So you do need to recode them. Making them a factor is the obvious way (with exclude=""), and I don't understand your aversion to factors for categorical variables.

> 2. When I run the aggregate function, the two columns that contain the
> "by" variables are always formatted as "factors". Is there a way to
> prevent this, and to instead have them retain the format in the original
> "mybys" data.frame (i.e. to have them come back formatted as "character")?
> Or do I just need to re-format them once I have my results?
-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
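
The factor-based recoding suggested in the reply can be sketched as follows. This is only an illustration under some assumptions: it uses a recent R version, it passes exclude = NULL to factor() (rather than exclude = "") so that NA is kept as a level, and the names "myagg" and "by_cols" are made up for the example. The data are those from the original post.

```r
# Data from the original post
mydf <- data.frame(
  testvar1 = c(1, 3, 5, 7, 8, 3, 5, NA, 4, 5, 7, 9),
  testvar2 = c(11, 33, 55, 77, 88, 33, 55, NA, 44, 55, 77, 99)
)
myby1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
myby2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)

# Recode the 'by' variables as factors in which NA is an explicit level,
# so rows with a missing 'by' value form their own group instead of
# having to be replaced by a placeholder string such as "is_na".
mybys <- list(
  mbn1 = factor(myby1, exclude = NULL),
  mbn2 = factor(myby2, exclude = NULL)
)

myagg <- aggregate(x = mydf, by = mybys, FUN = mean)

# Convert the grouping columns back to character, one column at a time
# ('by_cols' is a hypothetical helper name for this sketch).
by_cols <- names(mybys)
myagg[by_cols] <- lapply(myagg[by_cols], as.character)

str(myagg)
```

Converting column by column with lapply() keeps each grouping column separate, and the group that came from missing 'by' values shows up as NA in the character columns rather than as a text label.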