The aggregate function does "almost" all that I need to summarize a dataset, except that I can't specify exclusion of NAs without a little bit of hassle.

> set.seed(143)
> m <- data.frame(A=sample(LETTERS[1:5], 20, T), B=sample(LETTERS[1:10], 20, T), C=sample(c(NA, 1:4), 20, T), D=sample(c(NA,1:4), 20, T))
> m
   A B  C  D
1  E I  1 NA
2  A C NA NA
3  D I NA  3
4  C I  2  4
5  A C  3  2
6  E J  1  2
7  D J  2  2
8  C G  4  1
9  C D NA  3
10 B G  3 NA
11 C B  4  2
12 A B NA NA
13 E A NA  4
14 B B  3  3
15 E I  4  1
16 E J  3  1
17 B J  4  4
18 B J  1  3
19 D D  4  2
20 B B  4  3

> aggregate(m[,-c(1:2)], by=list(m[,1]), sum)
  Group.1  C  D
1       A NA NA
2       B 15 NA
3       C NA 10
4       D NA  7
5       E NA NA

> aggregate(m[,-c(1:2)], by=list(m[,1]), length)
  Group.1 C D
1       A 3 3
2       B 5 5
3       C 4 4
4       D 3 3
5       E 5 5

My own versions of length and sum that exclude NA:

> mylength <- function(x) { sum(as.logical(x), na.rm=T) }
> mysum <- function(x) {sum(x, na.rm=T)}

> aggregate(m[,-c(1:2)], by=list(m[,1]), mysum)   # this computes correctly
  Group.1  C  D
1       A  3  2
2       B 15 13
3       C 10 10
4       D  6  7
5       E  9  8

> aggregate(m[,-c(1:2)], by=list(m[,1]), mylength)   # this computes correctly
  Group.1 C D
1       A 1 1
2       B 5 4
3       C 3 4
4       D 2 3
5       E 4 4

There are other statistics I need to compute, e.g. var and sd, and it is a hassle to create customized versions to exclude NA. Are there any alternative approaches?

Try

aggregate(m[, -(1:2)], m[1], sum, na.rm = TRUE)
aggregate(!is.na(m[, -(1:2)]), m[1], sum, na.rm = TRUE)

# or (this uses row names rather than a column for the group):
rowsum(m[, -(1:2)], m[,1], na.rm = TRUE)
rowsum(0+!is.na(m[, -(1:2)]), m[,1], na.rm = TRUE)

On Sun, Dec 7, 2008 at 7:06 AM, Daren Tan <daren76 at hotmail.com> wrote:
> There are other statistics I need to compute e.g. var, sd, and it is a hassle to create customized versions to exclude NA. Any alternative approaches ?
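
The first suggestion in the reply above relies on aggregate forwarding any arguments given after the summary function to that function. The following is a minimal sketch of the same idea applied to sd and var, which the original question asks about. The data frame `d` and the result names are made up for illustration and are not the poster's `m`.

```r
# Small illustrative data frame (hypothetical values, not the poster's m)
d <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                x = c(1, NA, 3, 2, 4, NA),
                y = c(NA, 2, 4, 1, 1, 5))

# Arguments after FUN (here na.rm = TRUE) are passed on to FUN
# for every group and every column being summarized
grp_sum <- aggregate(d[, c("x", "y")], by = d["g"], FUN = sum, na.rm = TRUE)
grp_sd  <- aggregate(d[, c("x", "y")], by = d["g"], FUN = sd,  na.rm = TRUE)
grp_var <- aggregate(d[, c("x", "y")], by = d["g"], FUN = var, na.rm = TRUE)

print(grp_sum)
print(grp_sd)
print(grp_var)
```

This only helps for functions that themselves accept an na.rm argument, such as sum, mean, sd and var. A function like length has no such argument, so it needs a different approach.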

This should work. I updated a personal modification of aggregate that I made to facilitate return of multiple values if necessary:

# modified aggregate command, implementing na.rm for all functions and
# allowing for multiple and/or named return values
agg = function(z, Ind, FUN, na.rm = FALSE, ...) {
  if (na.rm) {
    # drop the NA observations from every grouping vector, then from z
    for (i in 1:length(Ind)) {
      Ind[[i]] = Ind[[i]][!is.na(z)]
    }
    z = z[!is.na(z)]
  }
  FUN.out = by(z, Ind, FUN, ...)
  num.cells = length(FUN.out)
  num.values = length(FUN.out[[1]])
  # build one row per combination of grouping levels
  Ind.levels = list()
  for (i in 1:length(Ind)) {
    Ind.levels[[i]] = levels(factor(Ind[[i]]))
  }
  temp = expand.grid(Ind.levels)
  if (is.character(names(Ind))) {
    names(temp) = names(Ind)
  } else {
    names(temp) = paste('Var', 1:length(Ind), sep = '')
  }
  # add one column per value returned by FUN
  for (i in 1:num.values) {
    temp$new = NA
    n = names(FUN.out[[1]])[i]
    names(temp)[length(temp)] = ifelse(!is.null(n), n, ifelse(i == 1, 'x', paste('x', i, sep = '')))
    for (j in 1:num.cells) {
      temp[j, length(temp)] = FUN.out[[j]][i]
    }
  }
  return(temp)
}

# create some data
z = rnorm(100)
A = rep(1:2, each = 25, 2)
B = rep(1:2, each = 50)
Ind = list(A = A, B = B)

aggregate(z, Ind, mean)
agg(z, Ind, mean)          # should be identical to aggregate

aggregate(z, Ind, summary) # returns an error
agg(z, Ind, summary)       # returns named columns

# Make a function that returns multiple unnamed values
summary2 = function(x) {
  s = summary(x)
  names(s) = NULL
  return(s)
}
agg(z, Ind, summary2)      # returns multiple columns, default names

# demonstrate implementation of na.rm
z[1] = NA
z[100] = NA
agg(z, Ind, sum)                # returns NA for some cells
agg(z, Ind, sum, na.rm = TRUE)  # removes NAs before calculating sum

-- 
Mike Lawrence
Graduate Student
Department of Psychology
Dalhousie University
thatmike.com

> There are other statistics I need to compute e.g. var, sd, and it is a hassle to create customized versions to exclude NA. Any alternative approaches ?

How about writing a function to do the customisation for you?

na.rm <- function(f) {
  function(x, ...) f(x[!is.na(x)], ...)
}

aggregate(m[,-c(1:2)], by=list(m[,1]), na.rm(sum))
aggregate(m[,-c(1:2)], by=list(m[,1]), na.rm(length))

Hadley

-- 
had.co.nz
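
The wrapper in the message above takes a function and returns a new function that drops NAs before calling the original. A small sketch of how that might be used with functions other than sum follows. The data frame `d` and the result names `n_obs` and `med` are assumptions made for illustration.

```r
# Same wrapper as in the message above: given a function f, return a new
# function that removes NAs from x and then calls f on what is left
na.rm <- function(f) {
  function(x, ...) f(x[!is.na(x)], ...)
}

# Small illustrative data frame (hypothetical values)
d <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                x = c(1, NA, 3, 2, 4, NA))

# Count of non-missing values per group; length() has no na.rm argument
n_obs <- aggregate(d["x"], by = d["g"], FUN = na.rm(length))

# Median of the non-missing values per group
med <- aggregate(d["x"], by = d["g"], FUN = na.rm(median))

print(n_obs)
print(med)
```

One reason to prefer the wrapper is that it works uniformly, including for functions such as length that do not accept an na.rm argument of their own.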

How to use the na.rm function outside aggregate ? I tried

na.rm <- function(f) {
  function(x, ...) f(x[!is.na(x)], ...)
}

> na.rm(sum(c(NA,1,2)))
function(x, ...) f(x[!is.na(x)], ...)

> na.rm(sum, c(NA,1,2))
Error in na.rm(sum, c(NA, 1, 2)) : unused argument(s) (c(NA, 1, 2))

> Date: Sun, 7 Dec 2008 07:45:14 -0600
> From: h.wickham at gmail.com
> To: daren76 at hotmail.com
> Subject: Re: [R] How to force aggregate to exclude NA ?
> CC: r-help at stat.math.ethz.ch
>
> How about writing a function to do the customisation for you?

On Sun, Dec 7, 2008 at 10:10 AM, Daren Tan <daren76 at hotmail.com> wrote:
> How to use the na.rm function outside aggregate ? I tried
>
> na.rm <- function(f) {
>   function(x, ...) f(x[!is.na(x)], ...)
> }
>
>> na.rm(sum(c(NA,1,2)))
> function(x, ...) f(x[!is.na(x)], ...)
>
>> na.rm(sum, c(NA,1,2))
> Error in na.rm(sum, c(NA, 1, 2)) : unused argument(s) (c(NA, 1, 2))

na.rm(sum)(c(NA, 1, 2))

Hadley

-- 
had.co.nz
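
To spell out the one-line answer: na.rm accepts a single argument, a function, and returns another function. That is why passing it a computed value just printed the returned function, and passing it two arguments raised an error. A short sketch of the two equivalent ways to call it follows. The names `total`, `total2` and `sum_no_na` are illustrative only.

```r
# Same wrapper as defined earlier in the thread
na.rm <- function(f) {
  function(x, ...) f(x[!is.na(x)], ...)
}

# na.rm(sum) is itself a function; the second pair of parentheses calls it
total <- na.rm(sum)(c(NA, 1, 2))

# Equivalent two-step form: store the wrapped function, then call it
sum_no_na <- na.rm(sum)
total2 <- sum_no_na(c(NA, 1, 2))

print(total)
print(total2)
```

Storing the wrapped function under its own name can be convenient when the same NA-excluding version is needed in several places.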