Cézar Freitas
2008-Oct-01 12:21 UTC
[R] "tapply versus by" in function with more than 1 arguments
Hi. I searched the list and did not find anything similar to this. I have simplified my example as follows:

# I need to calculate the correlation (for example) between 2 columns, grouped by a third one in a data.frame, as below:

# number of rows
nr = 10

# the third column is there to emphasize that I need the correlation on two variables only
dataf = as.data.frame(matrix(c(rnorm(nr),rnorm(nr)*2,runif(nr),sort(c(1,1,2,2,3,3,sample(1:3,nr-6,replace=TRUE)))),ncol=4))
names(dataf)[4] = "class"

#> dataf
#            V1         V2         V3 class
#1   0.56933020  1.2529931 0.30774422     1
#2   0.41702299 -1.6441547 0.76140046     1
#3  -1.07671647 -4.8747575 0.43706944     1
#4  -1.97701167  1.3015196 0.04390175     2
#5   0.56501325  1.8597720 0.08174124     2
#6   0.70068638  1.7922641 0.74730126     2
#7  -1.39956177 -1.9918904 0.64521918     3
#8   0.27086664  0.3745362 0.61026133     3
#9   0.04282347  3.7360407 0.48696109     3
#10 -0.34262654  0.7933674 0.09824913     3

# I tried:

tapply(dataf$V1, dataf$class, cor, dataf$V2)
#Error in FUN(X[[1L]], ...) : incompatible dimensions

tapply(dataf$V1, dataf$class, cor, tapply(dataf$V2, dataf$class))
#Error in FUN(X[[1L]], ...) : incompatible dimensions

# But using "by" I obtain:

by(dataf[,c("V1","V2")], dataf$class, cor)

#dataf$class: 1
#        V1      V2
#V1 1.00000 0.91777
#V2 0.91777 1.00000
#------------------------------------------------------------
#dataf$class: 2
#         V1       V2
#V1 1.000000 0.987857
#V2 0.987857 1.000000
#------------------------------------------------------------
#dataf$class: 3
#          V1        V2
#V1 1.0000000 0.7318938
#V2 0.7318938 1.0000000

# I am interested in the [1,2] element of each correlation matrix, i.e. cor(V1, V2), so I can take 0.91777, 0.987857 and 0.7318938 from this output, but I think tapply could work better if I can solve the problem.

Thanks,
Cezar
Henrique Dallazuanna
2008-Oct-01 15:59 UTC
[R] "tapply versus by" in function with more than 1 arguments
Try this:

sapply(by(dataf[,c("V1","V2")], dataf$class, cor), '[', 3)

-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
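A minimal sketch of why element 3 is the one wanted, assuming the ten rows printed in the question are typed in by hand (V3 is left out because it plays no part). The names `mats`, `r_linear` and `r_named` are chosen here for illustration only.

```r
# The ten rows printed in the question, entered by hand; V3 is omitted.
dataf <- data.frame(
  V1 = c(0.56933020, 0.41702299, -1.07671647, -1.97701167, 0.56501325,
         0.70068638, -1.39956177, 0.27086664, 0.04282347, -0.34262654),
  V2 = c(1.2529931, -1.6441547, -4.8747575, 1.3015196, 1.8597720,
         1.7922641, -1.9918904, 0.3745362, 3.7360407, 0.7933674),
  class = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# One 2 x 2 correlation matrix per class.
mats <- by(dataf[, c("V1", "V2")], dataf$class, cor)

# Linear index 3 of a 2 x 2 matrix is row 1, column 2 (column-major order),
# which is the off-diagonal correlation between V1 and V2.
r_linear <- sapply(mats, "[", 3)

# The same element picked out by row and column name, which may read more clearly.
r_named <- sapply(mats, function(m) m["V1", "V2"])

print(round(r_named, 5))
```

Because a correlation matrix is symmetric, index 2 would give the same value as index 3.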
Gabor Grothendieck
2008-Oct-01 16:06 UTC
[R] "tapply versus by" in function with more than 1 arguments
The first tapply in your question subsets V1 but not V2, so the two vectors have different lengths. To subset both, tapply over the row names and do the subsetting inside the function:

tapply(rownames(dataf), dataf$class, function(r) cor(dataf[r, "V1"], dataf[r, "V2"]))

or

tapply(rownames(dataf), dataf$class, function(r) with(dataf[r, ], cor(V1, V2)))
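The length mismatch described above can be checked directly. This is a sketch under the assumption that the ten rows printed in the question are typed in by hand (V3 omitted); it groups integer row indices rather than row names, a small variation on the approach above, and the names `group_sizes`, `failed` and `r_by_class` are illustrative.

```r
# The ten rows printed in the question, entered by hand; V3 is omitted.
dataf <- data.frame(
  V1 = c(0.56933020, 0.41702299, -1.07671647, -1.97701167, 0.56501325,
         0.70068638, -1.39956177, 0.27086664, 0.04282347, -0.34262654),
  V2 = c(1.2529931, -1.6441547, -4.8747575, 1.3015196, 1.8597720,
         1.7922641, -1.9918904, 0.3745362, 3.7360407, 0.7933674),
  class = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# FUN sees one group's V1 at a time (3, 3 and 4 values here), but extra
# arguments are passed through whole, so cor() also receives all 10 values of V2.
group_sizes <- tapply(dataf$V1, dataf$class, length)
full_length <- length(dataf$V2)

# The original call therefore stops with an error; try() records that.
failed <- inherits(
  try(tapply(dataf$V1, dataf$class, cor, dataf$V2), silent = TRUE),
  "try-error"
)

# Grouping the row indices lets both columns be subset together.
r_by_class <- tapply(seq_len(nrow(dataf)), dataf$class,
                     function(i) cor(dataf$V1[i], dataf$V2[i]))

print(r_by_class)
```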
hadley wickham
2008-Oct-01 16:27 UTC
[R] "tapply versus by" in function with more than 1 arguments
You might want to have a look at the plyr package:

install.packages("plyr")
library(plyr)

# You can easily control the output data type:
# d = data.frame, a = array, l = list
ddply(dataf, .(class), function(df) data.frame(cor(df[, 1:2])))
daply(dataf, .(class), function(df) cor(df[, 1:2]))
dlply(dataf, .(class), function(df) cor(df[, 1:2]))

# Or for the minimal value you want
ddply(dataf, .(class), function(df) cor(df$V1, df$V2))

# Note that plyr preserves labels so it's easier to match up with the original data
# Learn more at http://had.co.nz/plyr

Hadley

-- 
http://had.co.nz/
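For readers without plyr installed, a rough base R sketch of a similar labelled result, one row per class, follows. It assumes the ten rows printed in the question are typed in by hand (V3 omitted); `res` and its column name `r` are names chosen here, not anything plyr produces. The plyr call runs only if the package is available.

```r
# The ten rows printed in the question, entered by hand; V3 is omitted.
dataf <- data.frame(
  V1 = c(0.56933020, 0.41702299, -1.07671647, -1.97701167, 0.56501325,
         0.70068638, -1.39956177, 0.27086664, 0.04282347, -0.34262654),
  V2 = c(1.2529931, -1.6441547, -4.8747575, 1.3015196, 1.8597720,
         1.7922641, -1.9918904, 0.3745362, 3.7360407, 0.7933674),
  class = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# Base R: split into one data frame per class, compute one correlation each,
# and keep the class label next to its value.
pieces <- split(dataf, dataf$class)
res <- data.frame(
  class = as.numeric(names(pieces)),
  r = unname(vapply(pieces, function(d) cor(d$V1, d$V2), numeric(1)))
)
print(res)

# The plyr version from the post, run only when plyr is installed.
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::ddply(dataf, "class", function(df) cor(df$V1, df$V2)))
}
```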