thr3ads.net - R help - [R] "tapply versus by" in function with more than 1 arguments [Oct 2008]

Cézar Freitas

2008-Oct-01 12:21 UTC

[R] "tapply versus by" in function with more than 1 arguments

Hi. I searched the list and didn't find anything similar to this. I
simplified my example as shown below:

#I need to calculate the correlation (for example) between 2 columns of a
#data.frame, grouped by a third one, as below:

#number of rows
nr = 10

#the third column (V3) is there to stress that I need the correlation of only two variables
dataf = as.data.frame(matrix(c(rnorm(nr), rnorm(nr)*2, runif(nr),
                               sort(c(1,1,2,2,3,3,sample(1:3,nr-6,replace=TRUE)))),
                             ncol=4))
names(dataf)[4] = "class"

#> dataf
#            V1             V2                V3                 class
#1   0.56933020      1.2529931     0.30774422     1
#2   0.41702299     -1.6441547     0.76140046     1
#3  -1.07671647     -4.8747575     0.43706944     1
#4  -1.97701167      1.3015196     0.04390175     2
#5   0.56501325      1.8597720     0.08174124     2
#6   0.70068638      1.7922641     0.74730126     2
#7  -1.39956177     -1.9918904     0.64521918     3
#8   0.27086664      0.3745362     0.61026133     3
#9   0.04282347      3.7360407     0.48696109     3
#10 -0.34262654      0.7933674    0.09824913     3

#I tried:

tapply(dataf$V1, dataf$class, cor, dataf$V2)
#Error FUN(X[[1L]], ...) : incompatible dimensions

tapply(dataf$V1, dataf$class, cor, tapply(dataf$V2, dataf$class))
#Error FUN(X[[1L]], ...) : incompatible dimensions

#But using "by" I obtain:

by(dataf[,c("V1","V2")], dataf$class, cor)

#dataf$class: 1
#        V1      V2
#V1 1.00000 0.91777
#V2 0.91777 1.00000
#--------------------------------------------------------------------------------------------------
#dataf$class: 2
#         V1       V2
#V1 1.000000 0.987857
#V2 0.987857 1.000000
#--------------------------------------------------------------------------------------------------
#dataf$class: 3
#          V1        V2
#V1 1.0000000 0.7318938
#V2 0.7318938 1.0000000

#My interest is in cor(V1,V2)[1,2], so I can take 0.91777, 0.987857 and
#0.7318938, but I think tapply could work better, if I can solve the problem.

Thanks,
Cezar



Henrique Dallazuanna

2008-Oct-01 15:59 UTC


Try this:

sapply(by(dataf[,c("V1","V2")], dataf$class, cor),
'[', 3)
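
The index 3 works because each element returned by by() is a 2x2 matrix stored column by column, so its third element is the [1, 2] entry. A slightly more explicit variant, sketched here on a small made-up data frame (the name toy and its values are illustrative assumptions, not data from the original post), might look like this:

```r
# Illustrative data only; 'toy' is a hypothetical stand-in for dataf.
toy <- data.frame(
  V1    = c(1, 2, 3, 1, 2, 4, 2, 5, 3),
  V2    = c(2, 4, 6, 3, 2, 1, 1, 4, 9),
  class = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# One 2x2 correlation matrix per class, as in the original by() call.
cors <- by(toy[, c("V1", "V2")], toy$class, cor)

# Pull the off-diagonal entry by name instead of by linear index.
by_name  <- sapply(cors, function(m) m["V1", "V2"])

# The linear-index form from the reply, for comparison.
by_index <- sapply(cors, "[", 3)
```

Indexing by name is a bit longer but does not depend on knowing how the matrix is laid out in memory.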





-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

Gabor Grothendieck

2008-Oct-01 16:06 UTC


The first tapply in your question subsets V1 but not V2, so they are
of different lengths.  To subset both, tapply over the row names and
perform the subsetting in the function:

tapply(rownames(dataf), dataf$class, function(r) cor(dataf[r, "V1"],
dataf[r, "V2"]))

or

tapply(rownames(dataf), dataf$class, function(r) with(dataf[r, ], cor(V1, V2)))
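
To make the length mismatch concrete: tapply splits only its first argument by group, while any extra arguments are handed to the function whole. The sketch below shows the two lengths and then one possible base R alternative that splits both columns and pairs the pieces with mapply. The data frame toy and its values are made up for illustration, and the split/mapply route is an option offered here, not something proposed in the thread.

```r
# Illustrative data only; 'toy' is a hypothetical stand-in for dataf.
toy <- data.frame(
  V1    = c(1, 2, 3, 1, 2, 4, 2, 5, 3),
  V2    = c(2, 4, 6, 3, 2, 1, 1, 4, 9),
  class = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# In tapply(toy$V1, toy$class, cor, toy$V2), cor() would receive a
# group-sized piece of V1 but the full V2 column.
lens <- c(x = length(split(toy$V1, toy$class)[[1]]), y = length(toy$V2))

# Split both columns by class and correlate the matching pieces.
r_split <- mapply(cor, split(toy$V1, toy$class), split(toy$V2, toy$class))

# The row-name approach from this reply, applied to the same data.
r_rows <- tapply(rownames(toy), toy$class,
                 function(r) cor(toy[r, "V1"], toy[r, "V2"]))
```

Both r_split and r_rows hold one correlation per class, named by the class value.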



hadley wickham

2008-Oct-01 16:27 UTC

You might want to have a look at the plyr package:

install.packages("plyr")
library(plyr)

# You can easily control the output data type:
#   d = data.frame, a = array, l = list
ddply(dataf, .(class), function(df) data.frame(cor(df[, 1:2])))
daply(dataf, .(class), function(df) cor(df[, 1:2]))
dlply(dataf, .(class), function(df) cor(df[, 1:2]))

# Or for the minimal value you want
ddply(dataf, .(class), function(df) cor(df$V1, df$V2))

# Note that plyr preserves labels, so it's easier to match the results up
# with the original data.
# Learn more at http://had.co.nz/plyr
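
For readers who cannot install plyr, a labeled data frame with one correlation per class can also be put together in base R. This is a rough sketch rather than a substitute for ddply; the data frame toy, its values, and the column name r are illustrative assumptions.

```r
# Illustrative data only; 'toy' is a hypothetical stand-in for dataf.
toy <- data.frame(
  V1    = c(1, 2, 3, 1, 2, 4, 2, 5, 3),
  V2    = c(2, 4, 6, 3, 2, 1, 1, 4, 9),
  class = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# One small data frame per class: the class label plus cor(V1, V2).
pieces <- lapply(split(toy, toy$class), function(df) {
  data.frame(class = df$class[1], r = cor(df$V1, df$V2))
})

# Stack the per-class rows into a single data frame.
result <- do.call(rbind, pieces)
```

Keeping the class column in each piece is what preserves the label alongside the value after the rows are stacked.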

Hadley

-- 
http://had.co.nz/
