suppose I have two sets of vectors: x1,x2,...,xN and y1,y2,...,yN. I want N correlations: cor(x1,y1), cor(x2,y2), ..., cor(xN,yN). my sets of vectors are arranged as data frames x & y (vector=column): x <- data.frame(a=rnorm(10),b=rnorm(10),c=rnorm(10)) y <- data.frame(d=rnorm(10),e=rnorm(10),f=rnorm(10)) cor(x,y) returns a _matrix_ of all pairwise correlations: cor(x,y) d e f a 0.2763696 -0.3523757 -0.373518870 b 0.5892742 -0.1969161 -0.007159589 c 0.3094301 0.1111997 -0.094970748 which is _not_ what I want. I want diag(cor(x,y)) but without the N^2 calculations. thanks. -- Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 http://www.childpsy.net/ http://iris.org.il http://americancensorship.org http://dhimmi.com http://www.PetitionOnline.com/tap12009/ http://jihadwatch.org Never argue with an idiot: he has more experience with idiotic arguments.
sapply(1:NCOL(x), function(n) cor(x[n], y[n])) is a quick and dirty way, though probably not optimal. Michael On Thu, Feb 23, 2012 at 5:10 PM, Sam Steingold <sds at gnu.org> wrote:> suppose I have two sets of vectors: x1,x2,...,xN and y1,y2,...,yN. > I want N correlations: cor(x1,y1), cor(x2,y2), ..., cor(xN,yN). > my sets of vectors are arranged as data frames x & y (vector=column): > > ?x <- data.frame(a=rnorm(10),b=rnorm(10),c=rnorm(10)) > ?y <- data.frame(d=rnorm(10),e=rnorm(10),f=rnorm(10)) > > cor(x,y) returns a _matrix_ of all pairwise correlations: > > ?cor(x,y) > ? ? ? ? ?d ? ? ? ? ?e ? ? ? ? ? ?f > a 0.2763696 -0.3523757 -0.373518870 > b 0.5892742 -0.1969161 -0.007159589 > c 0.3094301 ?0.1111997 -0.094970748 > > which is _not_ what I want. > > I want diag(cor(x,y)) but without the N^2 calculations. > > thanks. > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 > http://www.childpsy.net/ http://iris.org.il http://americancensorship.org > http://dhimmi.com http://www.PetitionOnline.com/tap12009/ http://jihadwatch.org > Never argue with an idiot: he has more experience with idiotic arguments. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Use 1:n as an index. e.g. sapply(1:n, function(i) cor(x[,i],y[,i])) -- Bert On Thu, Feb 23, 2012 at 2:10 PM, Sam Steingold <sds at gnu.org> wrote:> suppose I have two sets of vectors: x1,x2,...,xN and y1,y2,...,yN. > I want N correlations: cor(x1,y1), cor(x2,y2), ..., cor(xN,yN). > my sets of vectors are arranged as data frames x & y (vector=column): > > ?x <- data.frame(a=rnorm(10),b=rnorm(10),c=rnorm(10)) > ?y <- data.frame(d=rnorm(10),e=rnorm(10),f=rnorm(10)) > > cor(x,y) returns a _matrix_ of all pairwise correlations: > > ?cor(x,y) > ? ? ? ? ?d ? ? ? ? ?e ? ? ? ? ? ?f > a 0.2763696 -0.3523757 -0.373518870 > b 0.5892742 -0.1969161 -0.007159589 > c 0.3094301 ?0.1111997 -0.094970748 > > which is _not_ what I want. > > I want diag(cor(x,y)) but without the N^2 calculations. > > thanks. > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 > http://www.childpsy.net/ http://iris.org.il http://americancensorship.org > http://dhimmi.com http://www.PetitionOnline.com/tap12009/ http://jihadwatch.org > Never argue with an idiot: he has more experience with idiotic arguments. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
On Thu, Feb 23, 2012 at 3:24 PM, Bert Gunter <gunter.berton at gene.com> wrote:> Use 1:n as an index. > > e.g. > sapply(1:n, function(i) cor(x[,i],y[,i]))## sapply is a good solution (the only one I could think of too), but not always worth it: # for 100 x 1000 x <- data.frame(matrix(rnorm(100000),nc=1000)) y <- data.frame(matrix(rnorm(100000),nc=1000)) system.time(diag(cor(x,y))) # user system elapsed # 0.592 0.008 0.623 system.time(sapply(1:1000,function(i) cor(x[,i],y[,i]))) # user system elapsed # 0.384 0.000 0.412 # Great. but for 10 x 1000 x <- data.frame(matrix(rnorm(10000),nc=1000)) y <- data.frame(matrix(rnorm(10000),nc=1000)) system.time(diag(cor(x,y))) # user system elapsed # 0.256 0.008 0.279 system.time(sapply(1:1000,function(i) cor(x[,i],y[,i]))) # user system elapsed # 0.376 0.000 0.388 # or 100 x 100 system.time(diag(cor(x,y))) # user system elapsed # 0.016 0.000 0.014 system.time(sapply(1:100,function(i) cor(x[,i],y[,i]))) # user system elapsed # 0.036 0.000 0.036 # Not so great. Bottom line, as always, it depends. Cheers Elai> > -- Bert > > > > On Thu, Feb 23, 2012 at 2:10 PM, Sam Steingold <sds at gnu.org> wrote: >> suppose I have two sets of vectors: x1,x2,...,xN and y1,y2,...,yN. >> I want N correlations: cor(x1,y1), cor(x2,y2), ..., cor(xN,yN). >> my sets of vectors are arranged as data frames x & y (vector=column): >> >> ?x <- data.frame(a=rnorm(10),b=rnorm(10),c=rnorm(10)) >> ?y <- data.frame(d=rnorm(10),e=rnorm(10),f=rnorm(10)) >> >> cor(x,y) returns a _matrix_ of all pairwise correlations: >> >> ?cor(x,y) >> ? ? ? ? ?d ? ? ? ? ?e ? ? ? ? ? ?f >> a 0.2763696 -0.3523757 -0.373518870 >> b 0.5892742 -0.1969161 -0.007159589 >> c 0.3094301 ?0.1111997 -0.094970748 >> >> which is _not_ what I want. >> >> I want diag(cor(x,y)) but without the N^2 calculations. >> >> thanks. >> >> -- >> Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 >> http://www.childpsy.net/ http://iris.org.il http://americancensorship.org >> http://dhimmi.com http://www.PetitionOnline.com/tap12009/ http://jihadwatch.org >> Never argue with an idiot: he has more experience with idiotic arguments. >> >> ______________________________________________ >> R-help at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > > > -- > > Bert Gunter > Genentech Nonclinical Biostatistics > > Internal Contact Info: > Phone: 467-7374 > Website: > http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Elai: Thank you.You make an excellent point. cor() is implemented at the C level (via a .internal call) whereas sapply implements an interpreted loop that has to issue the call each time through the loop (with some shortcuts/tricks to reduce overhead). So the operations count of the original poster is completely bogus. As you say, "it depends..." . For this reason, it is generally a bad idea to waste much time on code efficiency unless you really need to, which these days is not often (and there are certainly arenas where this statement is false). More important is to focus on code clarity, flexibility, debuggabuility, etc. Best, Bert On Thu, Feb 23, 2012 at 2:52 PM, ilai <keren at math.montana.edu> wrote:> On Thu, Feb 23, 2012 at 3:24 PM, Bert Gunter <gunter.berton at gene.com> wrote: >> Use 1:n as an index. >> >> e.g. >> sapply(1:n, function(i) cor(x[,i],y[,i])) > > ## sapply is a good solution (the only one I could think of too), but > not always worth it: > > # for 100 x 1000 > ?x <- data.frame(matrix(rnorm(100000),nc=1000)) > ?y <- data.frame(matrix(rnorm(100000),nc=1000)) > ?system.time(diag(cor(x,y))) > # ? user ?system elapsed > # ?0.592 ? 0.008 ? 0.623 > system.time(sapply(1:1000,function(i) cor(x[,i],y[,i]))) > # ? user ?system elapsed > # ?0.384 ? 0.000 ? 0.412 > > # Great. but for 10 x 1000 > x <- data.frame(matrix(rnorm(10000),nc=1000)) > y <- data.frame(matrix(rnorm(10000),nc=1000)) > system.time(diag(cor(x,y))) > # ? user ?system elapsed > # ?0.256 ? 0.008 ? 0.279 > system.time(sapply(1:1000,function(i) cor(x[,i],y[,i]))) > # ? user ?system elapsed > # ?0.376 ? 0.000 ? 0.388 > > # or 100 x 100 > ?system.time(diag(cor(x,y))) > # ? user ?system elapsed > # ?0.016 ? 0.000 ? 0.014 > ?system.time(sapply(1:100,function(i) cor(x[,i],y[,i]))) > # ? user ?system elapsed > # ?0.036 ? 0.000 ? 0.036 > > # Not so great. > > Bottom line, as always, it depends. > > Cheers > Elai > > > > >> >> -- Bert >> >> >> >> On Thu, Feb 23, 2012 at 2:10 PM, Sam Steingold <sds at gnu.org> wrote: >>> suppose I have two sets of vectors: x1,x2,...,xN and y1,y2,...,yN. >>> I want N correlations: cor(x1,y1), cor(x2,y2), ..., cor(xN,yN). >>> my sets of vectors are arranged as data frames x & y (vector=column): >>> >>> ?x <- data.frame(a=rnorm(10),b=rnorm(10),c=rnorm(10)) >>> ?y <- data.frame(d=rnorm(10),e=rnorm(10),f=rnorm(10)) >>> >>> cor(x,y) returns a _matrix_ of all pairwise correlations: >>> >>> ?cor(x,y) >>> ? ? ? ? ?d ? ? ? ? ?e ? ? ? ? ? ?f >>> a 0.2763696 -0.3523757 -0.373518870 >>> b 0.5892742 -0.1969161 -0.007159589 >>> c 0.3094301 ?0.1111997 -0.094970748 >>> >>> which is _not_ what I want. >>> >>> I want diag(cor(x,y)) but without the N^2 calculations. >>> >>> thanks. >>> >>> -- >>> Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 >>> http://www.childpsy.net/ http://iris.org.il http://americancensorship.org >>> http://dhimmi.com http://www.PetitionOnline.com/tap12009/ http://jihadwatch.org >>> Never argue with an idiot: he has more experience with idiotic arguments. >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >> >> >> >> -- >> >> Bert Gunter >> Genentech Nonclinical Biostatistics >> >> Internal Contact Info: >> Phone: 467-7374 >> Website: >> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm >> >> ______________________________________________ >> R-help at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code.-- Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
> ## sapply is a good solution (the only one I could think of too), but > not always worth it:Also look at vapply(). It is like sapply() but you have to tell it what type and size of output you expect FUN to return. E.g., change sapply(1:n, FUN=function(i) cor(x[,i],y[,i])) to vapply(1:n, FUN=function(i) cor(x[,i],y[,i]), FUN.VALUE=0.0) It can save time and space in formatting the outputs of FUN because it knows what it will be ahead of time, but its biggest advantage is that it makes it clear (to the code writer and readers) what it will return. It throws an error if FUN does not return what you say it will. Also, vapply works nicely for zero-length inputs. E.g., vapply(list(), range, FUN.VALUE=numeric(2)) returns a 2 by 0 matrix while sapply(list(), range) returns list() and you have to add if(length(x)==0) statements to your code to make it work in the edge cases, obscuring the basic algorithm. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of ilai > Sent: Thursday, February 23, 2012 2:52 PM > To: Bert Gunter > Cc: r-help at r-project.org; sds at gnu.org > Subject: Re: [R] cor() on sets of vectors > > On Thu, Feb 23, 2012 at 3:24 PM, Bert Gunter <gunter.berton at gene.com> wrote: > > Use 1:n as an index. > > > > e.g. > > sapply(1:n, function(i) cor(x[,i],y[,i])) > > ## sapply is a good solution (the only one I could think of too), but > not always worth it: > > # for 100 x 1000 > x <- data.frame(matrix(rnorm(100000),nc=1000)) > y <- data.frame(matrix(rnorm(100000),nc=1000)) > system.time(diag(cor(x,y))) > # user system elapsed > # 0.592 0.008 0.623 > system.time(sapply(1:1000,function(i) cor(x[,i],y[,i]))) > # user system elapsed > # 0.384 0.000 0.412 > > # Great. but for 10 x 1000 > x <- data.frame(matrix(rnorm(10000),nc=1000)) > y <- data.frame(matrix(rnorm(10000),nc=1000)) > system.time(diag(cor(x,y))) > # user system elapsed > # 0.256 0.008 0.279 > system.time(sapply(1:1000,function(i) cor(x[,i],y[,i]))) > # user system elapsed > # 0.376 0.000 0.388 > > # or 100 x 100 > system.time(diag(cor(x,y))) > # user system elapsed > # 0.016 0.000 0.014 > system.time(sapply(1:100,function(i) cor(x[,i],y[,i]))) > # user system elapsed > # 0.036 0.000 0.036 > > # Not so great. > > Bottom line, as always, it depends. > > Cheers > Elai > > > > > > > > -- Bert > > > > > > > > On Thu, Feb 23, 2012 at 2:10 PM, Sam Steingold <sds at gnu.org> wrote: > >> suppose I have two sets of vectors: x1,x2,...,xN and y1,y2,...,yN. > >> I want N correlations: cor(x1,y1), cor(x2,y2), ..., cor(xN,yN). > >> my sets of vectors are arranged as data frames x & y (vector=column): > >> > >> ?x <- data.frame(a=rnorm(10),b=rnorm(10),c=rnorm(10)) > >> ?y <- data.frame(d=rnorm(10),e=rnorm(10),f=rnorm(10)) > >> > >> cor(x,y) returns a _matrix_ of all pairwise correlations: > >> > >> ?cor(x,y) > >> ? ? ? ? ?d ? ? ? ? ?e ? ? ? ? ? ?f > >> a 0.2763696 -0.3523757 -0.373518870 > >> b 0.5892742 -0.1969161 -0.007159589 > >> c 0.3094301 ?0.1111997 -0.094970748 > >> > >> which is _not_ what I want. > >> > >> I want diag(cor(x,y)) but without the N^2 calculations. > >> > >> thanks. > >> > >> -- > >> Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 > >> http://www.childpsy.net/ http://iris.org.il http://americancensorship.org > >> http://dhimmi.com http://www.PetitionOnline.com/tap12009/ http://jihadwatch.org > >> Never argue with an idiot: he has more experience with idiotic arguments. > >> > >> ______________________________________________ > >> R-help at r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > >> and provide commented, minimal, self-contained, reproducible code. > > > > > > > > -- > > > > Bert Gunter > > Genentech Nonclinical Biostatistics > > > > Internal Contact Info: > > Phone: 467-7374 > > Website: > > http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb- > home.htm > > > > ______________________________________________ > > R-help at r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.