Dear R users,

I need to come up with an efficient method to compute the correlation (or at least the Euclidean distance, if that is easier) between specific rows of a data frame (46,232 rows, 29 columns). The pairs of rows between which I want to find the correlation share a common value in one of the columns. For example, in the following

x = data.frame(id = rep(sample(1:100000, size = 10000), 2),
               a = sample(c(NA, rnorm(10, 0, 1)), size = 10000, replace = T),
               b = sample(c(NA, rnorm(10, 0, 1)), size = 10000, replace = T),
               c = sample(c(NA, rnorm(10, 0, 1)), size = 10000, replace = T))
x$id = factor(x$id)

I would want to compute the correlation between the two rows (for columns a, b, c) that share the same id. Using a for loop and dist() works but takes a long time (> 1 hour; my RAM is 1 Gb):

p = NULL
for (i in levels(x$id)) { p[[i]] = dist(x[x$id == i, -1]) }

Is there a more efficient way? I thought about apply/sapply etc., but I don't think they work across rows like this, and I can't think of an intelligent way to make them do so!

The second problem is that I also need to know how many degrees of freedom (i.e. non-missing pairs of values) were used in each correlation. Is there a way to do this efficiently as well?

I hope this makes sense! Thank you all very much in advance!

Eleni
Eleni,

One way to do this is to group the data first using 'split' and then sapply the dist function over the resulting list. The slower step will be the split, which took a couple of minutes on my laptop; the sapply should not take more than a minute or so.

size <- 10000
df <- data.frame(id = rep(sample(1:100000, size = size), 2),
                 a = sample(c(NA, rnorm(100, 0, 1)), size = 2 * size, replace = TRUE),
                 b = sample(c(NA, rnorm(100, 0, 1)), size = 2 * size, replace = TRUE),
                 c = sample(c(NA, rnorm(100, 0, 1)), size = 2 * size, replace = TRUE))
df$id <- factor(df$id)

# split the numeric columns by id (drop the id column itself so that dist()
# only sees numeric data)
dfp <- split(df[, -1], df$id)
sapply(dfp, dist)

-Christos
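If correlations rather than Euclidean distances are wanted, the same split list can be reused. The following is only a minimal sketch, assuming (as in the example) exactly two rows per id and the dfp object defined above; use = "pairwise.complete.obs" restricts each correlation to the columns where neither row is NA.

# Sketch: correlation between the two matching rows (columns a, b, c) of each id
corr_by_id <- sapply(dfp, function(d) {
  m <- as.matrix(d)                                   # 2 x 3 numeric matrix for this id
  cor(m[1, ], m[2, ], use = "pairwise.complete.obs")  # NA if too few complete pairs
})
head(corr_by_id)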
You can convert to a numeric matrix to make the operations faster:

> x = data.frame(id = rep(sample(1:100000, size = 10000), 2),
+                a = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = T),
+                b = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = T),
+                c = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = T))
> x$id = factor(x$id)
> # convert the numeric data for the 'dist' to a matrix for faster processing
> x.m <- cbind(x$a, x$b, x$c)
> # create a list of the row indices for each 'id'
> x.i <- split(seq(nrow(x)), x$id)
> system.time(x.d <- sapply(x.i, function(z) dist(x.m[z, ])))
   user  system elapsed
   1.56    0.00    1.67
> str(x.d)
 Named num [1:10000] 1.333 2.522 1.566 3.822 0.905 ...
 - attr(*, "names")= chr [1:10000] "15" "28" "37" "78" ...

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
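The second part of the question, the number of non-missing pairs of values behind each correlation, can be handled with the same objects. A minimal sketch, assuming the x.m matrix and x.i index list from the transcript above and exactly two rows per id:

# Sketch: for each id, count the columns where neither of the two rows is NA;
# this is the n that the pairwise-complete correlation for that id is based on.
x.n <- sapply(x.i, function(z) sum(!is.na(x.m[z[1], ]) & !is.na(x.m[z[2], ])))

# Matching pairwise-complete correlations for the same pairs of rows
x.c <- sapply(x.i, function(z)
  cor(x.m[z[1], ], x.m[z[2], ], use = "pairwise.complete.obs"))

x.n then gives, for each id, how many pairs of values entered the corresponding correlation in x.c.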