Dear R users,

I need to come up with an efficient method to compute the correlation (or at least the Euclidean distance, if that is easier) between specific rows of a data frame (46,232 rows, 29 columns). The pairs of rows between which I want to find the correlation share a common value in one of the columns. For example, in the following

x = data.frame(id = rep(sample(1:100000, size = 10000), 2),
               a = sample(c(NA, rnorm(10, 0, 1)), size = 10000, replace = T),
               b = sample(c(NA, rnorm(10, 0, 1)), size = 10000, replace = T),
               c = sample(c(NA, rnorm(10, 0, 1)), size = 10000, replace = T))
x$id = factor(x$id)

I would want to compute the correlation between the two rows (for columns a, b, c) that share the same id. Using a for loop and dist() works but takes a long time (> 1 hour; my RAM is 1 Gb):

p = NULL
for (i in levels(x$id)) { p[[i]] = dist(x[x$id == i, -1]) }

Is there a more efficient way? I thought about apply/sapply etc., but I don't think they work across rows like this, and I can't think of an intelligent way to make them do so!

The second problem is that I also need to know how many degrees of freedom (i.e. non-missing pairs of values) were used in each correlation. Is there a way to do this efficiently as well?

I hope this makes sense! Thank you all very much in advance!

Eleni
Eleni,

One way to do this is to group the data first using 'split' and then sapply the dist function over the resulting list. The slower step will be the split, which took a couple of minutes on my laptop; the sapply should not take more than a minute or so.

size <- 10000
df <- data.frame(id = rep(sample(1:100000, size = size), 2),
                 a = sample(c(NA, rnorm(100, 0, 1)), size = 2 * size, replace = TRUE),
                 b = sample(c(NA, rnorm(100, 0, 1)), size = 2 * size, replace = TRUE),
                 c = sample(c(NA, rnorm(100, 0, 1)), size = 2 * size, replace = TRUE))
df$id <- factor(df$id)

# split the numeric columns by id (drop the id column itself so that dist()
# only sees numeric data)
dfp <- split(df[, -1], df$id)
sapply(dfp, dist)

-Christos
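If correlations rather than Euclidean distances are wanted, the same split list can be reused. The following is only a minimal sketch, assuming (as in the example) exactly two rows per id and the dfp object defined above; use = "pairwise.complete.obs" restricts each correlation to the columns where neither row is NA.

# Sketch: correlation between the two matching rows (columns a, b, c) of each id
corr_by_id <- sapply(dfp, function(d) {
  m <- as.matrix(d)                                   # 2 x 3 numeric matrix for this id
  cor(m[1, ], m[2, ], use = "pairwise.complete.obs")  # NA if too few complete pairs
})
head(corr_by_id)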
You can convert to a numeric matrix to make the operations faster:

> x = data.frame(id = rep(sample(1:100000, size = 10000), 2),
+                a = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = T),
+                b = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = T),
+                c = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = T))
> x$id = factor(x$id)
> # convert the numeric data for the 'dist' to a matrix for faster processing
> x.m <- cbind(x$a, x$b, x$c)
> # create a list of the row indices for each 'id'
> x.i <- split(seq(nrow(x)), x$id)
> system.time(x.d <- sapply(x.i, function(z) dist(x.m[z, ])))
   user  system elapsed
   1.56    0.00    1.67
> str(x.d)
 Named num [1:10000] 1.333 2.522 1.566 3.822 0.905 ...
 - attr(*, "names")= chr [1:10000] "15" "28" "37" "78" ...

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
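The second part of the question, the number of non-missing pairs of values behind each correlation, can be handled with the same objects. A minimal sketch, assuming the x.m matrix and x.i index list from the transcript above and exactly two rows per id:

# Sketch: for each id, count the columns where neither of the two rows is NA;
# this is the n that the pairwise-complete correlation for that id is based on.
x.n <- sapply(x.i, function(z) sum(!is.na(x.m[z[1], ]) & !is.na(x.m[z[2], ])))

# Matching pairwise-complete correlations for the same pairs of rows
x.c <- sapply(x.i, function(z)
  cor(x.m[z[1], ], x.m[z[2], ], use = "pairwise.complete.obs"))

x.n then gives, for each id, how many pairs of values entered the corresponding correlation in x.c.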