thr3ads.net - R help - [R] Top N correlations from 'cor' for very large datasets being run many times [Jun 2005]

Obi Griffith

2005-Jun-10 00:56 UTC

[R] Top N correlations from 'cor' for very large datasets being run many times

I am doing an analysis that requires me to calculate correlations for a matrix
of 15,000 rows x 50 columns.  For each row I want to calculate the correlation
to all other rows and then for each row, find the n (say 10) most correlated
rows.  If I read in the 15,000 x 50 data from file and pass it to 'cor',
this function quite appropriately (and very quickly) calculates all possible row
by row comparisons and outputs a matrix of the results.  The problem is that
this matrix is exceedingly large (approx 1GB).  I want to run this analysis
thousands of times on a cluster and thus each job must be below 1GB (otherwise
I'd just do it on a large memory machine - where it works fine).  Since I am
only interested in the top n correlations for each row, I would prefer to only
store these correlations.  However, if I use a loop strategy to calculate
correlations and only keep the ones I want, it runs extremely slowly.
Calculating the correlations for one row (versus all others) actually takes as
long as all rows versus all rows using the non-loop strategy!  Two questions:

1) Does this performance difference make sense?  I expected looping to be slower
but not that much slower.
2) Is there a way that I can pass the data matrix to 'cor' but only get
back the top n correlations for each row in the output matrix?  Or, is there
another way to get correlations quickly but only store the best results?

Any help would be greatly appreciated.  Obi

# The nice R way to get all possible correlations quickly - too much memory used
n = 10  # Number of top correlations to keep per row ("say 10", as above)
file1 = read.table("test.txt", header=F, quote="", sep="\t", comment.char="", as.is=1)
file1_cor = cor(t(file1), method = "pearson", use = "pairwise.complete.obs")
diag(file1_cor) = NA  # Set correlation to self as NA
for (i in 1:15000){
  corrs = file1_cor[,i]
  # Order correlations from largest to smallest (NAs are sorted last)
  corrs_ordered = order(corrs, decreasing=TRUE)
  # Get top n correlations - these would be added to some data structure
  # and used for subsequent analysis
  top_corrs = corrs[corrs_ordered[1:n]]
}

# The not so nice way to get all possible correlations for each row and then
# store only those that I want to keep - too slow
n = 10  # Number of top correlations to keep per row ("say 10", as above)
file1 = read.table("test.txt", header=F, quote="", sep="\t", comment.char="", as.is=1)
for (i in 1:15000){
  corrs = numeric(15000)
  for (j in 1:15000){
    cor_ij = cor(as.numeric(file1[i,]), as.numeric(file1[j,]), method = "pearson", use = "pairwise.complete.obs")
    corrs[j] = cor_ij  # Store the correlation of row i with row j
  }
  corrs[i] = NA  # Set correlation to self as NA
  # Order correlations from largest to smallest (NAs are sorted last)
  corrs_ordered = order(corrs, decreasing=TRUE)
  # Get top n correlations - these would be added to some data structure
  # and used for subsequent analysis
  top_corrs = corrs[corrs_ordered[1:n]]
}
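
# A possible middle ground between the two approaches above, offered only as
# a sketch: call 'cor' on a block of rows versus all rows, keep the top n for
# each row in the block, and then discard the block before moving on.  With a
# block of 500 rows the temporary matrix is 500 x 15,000 doubles (about 60MB)
# rather than the full 15,000 x 15,000 result.  The function name top_n_cor
# and the argument block_size are made up for illustration, and the sketch
# assumes every column of the input is numeric.

```r
# Hypothetical helper: top n correlated rows for every row, computed block by block.
top_n_cor <- function(x, n = 10, block_size = 500) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x), n < nrow(x))
  tx <- t(x)                      # cor() correlates columns, so transpose once
  n_rows <- nrow(x)
  top_idx <- matrix(NA_integer_, nrow = n_rows, ncol = n)  # row numbers of the best matches
  top_val <- matrix(NA_real_, nrow = n_rows, ncol = n)     # the matching correlations
  for (start in seq(1, n_rows, by = block_size)) {
    idx <- start:min(start + block_size - 1, n_rows)
    # length(idx) x n_rows matrix: rows in this block versus all rows
    block_cor <- cor(tx[, idx, drop = FALSE], tx,
                     method = "pearson", use = "pairwise.complete.obs")
    block_cor[cbind(seq_along(idx), idx)] <- NA   # Set correlation to self as NA
    for (k in seq_along(idx)) {
      # order() sorts NAs last, so self-correlations never reach the top n
      ord <- order(block_cor[k, ], decreasing = TRUE)[1:n]
      top_idx[idx[k], ] <- ord
      top_val[idx[k], ] <- block_cor[k, ord]
    }
  }
  list(index = top_idx, cor = top_val)
}
```

# Something like res = top_n_cor(file1, n = 10) would then return two
# 15,000 x 10 matrices.  Peak memory should scale with block_size, so it could
# be lowered further if jobs still come too close to the 1GB limit.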

