Matthew Keller
2007-Mar-16 15:04 UTC
[R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values
Hi all,
[this is a bit hard to describe, so if my initial description is
confusing, please try running my code below]
#WHAT I'M TRYING TO DO
I'd appreciate any help in trying to speed up some code. I've written
a script that converts a matrix of integers (usually between 1-10,000
- these represent allele names) into two new matrices of normally
distributed values (representing genetic effects), such that a given
number in the integer (allele name) matrix always corresponds to the
*same* randomly distributed (genetic) effects.
For example, every time my function sees the allele name "3782", it
converts that integer into the same two effects (e.g., -.372 1.727),
which follow normal distributions (these effects can be correlated;
below I've set their correlation to .5). I have an entire matrix of
integers, and am converting those into two entire matrices of effects.
#HOW I'VE DONE IT SO FAR
To get the correlations between the effects, I've used the mvrnorm
function from "MASS"
To convert the allele names to genetic effects, I've created a
function (make.effect) that resets the set.seed() to the allele name
each time its called.
To get the matrix of genetic effects, I use sapply.
#THE PROBLEM
The problem is that I often need to convert matrices that have 500K
cells, and do this over several hundred iterations, so it is quite
slow. If anyone has ideas to speed this up (e.g., some clever way to
convert a matrix of integers to a matrix of effects without using
sapply), I would be forever indebted.
##MY CODE
library("MASS")
##run this example to see what I'm talking about above
make.effects <- function(x,mn=0,var=1,cov=.5){
set.seed(x)
return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))}
(alleles <-
matrix(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),ncol=4))
a12 <-
array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
(a1 <- a12[1,,])
(a2 <- a12[2,,])
#I've set the population correlation = .5; empirical corr=.4635
cor(as.vector(a1),as.vector(a2))
##run this to see that the code begins to get pretty slow with even a
3000x4 matrix
system.time({
alleles <-
matrix(rep(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),1000),ncol=4)
a12 <-
array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
a1 <- a12[1,,]
a2 <- a12[2,,]
})
--
Matthew C Keller
Postdoctoral Fellow
Virginia Institute for Psychiatric and Behavioral Genetics
jim holtman
2007-Mar-16 17:07 UTC
[R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values
Considering that the vast majority of your time is spent in the function
mvrnorm (on my system 5.7 out of 6.1 seconds). In your example that is
12000 calls to the function. To improve your speed you have to cut down the
number of calls to the function. For example, how many unique integers do
you have and can to do the calls for those and then substitute matching
values. Here is what Rprof showed:
total.time total.pct self.time self.pct
system.time 6.12 99.7 0.00 0.0
as.vector 6.06 98.7 0.18 2.9
FUN 6.06 98.7 0.12 2.0
array 6.06 98.7 0.10 1.6
lapply 6.06 98.7 0.00 0.0
sapply 6.06 98.7 0.00 0.0
eval 6.04 98.4 0.06 1.0
mvrnorm 5.72 93.2 0.34 5.5
eigen 2.58 42.0 0.52 8.5
or another way of looking at it:
0 6.1 root
1. 6.1 system.time
2. . 6.0 eval
3. . . 6.0 eval
4. . . . 6.0 array
5. . . . . 6.0 as.vector
6. . . . . . 6.0 sapply
7. . . . . . . 6.0 lapply
8. . . . . . . . 6.0 FUN
9. . . . . . . . . 5.7 mvrnorm
10. . . . . . . . . . 2.6 eigen
11. . . . . . . . . . . 1.2 sort.list
12. . . . . . . . . . . . 1.0 match.arg
13. . . . . . . . . . . . . 0.7 eval
On 3/16/07, Matthew Keller <mckellercran@gmail.com>
wrote:>
> Hi all,
>
> [this is a bit hard to describe, so if my initial description is
> confusing, please try running my code below]
>
> #WHAT I'M TRYING TO DO
> I'd appreciate any help in trying to speed up some code. I've
written
> a script that converts a matrix of integers (usually between 1-10,000
> - these represent allele names) into two new matrices of normally
> distributed values (representing genetic effects), such that a given
> number in the integer (allele name) matrix always corresponds to the
> *same* randomly distributed (genetic) effects.
>
> For example, every time my function sees the allele name "3782",
it
> converts that integer into the same two effects (e.g., -.372 1.727),
> which follow normal distributions (these effects can be correlated;
> below I've set their correlation to .5). I have an entire matrix of
> integers, and am converting those into two entire matrices of effects.
>
>
> #HOW I'VE DONE IT SO FAR
> To get the correlations between the effects, I've used the mvrnorm
> function from "MASS"
>
> To convert the allele names to genetic effects, I've created a
> function (make.effect) that resets the set.seed() to the allele name
> each time its called.
>
> To get the matrix of genetic effects, I use sapply.
>
>
> #THE PROBLEM
> The problem is that I often need to convert matrices that have 500K
> cells, and do this over several hundred iterations, so it is quite
> slow. If anyone has ideas to speed this up (e.g., some clever way to
> convert a matrix of integers to a matrix of effects without using
> sapply), I would be forever indebted.
>
>
> ##MY CODE
>
> library("MASS")
>
> ##run this example to see what I'm talking about above
>
> make.effects <- function(x,mn=0,var=1,cov=.5){
> set.seed(x)
>
>
return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))}
>
> (alleles <-
>
matrix(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),ncol=4))
>
> a12 <- array(sapply(alleles,make.effects
> ),dim=c(2,nrow(alleles),ncol(alleles)))
> (a1 <- a12[1,,])
> (a2 <- a12[2,,])
>
> #I've set the population correlation = .5; empirical corr=.4635
> cor(as.vector(a1),as.vector(a2))
>
> ##run this to see that the code begins to get pretty slow with even a
> 3000x4 matrix
>
> system.time({
>
> alleles <-
>
matrix(rep(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),1000),ncol=4)
>
> a12 <- array(sapply(alleles,make.effects
> ),dim=c(2,nrow(alleles),ncol(alleles)))
> a1 <- a12[1,,]
> a2 <- a12[2,,]
>
> })
>
>
>
>
> --
> Matthew C Keller
> Postdoctoral Fellow
> Virginia Institute for Psychiatric and Behavioral Genetics
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?
[[alternative HTML version deleted]]