Matthieu Dubois
2007-Jul-24 07:24 UTC
[R] function optimization: reducing the computing time
Dear useRs,
I have written a function that implements a Bayesian method to
compare a patient's score on two tasks with that of a small control
group, as described in Crawford, J. and Garthwaite, P. (2007).
Comparison of a single case to a control or normative sample in
neuropsychology: Development of a Bayesian approach. Cognitive
Neuropsychology, 24(4):343-372.
The function (see below) returns the expected results, but the time
needed to compute them is quite long (at least for a test that may be
routinely used). There is certainly room for improvement. It would
really be helpful if some of you experts could have a look ...
Thanks a lot.
Regards,
Matthieu
FUNCTION
----------
The function takes the performance on two tasks and estimates the
rarity (the p-value) of the difference between the patient's two
scores, in comparison with the difference in the control subjects. A
standardized and an unstandardized version are provided (controlled
by the parameter standardized: TRUE vs. FALSE). Also, for congruency
with the original publication, either the raw data or summary
statistics can be used for the control group; both call styles are
sketched just below.
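For illustration, the two interfaces might be called as follows (a
hypothetical sketch: 'ctrl' is an invented 5 x 2 matrix of control
scores; the summary values mirror the timing examples further down).

ctrl <- matrix(c( 93, 108,
                  95, 101,
                  99, 113,
                 104,  97,
                 100, 102), ncol=2, byrow=TRUE)
crawford.BSDT(c(95,105), controls=ctrl)                    # raw data
crawford.BSDT(c(95,105), mean.c=c(100,100), sd.c=c(10,10),
              r=0.6, n=5)                                  # summary statistics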
##################################################
# Bayesian (un)Standardized Difference Test
##################################################
#from Crawford and Garthwaite (2007) Cognitive Neuropsychology
# implemented by Matthieu Dubois, Matthieu.Dubois<at>psp.ucl.ac.be
#PACKAGE MCMCpack REQUIRED
# patient: a vector with the two scores
# controls: matrix/data.frame with the raw scores (one column per task)
# mean.c, sd.c, r, n: optional summary statistics
#   (means, standard deviations, correlation, group size)
# n.simul: number of simulations
# two.sided (Boolean): two-sided (TRUE) vs. one-sided (FALSE) Bayesian
#   credible interval
# standardized (Boolean): standardized (TRUE) vs. unstandardized (FALSE) test
# values are: $p.value (one-tailed), $confidence.interval
crawford.BSDT <- function(patient, controls, mean.c=0, sd.c=0, r=0,
                          n=0, na.rm=FALSE, n.simul=100000,
                          two.sided=TRUE, standardized=TRUE)
{
library(MCMCpack)
#if no summaries are entered, they are computed
if(missing(n))
{
if(!is.data.frame(controls)) controls <- as.data.frame(controls)
n <- dim(controls)[1]
mean.c <- mean(controls, na.rm=na.rm)
sd.c <- sd(controls, na.rm=na.rm)
na.method <- ifelse(na.rm, "complete.obs", "all.obs")
r <- cor(controls[,1], controls[,2], use=na.method)
}
#variance/covariance matrix
s.xx <- (sd.c[1]^2) * (n-1)
s.yy <- (sd.c[2]^2) * (n-1)
s.xy <- sd.c[1] * sd.c[2] * r * (n-1)
A <- matrix(c(s.xx, s.xy, s.xy, s.yy), ncol=2)
#estimation function
if(standardized)
{
estimation <- function(patient, mean.c, n, A)
{
#estimation of a variance/covariance matrix (sigma)
sigma <- riwish(n, A) #random draw from an inverse-Wishart distribution
#estimation of the means (mu)
z <- rnorm(2)
L <- t(chol(sigma)) #lower-triangular Cholesky factor ('T' would mask TRUE)
mu <- mean.c + L %*% z/sqrt(n)
#standardization
z.x <- (patient[1]-mu[1]) / sqrt(sigma[1,1])
z.y <- (patient[2]-mu[2]) / sqrt(sigma[2,2])
rho.xy <- sigma[1,2] / sqrt(sigma[1,1]*sigma[2,2])
z.star <- (z.x - z.y) / sqrt(2-2*rho.xy)
#conditional p-value
pnorm(z.star)
}
}
else
{
estimation <- function(patient, mean.c, n, A)
{
#estimation of a variance/covariance matrix (sigma)
sigma <- riwish(n, A) #random draw from an inverse-Wishart distribution
#estimation of the means (mu)
z <- rnorm(2)
L <- t(chol(sigma)) #lower-triangular Cholesky factor ('T' would mask TRUE)
mu <- mean.c + L %*% z/sqrt(n)
num <- (patient[1]-mu[1]) - (patient[2]-mu[2])
denom <- sqrt(sigma[1,1]+sigma[2,2]-(2*sigma[1,2]))
z.star <- num/denom
#conditional p-value
pnorm(z.star)
}
}
#application
p <- replicate(n.simul, estimation(patient, mean.c, n, A))
#outputs
pval <- mean(p)
CI <- if(two.sided) 100*quantile(p, c(0.025, 0.975)) else 100*quantile(p, 0.95)
output <- list(p.value=pval, confidence.interval=CI)
output
}
TIME ESTIMATION
--------------
# the values used in these examples are taken from the original paper
# system times are estimated for both the standardized and
# unstandardized versions
system.time(crawford.BSDT(c(95,105), mean.c=c(100,100), sd.c=c(10,10),
                          n=5, r=0.6, standardized=FALSE))
user system elapsed
230.709 19.686 316.464
system.time(crawford.BSDT(c(90,110), mean.c=c(100,100), sd.c=c(10,10),
                          n=5, r=0.6, standardized=TRUE))
user system elapsed
227.618 15.656 293.810
R version
-------
> sessionInfo()
R version 2.5.1 (2007-06-27)
powerpc-apple-darwin8.9.1
locale:
en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] "stats" "graphics" "grDevices"
"utils" "datasets"
[6] "methods" "base"
other attached packages:
MCMCpack MASS coda lattice
"0.8-2" "7.2-34" "0.11-2" "0.16-2"
Matthieu Dubois
Ph.D. Student
Cognition and Development Lab
Catholic University of Louvain
10, Place Cardinal Mercier
B-1348 Louvain-la-Neuve - Belgium
E-mail: Matthieu.Dubois at psp.ucl.ac.be
Web: http://www.code.ucl.ac.be/MatthieuDubois/
Prof Brian Ripley
2007-Jul-24 09:02 UTC
[R] function optimization: reducing the computing time
You need to make use of the profiling methods described in 'Writing R
Extensions' (a minimal Rprof() sketch follows the profile below). My
machine is about 4x faster than yours: I get
Each sample represents 0.02 seconds.
Total run time: 62.0800000000041 seconds.
Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.
% total % self
total seconds self seconds name
100.00 62.08 0.00 0.00 "system.time"
99.94 62.04 0.00 0.00 "crawford.BSDT"
99.94 62.04 0.00 0.00 "eval"
99.10 61.52 1.00 0.62 "lapply"
99.10 61.52 0.00 0.00 "sapply"
99.00 61.46 0.00 0.00 "replicate"
98.61 61.22 2.26 1.40 "FUN"
98.26 61.00 3.32 2.06 "estimation"
83.92 52.10 0.26 0.16 "riwish"
83.67 51.94 4.25 2.64 "solve"
55.57 34.50 7.18 4.46 "solve.default"
51.68 32.08 3.77 2.34 "rwish"
...
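For reference, a minimal sketch of how to produce a profile like the
one above (the file name 'bsdt.out' is arbitrary; summaryRprof()
gives an equivalent summary to the R CMD Rprof output shown here):

Rprof("bsdt.out", interval=0.02)     # start sampling every 0.02 s
crawford.BSDT(c(95,105), mean.c=c(100,100), sd.c=c(10,10),
              n=5, r=0.6, standardized=FALSE)
Rprof(NULL)                          # stop profiling
summaryRprof("bsdt.out")$by.total    # time per function and its callees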
From that profile, 84% of the time is being spent in riwish. Now,
given that A is fixed, you should be able to speed that up by
precomputing the constant parts of the computation (and you can also
precompute your 'T').
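One possible reading of that advice (a hedged sketch, not Prof.
Ripley's own code: it assumes MCMCpack's riwish(v, S) is equivalent
to solve(rwish(v, solve(S))), so the constant solve(A) can be hoisted
out of the simulation loop):

library(MCMCpack)   # for rwish()

fast.estimation <- function(patient, mean.c, n, A, n.simul=100000)
{
    A.inv <- solve(A)      # constant: computed once, not n.simul times
    sqrt.n <- sqrt(n)      # likewise constant
    p <- numeric(n.simul)
    for(i in seq_len(n.simul)) {
        sigma <- solve(rwish(n, A.inv))  # assumed same draw as riwish(n, A)
        L <- t(chol(sigma))              # lower Cholesky factor
        mu <- mean.c + L %*% rnorm(2)/sqrt.n
        z.x <- (patient[1]-mu[1]) / sqrt(sigma[1,1])
        z.y <- (patient[2]-mu[2]) / sqrt(sigma[2,2])
        rho <- sigma[1,2] / sqrt(sigma[1,1]*sigma[2,2])
        p[i] <- pnorm((z.x - z.y) / sqrt(2 - 2*rho))
    }
    p
}

Going further, rwish() itself calls chol() on its (here constant)
scale matrix at every draw, so inlining the Wishart draw with a
precomputed chol(A.inv) would save yet another decomposition per
iteration.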
On Tue, 24 Jul 2007, Matthieu Dubois wrote:
> Dear useRs,
>
> I have written a function that implements a Bayesian method to
> compare a patient's score on two tasks with that of a small control
> group, as described in Crawford and Garthwaite (2007).
>
> The function (see below) returns the expected results, but the time
> needed to compute them is quite long (at least for a test that may be
> routinely used). There is certainly room for improvement.
> [...]
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595