I have written the following function to calculate the weighted mean difference for univariate data (see http://www.xycoon.com/gini_mean_difference.htm for a related formula). Unsurprisingly, the function is slow (compared to sd or mad) for long vectors. I wonder if there's a way to make the function faster, short of creating an external C function. Thanks very much for your advice. gmd <- function(x, w) { # x=data vector, w=weights vector n <- length(x) tmp <- 0 for (i in 1:n) { for (j in 1:n) { tmp <- tmp + w[i]*abs(x[i]-x[j]) } } retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w)) } gmd(rnorm(100)) Regards, Andrew C. Ward CAPE Centre Department of Chemical Engineering The University of Queensland Brisbane Qld 4072 Australia andreww at cheque.uq.edu.au
You can avoid the for loops with outer, but that will use more memory. gmd <- function(x, w) 0.5 * sqrt(pi) * sum(w * abs(outer(x, x, "-"))) / ((length(x)-1)*sum(w)) On Wednesday 23 April 2003 10:17 pm, Andrew C. Ward wrote:> I have written the following function to calculate the weighted mean > difference for univariate data (see > http://www.xycoon.com/gini_mean_difference.htm for a related > formula). Unsurprisingly, the function is slow (compared to sd or mad) > for long vectors. I wonder if there's a way to make the function > faster, short of creating an external C function. Thanks very much > for your advice. > > > gmd <- function(x, w) { # x=data vector, w=weights vector > n <- length(x) > tmp <- 0 > for (i in 1:n) { > for (j in 1:n) { > tmp <- tmp + w[i]*abs(x[i]-x[j]) > } > } > retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w)) > } > > gmd(rnorm(100)) > > > > Regards, > > Andrew C. Ward > > CAPE Centre > Department of Chemical Engineering > The University of Queensland > Brisbane Qld 4072 Australia > andreww at cheque.uq.edu.au > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://www.stat.math.ethz.ch/mailman/listinfo/r-help
DJNordlund@aol.com
2003-Apr-24 06:53 UTC
[R] Fast R implementation of Gini mean difference
Andrew, being a novice when it comes to R programming, I was not aware of how to use "outer". I looked at the possibility of just eliminating the inner loop of your function. I compared your original function, gmd, with the version submitted by Deepayan Sarkar , which I called gmd.1, and my version eliminating the inner loop, gmd.2. (See below) The timings suggest that gmd.2 may be a little faster than gmd.1, and the memory utilization may be better (I am not exactly sure how to compare the memory footprints). I did try running my version on a vector of length=10,000 and it ran in about 10 seconds on a 1.4MHz Pentium IV with 256k of RAM. I hope this may be of some help; I know I learned while looking at this problem. Dan Nordlund ## original Gini-mean-difference function gmd <- function(x, w) { # x=data vector, w=weights vector n <- length(x) tmp <- 0 for (i in 1:n) { for (j in 1:n) { tmp <- tmp + w[i]*abs(x[i]-x[j]) } } retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w)) } ## function using outer as per Deepayan Sarkar gmd.1 <- function(x, w) 0.5 * sqrt(pi) * sum(w * abs(outer(x, x, "-"))) / ((length(x)-1)*sum(w)) ## function eliminating only the inner loop gmd.2 <- function(x, w) { # x=data vector, w=weights vector n <- length(x) tmp <- 0 for (i in 1:n) tmp <- tmp + sum(w[i]*abs(x[i]-x)) retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w)) } ##results of running the three functions on vector of length=1000>w<-runif(1000) >z<-rnorm(1000)> system.time(gmd(z, w))[1] 18.48 0.26 19.41 NA NA> system.time(gmd.1(z, w))[1] 0.97 0.02 1.01 NA NA> system.time(gmd.2(z, w))[1] 0.14 0.00 0.15 NA NA>[[alternate HTML version deleted]]
On Thursday 24 April 2003 05:17, you wrote:> I have written the following function to calculate the weighted mean > difference for univariate data (see > http://www.xycoon.com/gini_mean_difference.htm?for a related > formula). Unsurprisingly, the function is slow (compared to sd or mad) > for long vectors. I wonder if there's a way to make the function > faster, short of creating an external C function. Thanks very much > for your advice. > > > gmd <- function(x, w) { # x=data vector, w=weights vector > ? ?n <- length(x) > ? ?tmp <- 0 > ? ?for (i in 1:n) { > ? ? ? for (j in 1:n) { > ? ? ? ? ?tmp <- tmp + w[i]*abs(x[i]-x[j]) > ? ? ? } > ? ?} > ? ?retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w)) > }I have a few remarks on the "definition" of Gini index, and one on the most suitable way to compute it. (a) there no 0.5*sqrt(pi) neither in the standard definition of the index, nor in the quoted source, at http://www.xycoon.com/gini_mean_difference.htm (b) when the values of x vector have frequencies, then the common definition knwon to me is not the above one, but one such that the nested line above should be replaced by ? ? ? ? ?tmp <- tmp + w[i]*w[j]*abs(x[i]-x[j]) and the final assignment should be 0.5*sqrt(pi)*tmp/((sum(w)-1)*sum(w)) In fact, the above function produces inconsident results; see this:> gmd(c(1,2,2,4),c(1,1,1,1))[1] 1.329> gmd(c(1,2,4), c(1,2,1))[1] 1.662 while the above modifications lead again to > gmd(c(1,2,4), c(1,2,1)) [1] 1.329 (c) from the computational viewpoint, there is a much more convenient equivalent expression, based on the order statistics, leading to the following function: gini.md<- function(x) { # x=data vector j <-order(x) n <-length(x) return(4*sum((1:length(x))*x[j])/(n*(n-1)) -2*mean(x)*(n+1)/(n-1)) } This is intendend for the basic case of unreplicated data. With some algebraic work one should be able to obtain a similar expression for the general case ...once an agreement has been achieved on the definition of the index. Clearly, to convert the output from gini.md() to gmd..() one must multiply by 0.5*sqrt(pi). Here is a short comparison of timing, following the one prepared by Dan Nordlund.> x<-rnorm(5000) > n<-rpois(5000,5) > w<-n/sum(n)> system.time(gmd(x, w))[1] 241.44 0.21 250.20 0.00 0.00> system.time(gmd.1(x, w))[1] 5.01 8.69 129.95 0.00 0.00> system.time(gmd.2(x, w))[1] 1.21 1.55 10.19 0.00 0.00> system.time(gini.md(x))[1] 0 0 0 0 0 also, the new function is requires little memory regards Adelchi Azzalini -- Adelchi Azzalini <azzalini at stat.unipd.it> Dipart.Scienze Statistiche, Universit? di Padova, Italia http://azzalini.stat.unipd.it/
This is to complement my previous contribution on computation of Gini mean difference - a discussion started by Andrew Ward. The index is "defined" as gini <- 0 for (i in 1:n) { for (j in 1:n) gini <- gini + freq[i]*freq[j]*abs(x[i]-x[j]) } gini<- gini/((sum(freq)-1)*sum(freq)) This is the so-called form "without repetition"; the variant "with repetition" does not have -1 in the final line. Since computaation via the definition is totally inefficient, alternative approaches have been put forward, following Andrew's message. My first version of a computationally convenient implementation was essentially this: gini.md0<- function(x) { # x=data vector n <-length(x) return(4*sum((1:length(x))*sort(x)/(n*(n-1))) -2*mean(x)*(n+1)/(n-1)) } Since Andrew (private message) has stressed the importance in his problem of allowing for replicated data, here is a more general version, obtained by elaborating on the previous one with a bit of algebra: gini.md <- function(x, freq=rep(1,length(x))) {# x=data vector, freq=vector of frequencies if(!is.vector(x)) stop("x must be a vector") if(length(x) != length(freq)) stop("x and freq must have same length") if(min(freq)<0 | sum(freq)==0 | any(freq != as.integer(freq)) ) stop("freq must be counts") x <- x[freq>0] freq <- freq[freq>0] j <- order(x) x <- x[j] n <- as.integer(freq[j]) n. <- sum(n) u <- (cumsum(n)-n)*n+ n*(n+1)/2 return(4*sum(u*x)/(n.*(n.-1)) -2*weighted.mean(x,n)*(n.+1)/(n.-1)) } Notice that gini.md(x,freq) gives the same of mini.md0(rep(x,freq)), but the latter is obviously less efficient. Either are however far more efficient that straight implementation of the "definition". regards Adelchi Azzalini -- Adelchi Azzalini <azzalini at stat.unipd.it> Dipart.Scienze Statistiche, Universit? di Padova, Italia http://azzalini.stat.unipd.it/
Thank you again to Professor Azzalini for taking the time to expound this issue for me. It's been very instructive to look in detail at the definition of Gini mean difference and at the incorporation of weights. Noting suggestions for improving the speed of my function has also been very helpful. With my application, the weights refer to the "reliability" of a measurement, with a weight of 1 signifying high reliability and a weight close to zero indicating low reliability. The mean difference is used as a robust estimate of the standard deviation of the vector of measurements (the 0.5*sqrt(pi) multiplier is for this purpose). It seems to me that Professor Azzalini's result may be used for non-integer weights if the weights are scaled to sum to the number of measurements (ie. w <- w * length(x)/sum(w)). In this case, I think that gmd(x=c(1,2,4), w=c(1,2,1)) gives the same result at gmd(x=c(1,2,2,4), w=c(1,1,1,1)). Thank you again to all who have contributed to my understanding and R implementation of the mean difference. Regards, Andrew C. Ward CAPE Centre Department of Chemical Engineering The University of Queensland Brisbane Qld 4072 Australia andreww at cheque.uq.edu.au On Monday, April 28, 2003 6:55 PM, Adelchi Azzalini [SMTP:azzalini at stat.unipd.it] wrote:> > This is to complement my previous contribution on computationof> Gini mean > difference - a discussion started by Andrew Ward. The index is > "defined" as > gini <- 0 > for (i in 1:n) > { > for (j in 1:n) gini <- gini + freq[i]*freq[j]*abs(x > [i]-x[j]) > } > gini<- gini/((sum(freq)-1)*sum(freq)) > > This is the so-called form "without repetition"; the variant > "with repetition" > does not have -1 in the final line. > > Since computaation via the definition is totally inefficient, > alternative > approaches have been put forward, following Andrew's message. > > My first version of a computationally convenient implementation > was > essentially this: > > gini.md0<- function(x) > { # x=data vector > n <-length(x) > return(4*sum((1:length(x))*sort(x)/(n*(n-1))) > -2*mean(x)*(n+1)/(n-1)) > } > > > Since Andrew (private message) has stressed the importance in > his problem > of allowing for replicated data, here is a more generalversion,> obtained by > elaborating on the previous one with a bit of algebra: > > gini.md <- function(x, freq=rep(1,length(x))) > {# x=data vector, freq=vector of frequencies > if(!is.vector(x)) stop("x must be a vector") > if(length(x) != length(freq)) > stop("x and freq must have same length") > if(min(freq)<0 | sum(freq)==0 | any(freq != as.integer(freq)) > ) > stop("freq must be counts") > x <- x[freq>0] > freq <- freq[freq>0] > j <- order(x) > x <- x[j] > n <- as.integer(freq[j]) > n. <- sum(n) > u <- (cumsum(n)-n)*n+ n*(n+1)/2 > return(4*sum(u*x)/(n.*(n.-1)) > -2*weighted.mean(x,n)*(n.+1)/(n.-1)) > } > > Notice that gini.md(x,freq) gives the same of mini.md0(rep > (x,freq)), but the latter > is obviously less efficient. Either are however far more > efficient that straight > implementation of the "definition". > > regards > > Adelchi Azzalini > > -- > Adelchi Azzalini <azzalini at stat.unipd.it> > Dipart.Scienze Statistiche, Universita di Padova, Italia > http://azzalini.stat.unipd.it/ > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://www.stat.math.ethz.ch/mailman/listinfo/r-help