thr3ads.net - R help - [R] Fast R implementation of Gini mean difference [Apr 2003]

If this information is useful, please help other people find it:
Share via:

Andrew C. Ward

2003-Apr-24 03:17 UTC

[R] Fast R implementation of Gini mean difference

I have written the following function to calculate the weighted mean
difference for univariate data (see 
http://www.xycoon.com/gini_mean_difference.htm for a related 
formula). Unsurprisingly, the function is slow (compared to sd or mad)
for long vectors. I wonder if there's a way to make the function
faster, short of creating an external C function. Thanks very much
for your advice.


gmd <- function(x, w) { # x=data vector, w=weights vector
   n <- length(x)
   tmp <- 0
   for (i in 1:n) {
      for (j in 1:n) {
         tmp <- tmp + w[i]*abs(x[i]-x[j])
      }
   }
   retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
}

gmd(rnorm(100))



Regards,

Andrew C. Ward

CAPE Centre
Department of Chemical Engineering
The University of Queensland
Brisbane Qld 4072 Australia
andreww at cheque.uq.edu.au

Deepayan Sarkar

2003-Apr-24 04:05 UTC

head link

[R] Fast R implementation of Gini mean difference

You can avoid the for loops with outer, but that will use more memory.

gmd <- 
   function(x, w) 0.5 * sqrt(pi) * sum(w * abs(outer(x, x, "-"))) /
       ((length(x)-1)*sum(w))


On Wednesday 23 April 2003 10:17 pm, Andrew C. Ward
wrote:> I have written the following function to calculate the weighted mean
> difference for univariate data (see
> http://www.xycoon.com/gini_mean_difference.htm for a related
> formula). Unsurprisingly, the function is slow (compared to sd or mad)
> for long vectors. I wonder if there's a way to make the function
> faster, short of creating an external C function. Thanks very much
> for your advice.
>
>
> gmd <- function(x, w) { # x=data vector, w=weights vector
>    n <- length(x)
>    tmp <- 0
>    for (i in 1:n) {
>       for (j in 1:n) {
>          tmp <- tmp + w[i]*abs(x[i]-x[j])
>       }
>    }
>    retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
> }
>
> gmd(rnorm(100))
>
>
>
> Regards,
>
> Andrew C. Ward
>
> CAPE Centre
> Department of Chemical Engineering
> The University of Queensland
> Brisbane Qld 4072 Australia
> andreww at cheque.uq.edu.au
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help

DJNordlund@aol.com

2003-Apr-24 06:53 UTC

head link

[R] Fast R implementation of Gini mean difference

Andrew,

being a novice when it comes to R programming, I was not aware of how to use 
"outer".  I looked at the possibility of just eliminating the inner
loop of
your function.  I compared your original function, gmd, with the version 
submitted by Deepayan Sarkar , which I called gmd.1, and my version 
eliminating the inner loop, gmd.2.  (See below)

The timings suggest that gmd.2 may be a little faster than gmd.1, and the 
memory utilization may be better (I am not exactly sure how to compare the 
memory footprints).  I did try running my version on a vector of 
length=10,000 and it ran in about 10 seconds on a 1.4MHz Pentium IV with 256k 
of RAM.

I hope this may be of some help; I know I learned while looking at this 
problem.

Dan Nordlund



## original Gini-mean-difference function 
gmd <- function(x, w) { # x=data vector, w=weights vector
   n <- length(x)
   tmp <- 0
   for (i in 1:n) {
      for (j in 1:n) {
         tmp <- tmp + w[i]*abs(x[i]-x[j])
      }
   }
   retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
}

## function using outer as per Deepayan Sarkar
gmd.1 <- 
   function(x, w) 0.5 * sqrt(pi) * sum(w * abs(outer(x, x, "-"))) /
       ((length(x)-1)*sum(w))

## function eliminating only the inner loop
gmd.2 <- function(x, w) { # x=data vector, w=weights vector
   n <- length(x)
   tmp <- 0
   for (i in 1:n) tmp <- tmp + sum(w[i]*abs(x[i]-x))
   retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
}

##results of running the three functions on vector of
length=1000>w<-runif(1000)
>z<-rnorm(1000)
> system.time(gmd(z, w))[1] 18.48  0.26 19.41    NA    NA
> system.time(gmd.1(z, w))[1] 0.97 0.02 1.01   NA   NA
> system.time(gmd.2(z, w))
[1] 0.14 0.00 0.15   NA   NA>
	[[alternate HTML version deleted]]

Adelchi Azzalini

2003-Apr-24 13:41 UTC

head link

[R] Fast R implementation of Gini mean difference

On Thursday 24 April 2003 05:17, you wrote:> I have written the following function to calculate the weighted mean
> difference for univariate data (see
> http://www.xycoon.com/gini_mean_difference.htm?for a related
> formula). Unsurprisingly, the function is slow (compared to sd or mad)
> for long vectors. I wonder if there's a way to make the function
> faster, short of creating an external C function. Thanks very much
> for your advice.
>
>
> gmd <- function(x, w) { # x=data vector, w=weights vector
> ? ?n <- length(x)
> ? ?tmp <- 0
> ? ?for (i in 1:n) {
> ? ? ? for (j in 1:n) {
> ? ? ? ? ?tmp <- tmp + w[i]*abs(x[i]-x[j])
> ? ? ? }
> ? ?}
> ? ?retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
> }
I have a few remarks on the "definition" of Gini index, and one on the
most
suitable way to compute it.

(a) there no 0.5*sqrt(pi) neither in the standard definition of the index, nor
   in the quoted source, at http://www.xycoon.com/gini_mean_difference.htm
(b) when the values of x vector have frequencies, then the common
   definition knwon to me is not the above one, but one such that the nested
line
   above should be replaced by
? ? ? ? ?tmp <- tmp + w[i]*w[j]*abs(x[i]-x[j])
   and the final assignment should be
          0.5*sqrt(pi)*tmp/((sum(w)-1)*sum(w))
  In fact, the above function produces inconsident results; see
this:>  gmd(c(1,2,2,4),c(1,1,1,1))
[1] 1.329> gmd(c(1,2,4), c(1,2,1))[1] 1.662

while the above modifications lead again to 
 > gmd(c(1,2,4), c(1,2,1))
[1] 1.329


(c) from the computational viewpoint, there is a much more convenient equivalent
   expression, based on the order statistics, leading to the following function:

gini.md<- function(x)
  { # x=data vector
    j <-order(x)
    n <-length(x)
    return(4*sum((1:length(x))*x[j])/(n*(n-1))
          -2*mean(x)*(n+1)/(n-1))
  }

This is intendend for the basic case of unreplicated data. With some algebraic
work
one should be able to obtain a similar expression for the general case ...once
an
agreement has been achieved on the definition of the index.

Clearly, to convert the output from gini.md() to gmd..() one must multiply by
0.5*sqrt(pi).

Here is a short comparison of timing, following the one prepared by Dan
Nordlund.
> x<-rnorm(5000)
>  n<-rpois(5000,5)
> w<-n/sum(n)
>  system.time(gmd(x, w))
[1] 241.44   0.21 250.20   0.00   0.00>  system.time(gmd.1(x, w))
[1]   5.01   8.69 129.95   0.00   0.00>  system.time(gmd.2(x, w))
[1]  1.21  1.55 10.19  0.00  0.00> system.time(gini.md(x))[1] 0 0 0 0 0

also, the new function is  requires little memory

regards

Adelchi Azzalini

-- 
Adelchi Azzalini  <azzalini at stat.unipd.it>
Dipart.Scienze Statistiche, Universit? di Padova, Italia
http://azzalini.stat.unipd.it/

Adelchi Azzalini

2003-Apr-28 08:55 UTC

head link

[R] Fast R implementation of Gini mean difference

This is to complement my previous contribution on computation of Gini mean  
difference - a discussion started by Andrew Ward. The index is
"defined" as
    gini  <- 0
      for (i in 1:n) 
         {
         for (j in 1:n)  gini <- gini + freq[i]*freq[j]*abs(x[i]-x[j])
         }
    gini<- gini/((sum(freq)-1)*sum(freq))

This is  the so-called form "without repetition"; the variant
"with repetition"
does not have -1 in the final line.

Since computaation via the definition is totally inefficient, alternative
approaches have been put forward, following Andrew's message.

My first version of a computationally convenient implementation was
essentially this:

gini.md0<- function(x)
 { # x=data vector
   n <-length(x)
   return(4*sum((1:length(x))*sort(x)/(n*(n-1)))
       -2*mean(x)*(n+1)/(n-1))
  }


Since Andrew (private message) has stressed the importance in his problem
of allowing for replicated data, here is a more general version, obtained by 
elaborating on the previous one with a bit of algebra:

gini.md <- function(x, freq=rep(1,length(x)))
{# x=data vector, freq=vector of frequencies
  if(!is.vector(x)) stop("x must be a vector")
  if(length(x) != length(freq)) 
       stop("x and freq must have same length")  
  if(min(freq)<0 | sum(freq)==0 | any(freq != as.integer(freq)) ) 
             stop("freq must be counts")
     x <- x[freq>0]
     freq <- freq[freq>0]
     j <- order(x)
     x <- x[j]
     n <- as.integer(freq[j])
     n. <- sum(n)
     u <- (cumsum(n)-n)*n+ n*(n+1)/2
     return(4*sum(u*x)/(n.*(n.-1))
         -2*weighted.mean(x,n)*(n.+1)/(n.-1))
}
  
Notice that gini.md(x,freq) gives the same of mini.md0(rep(x,freq)), but the
latter
is obviously less efficient. Either are however far more efficient that straight
implementation of the "definition".

regards

Adelchi Azzalini

-- 
Adelchi Azzalini  <azzalini at stat.unipd.it>
Dipart.Scienze Statistiche, Universit? di Padova, Italia
http://azzalini.stat.unipd.it/

Andrew C. Ward

2003-Apr-28 20:02 UTC

head link

[R] Fast R implementation of Gini mean difference

Thank you again to Professor Azzalini for taking the time to
expound this issue for me. It's been very instructive to look
in detail at the definition of Gini mean difference and at the
incorporation of weights. Noting suggestions for improving the
speed of my function has also been very helpful.

With my application, the weights refer to the "reliability" of
a measurement, with a weight of 1 signifying high reliability
and a weight close to zero indicating low reliability. The
mean difference is used as a robust estimate of the standard
deviation of the vector of measurements (the 0.5*sqrt(pi)
multiplier is for this purpose).

It seems to me that Professor Azzalini's result may be used for
non-integer weights if the weights are scaled to sum to the
number of measurements (ie. w <- w * length(x)/sum(w)). In this
case, I think that gmd(x=c(1,2,4), w=c(1,2,1)) gives the same
result at gmd(x=c(1,2,2,4), w=c(1,1,1,1)).

Thank you again to all who have contributed to my understanding
and R implementation of the mean difference.


Regards,

Andrew C. Ward

CAPE Centre
Department of Chemical Engineering
The University of Queensland
Brisbane Qld 4072 Australia
andreww at cheque.uq.edu.au



On Monday, April 28, 2003 6:55 PM, Adelchi Azzalini 
[SMTP:azzalini at stat.unipd.it] wrote:>
> This is to complement my previous contribution on computation 
of> Gini mean
> difference - a discussion started by Andrew Ward. The index is
> "defined" as
>     gini  <- 0
>       for (i in 1:n)
>          {
>          for (j in 1:n)  gini <- gini + freq[i]*freq[j]*abs(x
>          [i]-x[j])
>          }
>     gini<- gini/((sum(freq)-1)*sum(freq))
>
> This is  the so-called form "without repetition"; the variant
> "with repetition"
> does not have -1 in the final line.
>
> Since computaation via the definition is totally inefficient,
> alternative
> approaches have been put forward, following Andrew's message.
>
> My first version of a computationally convenient implementation
> was
> essentially this:
>
> gini.md0<- function(x)
>  { # x=data vector
>    n <-length(x)
>    return(4*sum((1:length(x))*sort(x)/(n*(n-1)))
>        -2*mean(x)*(n+1)/(n-1))
>   }
>
>
> Since Andrew (private message) has stressed the importance in
> his problem
> of allowing for replicated data, here is a more general 
version,> obtained by
> elaborating on the previous one with a bit of algebra:
>
> gini.md <- function(x, freq=rep(1,length(x)))
> {# x=data vector, freq=vector of frequencies
>   if(!is.vector(x)) stop("x must be a vector")
>   if(length(x) != length(freq))
>        stop("x and freq must have same length")
>   if(min(freq)<0 | sum(freq)==0 | any(freq != as.integer(freq))
>   )
>              stop("freq must be counts")
>      x <- x[freq>0]
>      freq <- freq[freq>0]
>      j <- order(x)
>      x <- x[j]
>      n <- as.integer(freq[j])
>      n. <- sum(n)
>      u <- (cumsum(n)-n)*n+ n*(n+1)/2
>      return(4*sum(u*x)/(n.*(n.-1))
>          -2*weighted.mean(x,n)*(n.+1)/(n.-1))
> }
>
> Notice that gini.md(x,freq) gives the same of mini.md0(rep
> (x,freq)), but the latter
> is obviously less efficient. Either are however far more
> efficient that straight
> implementation of the "definition".
>
> regards
>
> Adelchi Azzalini
>
> --
> Adelchi Azzalini  <azzalini at stat.unipd.it>
> Dipart.Scienze Statistiche, Universita di Padova, Italia
> http://azzalini.stat.unipd.it/
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help

Maybe Matching Threads

Search for more possibly parallel threads

R help - Apr 2003 - Fast R implementation of Gini mean difference

[R] Fast R implementation of Gini mean difference

[R] Fast R implementation of Gini mean difference

[R] Fast R implementation of Gini mean difference

[R] Fast R implementation of Gini mean difference

[R] Fast R implementation of Gini mean difference

[R] Fast R implementation of Gini mean difference

Maybe Matching Threads