I discovered recently that the phrase "Oja median" produces no hits in
Jonathan Baron's very valuable R search engine. I found this surprising
since I've long regarded this idea as one of the more interesting notions
in the multivariate robustness literature. To begin to remedy this oversight
I wrote a bivariate version and then decided that writing a general p-variate
version might make a nice weekend programming puzzle for R-help.
Here are a few more details:
The Oja median of n p-variate observations minimizes over theta
in R^p the sum of the volumes of the simplices formed by theta,
and p of the observed points, the sum being taken over all n choose p
groups of p observations. Thus, in the bivariate case we are minimizing
the sum of the areas of all triangles formed by the the point theta
and pairs of observations. Here is a simple bivariate implementation:
oja.median <-function(x) {
#bivariate version -- x is assumed to be an n by 2 matrix
require(quantreg)
n <- dim(x)[1]
A <- matrix(rep(1:n, n), n)
i <- A[col(A) < row(A)]
j <- A[n + 1. - col(A) > row(A)]
xx <- cbind(x[i, ], x[j, ])
y <- xx[, 1] * xx[, 4] - xx[, 2] * xx[, 3]
z1 <- (xx[, 4] - xx[, 2])
z2 <- - (xx[, 3] - xx[, 1])
return(rq(y~cbind(z1, z2)-1)$coef)
}
To understand the strategy, note that the area of the triangle formed
by the points x_i = (x_i1,x_i2), x_j = (x_j1,x_j2),
and theta = (theta_1,theta_2) is given by the determinant,
| 1 1 1 |
Delta(x_i, x_j, theta) = .5 |y_i1 yj1 theta_1|.
|y_i2 yj2 theta_2|
Expanding the determinant in the unknown parameters theta gives
the l1 regression formulation. Remarkably, a result of Wilks says
that if the call to rq() is replaced with a call to lm() you get
the sample mean -- this gives an impressively inefficient least
squares regression based alternative to apply(x,2,mean)!
It also provides a useful debugging check for proposed algorithms.
Obviously, the expansion of the determinant gives the same formulation
for p>2, the challenge is to find a clean way to generate the
design matrix and response vector for the general setting.
Bon weekend!
url: www.econ.uiuc.edu/~roger/my.html Roger Koenker
email rkoenker at uiuc.edu Department of Economics
vox: 217-333-4558 University of Illinois
fax: 217-244-6678 Champaign, IL 61820
I am shocked and dismayed (and the term hasn't even started yet ;-) )
that none of you have turned in the weekend homework problem that
I assigned last Friday. At the risk of further embarrassment I am
posting an answer in the hope that this will inspire someone to suggest
some improvements. In particular you will see that the loop in the
function "cofactors" when subjected to the apply is painfully slow.
Using Rprof on an example with n=50 and p=3 shows that 160 seconds of the
167 needed were spent in this apply. Yes, I'm aware that this could
be recoded in [C, fortran, ...], but that would be considered cheating.
"mean.wilks" <- function(x){
# Computes the column means of the matrix x -- very sloooowly.
n <- dim(x)[1]
p <- dim(x)[2]
A <- t(combn(1:n,p))
X <- NULL
for(i in 1:p)
X <- cbind(X,x[A[,i],])
oja.ize <- function(v)cofactors(matrix(v,sqrt(length(v))))
A <- t(apply(X,1,oja.ize))
coef(lm(-A[,1]~A[,-1]-1))
}
"cofactors" <- function(A){
B <- rbind(1,cbind(A,1))
p <- ncol(B)
x <- rep(0,p)
for(i in 1:p)
x[i] <- ((-1)^(i+p)) *det(B[-i,-p])
return(x)
}
url: www.econ.uiuc.edu/~roger/my.html Roger Koenker
email rkoenker at uiuc.edu Department of Economics
vox: 217-333-4558 University of Illinois
fax: 217-244-6678 Champaign, IL 61820
On Fri, 15 Aug 2003, Roger Koenker wrote:
> I discovered recently that the phrase "Oja median" produces no
hits in
> Jonathan Baron's very valuable R search engine. I found this surprising
> since I've long regarded this idea as one of the more interesting
notions
> in the multivariate robustness literature. To begin to remedy this
oversight
> I wrote a bivariate version and then decided that writing a general
p-variate
> version might make a nice weekend programming puzzle for R-help.
>
> Here are a few more details:
>
> The Oja median of n p-variate observations minimizes over theta
> in R^p the sum of the volumes of the simplices formed by theta,
> and p of the observed points, the sum being taken over all n choose p
> groups of p observations. Thus, in the bivariate case we are minimizing
> the sum of the areas of all triangles formed by the the point theta
> and pairs of observations. Here is a simple bivariate implementation:
>
> oja.median <-function(x) {
> #bivariate version -- x is assumed to be an n by 2 matrix
> require(quantreg)
> n <- dim(x)[1]
> A <- matrix(rep(1:n, n), n)
> i <- A[col(A) < row(A)]
> j <- A[n + 1. - col(A) > row(A)]
> xx <- cbind(x[i, ], x[j, ])
> y <- xx[, 1] * xx[, 4] - xx[, 2] * xx[, 3]
> z1 <- (xx[, 4] - xx[, 2])
> z2 <- - (xx[, 3] - xx[, 1])
> return(rq(y~cbind(z1, z2)-1)$coef)
> }
>
> To understand the strategy, note that the area of the triangle formed
> by the points x_i = (x_i1,x_i2), x_j = (x_j1,x_j2),
> and theta = (theta_1,theta_2) is given by the determinant,
>
> | 1 1 1 |
> Delta(x_i, x_j, theta) = .5 |y_i1 yj1 theta_1|.
> |y_i2 yj2 theta_2|
>
> Expanding the determinant in the unknown parameters theta gives
> the l1 regression formulation. Remarkably, a result of Wilks says
> that if the call to rq() is replaced with a call to lm() you get
> the sample mean -- this gives an impressively inefficient least
> squares regression based alternative to apply(x,2,mean)!
> It also provides a useful debugging check for proposed algorithms.
>
> Obviously, the expansion of the determinant gives the same formulation
> for p>2, the challenge is to find a clean way to generate the
> design matrix and response vector for the general setting.
>
> Bon weekend!
>
> url: www.econ.uiuc.edu/~roger/my.html Roger Koenker
> email rkoenker at uiuc.edu Department of Economics
> vox: 217-333-4558 University of Illinois
> fax: 217-244-6678 Champaign, IL 61820
>
>
>
A report on my Oja Median problem:
Gabor Grothendieck suggested replacing my loopy cofactor function with:
cofactors2 <- function(x) {
x <- t(rbind(1,cbind(x,1)))
p <- nrow(x)
solve( x, (seq(p)==p)+0) * det(x)
}
Which is much better, especially if the dimension p is large. Unfortunately,
it is still quite slow inside the apply loop of the main function, and
this suggests that it is probably necessary to recode in C or Fortran
to get something that would work well in realistic problems. In case
anyone else would like to play with this...here is the original version
again and a timing comparison:
"mean.wilks" <- function(x){
# Computes the column means of the matrix x -- very sloooowly.
n <- dim(x)[1]
p <- dim(x)[2]
A <- t(combn(1:n,p))
X <- NULL
for(i in 1:p)
X <- cbind(X,x[A[,i],])
oja.ize <- function(v)cofactors(matrix(v,sqrt(length(v))))
A <- t(apply(X,1,oja.ize))
coef(lm(-A[,1]~A[,-1]-1))
}
"cofactors" <- function(A){
B <- rbind(1,cbind(A,1))
p <- ncol(B)
x <- rep(0,p)
for(i in 1:p)
x[i] <- ((-1)^(i+p)) *det(B[-i,-p])
return(x)
}
______________________________________
X <- matrix(rnorm(150),50)
t1 <- unix.time(mean.wilks(X)->m1)
t2 <- unix.time(mean.wilks2(X)->m2) # this uses cofactors2
t3 <- unix.time(apply(X,2,mean)->m3)> m1
A[, -1]1 A[, -1]2 A[, -1]3
0.09205914 0.05603060 -0.24290857> m2
A[, -1]1 A[, -1]2 A[, -1]3
0.09205914 0.05603060 -0.24290857> m3
[1] 0.09205914 0.05603060 -0.24290857> t1
[1] 147.70 25.33 173.41 0.00 0.00> t2
[1] 108.97 21.87 131.23 0.00 0.00> t3
[1] 0 0 0 0 0
NB. A reminder that this mean version is only a testing ground for
the median version of Oja which replaces the lm() call with a median
regression call.
url: www.econ.uiuc.edu/~roger/my.html Roger Koenker
email rkoenker at uiuc.edu Department of Economics
vox: 217-333-4558 University of Illinois
fax: 217-244-6678 Champaign, IL 61820