I discovered recently that the phrase "Oja median" produces no hits in Jonathan Baron's very valuable R search engine. I found this surprising since I've long regarded this idea as one of the more interesting notions in the multivariate robustness literature. To begin to remedy this oversight I wrote a bivariate version and then decided that writing a general p-variate version might make a nice weekend programming puzzle for R-help. Here are a few more details: The Oja median of n p-variate observations minimizes over theta in R^p the sum of the volumes of the simplices formed by theta, and p of the observed points, the sum being taken over all n choose p groups of p observations. Thus, in the bivariate case we are minimizing the sum of the areas of all triangles formed by the the point theta and pairs of observations. Here is a simple bivariate implementation: oja.median <-function(x) { #bivariate version -- x is assumed to be an n by 2 matrix require(quantreg) n <- dim(x)[1] A <- matrix(rep(1:n, n), n) i <- A[col(A) < row(A)] j <- A[n + 1. - col(A) > row(A)] xx <- cbind(x[i, ], x[j, ]) y <- xx[, 1] * xx[, 4] - xx[, 2] * xx[, 3] z1 <- (xx[, 4] - xx[, 2]) z2 <- - (xx[, 3] - xx[, 1]) return(rq(y~cbind(z1, z2)-1)$coef) } To understand the strategy, note that the area of the triangle formed by the points x_i = (x_i1,x_i2), x_j = (x_j1,x_j2), and theta = (theta_1,theta_2) is given by the determinant, | 1 1 1 | Delta(x_i, x_j, theta) = .5 |y_i1 yj1 theta_1|. |y_i2 yj2 theta_2| Expanding the determinant in the unknown parameters theta gives the l1 regression formulation. Remarkably, a result of Wilks says that if the call to rq() is replaced with a call to lm() you get the sample mean -- this gives an impressively inefficient least squares regression based alternative to apply(x,2,mean)! It also provides a useful debugging check for proposed algorithms. Obviously, the expansion of the determinant gives the same formulation for p>2, the challenge is to find a clean way to generate the design matrix and response vector for the general setting. Bon weekend! url: www.econ.uiuc.edu/~roger/my.html Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820
I am shocked and dismayed (and the term hasn't even started yet ;-) ) that none of you have turned in the weekend homework problem that I assigned last Friday. At the risk of further embarrassment I am posting an answer in the hope that this will inspire someone to suggest some improvements. In particular you will see that the loop in the function "cofactors" when subjected to the apply is painfully slow. Using Rprof on an example with n=50 and p=3 shows that 160 seconds of the 167 needed were spent in this apply. Yes, I'm aware that this could be recoded in [C, fortran, ...], but that would be considered cheating. "mean.wilks" <- function(x){ # Computes the column means of the matrix x -- very sloooowly. n <- dim(x)[1] p <- dim(x)[2] A <- t(combn(1:n,p)) X <- NULL for(i in 1:p) X <- cbind(X,x[A[,i],]) oja.ize <- function(v)cofactors(matrix(v,sqrt(length(v)))) A <- t(apply(X,1,oja.ize)) coef(lm(-A[,1]~A[,-1]-1)) } "cofactors" <- function(A){ B <- rbind(1,cbind(A,1)) p <- ncol(B) x <- rep(0,p) for(i in 1:p) x[i] <- ((-1)^(i+p)) *det(B[-i,-p]) return(x) } url: www.econ.uiuc.edu/~roger/my.html Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820 On Fri, 15 Aug 2003, Roger Koenker wrote:> I discovered recently that the phrase "Oja median" produces no hits in > Jonathan Baron's very valuable R search engine. I found this surprising > since I've long regarded this idea as one of the more interesting notions > in the multivariate robustness literature. To begin to remedy this oversight > I wrote a bivariate version and then decided that writing a general p-variate > version might make a nice weekend programming puzzle for R-help. > > Here are a few more details: > > The Oja median of n p-variate observations minimizes over theta > in R^p the sum of the volumes of the simplices formed by theta, > and p of the observed points, the sum being taken over all n choose p > groups of p observations. Thus, in the bivariate case we are minimizing > the sum of the areas of all triangles formed by the the point theta > and pairs of observations. Here is a simple bivariate implementation: > > oja.median <-function(x) { > #bivariate version -- x is assumed to be an n by 2 matrix > require(quantreg) > n <- dim(x)[1] > A <- matrix(rep(1:n, n), n) > i <- A[col(A) < row(A)] > j <- A[n + 1. - col(A) > row(A)] > xx <- cbind(x[i, ], x[j, ]) > y <- xx[, 1] * xx[, 4] - xx[, 2] * xx[, 3] > z1 <- (xx[, 4] - xx[, 2]) > z2 <- - (xx[, 3] - xx[, 1]) > return(rq(y~cbind(z1, z2)-1)$coef) > } > > To understand the strategy, note that the area of the triangle formed > by the points x_i = (x_i1,x_i2), x_j = (x_j1,x_j2), > and theta = (theta_1,theta_2) is given by the determinant, > > | 1 1 1 | > Delta(x_i, x_j, theta) = .5 |y_i1 yj1 theta_1|. > |y_i2 yj2 theta_2| > > Expanding the determinant in the unknown parameters theta gives > the l1 regression formulation. Remarkably, a result of Wilks says > that if the call to rq() is replaced with a call to lm() you get > the sample mean -- this gives an impressively inefficient least > squares regression based alternative to apply(x,2,mean)! > It also provides a useful debugging check for proposed algorithms. > > Obviously, the expansion of the determinant gives the same formulation > for p>2, the challenge is to find a clean way to generate the > design matrix and response vector for the general setting. > > Bon weekend! > > url: www.econ.uiuc.edu/~roger/my.html Roger Koenker > email rkoenker at uiuc.edu Department of Economics > vox: 217-333-4558 University of Illinois > fax: 217-244-6678 Champaign, IL 61820 > > >
A report on my Oja Median problem: Gabor Grothendieck suggested replacing my loopy cofactor function with: cofactors2 <- function(x) { x <- t(rbind(1,cbind(x,1))) p <- nrow(x) solve( x, (seq(p)==p)+0) * det(x) } Which is much better, especially if the dimension p is large. Unfortunately, it is still quite slow inside the apply loop of the main function, and this suggests that it is probably necessary to recode in C or Fortran to get something that would work well in realistic problems. In case anyone else would like to play with this...here is the original version again and a timing comparison: "mean.wilks" <- function(x){ # Computes the column means of the matrix x -- very sloooowly. n <- dim(x)[1] p <- dim(x)[2] A <- t(combn(1:n,p)) X <- NULL for(i in 1:p) X <- cbind(X,x[A[,i],]) oja.ize <- function(v)cofactors(matrix(v,sqrt(length(v)))) A <- t(apply(X,1,oja.ize)) coef(lm(-A[,1]~A[,-1]-1)) } "cofactors" <- function(A){ B <- rbind(1,cbind(A,1)) p <- ncol(B) x <- rep(0,p) for(i in 1:p) x[i] <- ((-1)^(i+p)) *det(B[-i,-p]) return(x) } ______________________________________ X <- matrix(rnorm(150),50) t1 <- unix.time(mean.wilks(X)->m1) t2 <- unix.time(mean.wilks2(X)->m2) # this uses cofactors2 t3 <- unix.time(apply(X,2,mean)->m3)> m1A[, -1]1 A[, -1]2 A[, -1]3 0.09205914 0.05603060 -0.24290857> m2A[, -1]1 A[, -1]2 A[, -1]3 0.09205914 0.05603060 -0.24290857> m3[1] 0.09205914 0.05603060 -0.24290857> t1[1] 147.70 25.33 173.41 0.00 0.00> t2[1] 108.97 21.87 131.23 0.00 0.00> t3[1] 0 0 0 0 0 NB. A reminder that this mean version is only a testing ground for the median version of Oja which replaces the lm() call with a median regression call. url: www.econ.uiuc.edu/~roger/my.html Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820