R developers, I have a first attempt to make an aov function. Eventually I want to build in Error() structure, but first I am trying to get this presentable for balanced data with only a single stratum, just using residual error. I am following R. M. Heiberger's Computation for the Analysis of Designed Experiments, Wiley (1989) I a using a wrapper (aov.bal) to call the fitting function in the same way that lm() does a bunch of stuff and then calls lm.fit(), in fact I took most of the wrapper right out of lm(). I'm trying to make the aov object look as much as possible like an lm.object Several questions have come up which need discussion. Q 1) Is aov supposed to avoid qr decomposition and matrix inversion? In S it is claimed to be faster than lm() for large datasets. Is that due to avoidance of qr()? (My main goal is to get the Error strata working.) Q 2) If the speed is important, can we give up choice of contrasts used to set up the dummy factor variables? ( contr.treatment, contr.sum) Why do I ask? In pursuit of speed (we're talking nanoseconds for a reasonable size data set) I have forced in helmert contrasts because their orthogonality makes it slick to compute. So in this version, whatever option(contrasts) is set will be ignored. I don't like making that choice for the user esp. since it gives wrong answers under the default contrasts, and have thought about converting back into the original contrasts, but it isn't easy with higher-way anova models. It looks easy in VR2 p 197-204 (BTW I have kroenecker and ginv functions if anyone wants them) but I don't get it to work out nicely. V&R show the contrast transformation as: alpha_T = ginv(C_T) %*% C_H %*% alpha_H where C_T is the contrast matrix for treatment contrasts, and C_H for Helmert contrasts. The alpha's contain only the treatment contrast effects, not the mean. This would leave the intercept estimate unchanged (not right). For single factors it works to stick a col.vector of 1s on the front of C_H and C_T, and include the mu.hat in the alpha_H or alpha_T. But when I get into interactions, I don't see what is needed. Q 3) I am having a problem with keep.order. With balanced data, I know the effects are orthogonal, so order is not an issue, but as I move onto the tougher cases, I need model.matrix to preserve order, and it doesn't. Q 4) What are the $effects in lm.objects? For the moment I've stuck in the coefficients. Q 5) Any objection to passing cov.unscaled from aov to summary.aov? (Since I'm not using qr and can't pass R) Feedback needed, comments are welcome. Jim Robison-Cox ____________ Department of Math Sciences | | phone: (406)994-5340 2-214 Wilson Hall \ BZN, MT | FAX: (406)994-1789 Montana State University | *_______| Bozeman, MT 59717 \_| e-mail: jimrc@math.montana.edu #--------------------------------------------------------------------------- ## First hack at aov for balanced data aov.bal <- function(formula, data = list(), subset, weights, na.action, method = "qr", model = TRUE, keep.order=F, ...) { mt <- terms(formula, data = data, keep.order = keep.order) nt <- length(attr(mt,"order")[attr(mt,"order") ==1]) lt <- rep("contr.helmert",nt) ## Sets helmert contrasts names(lt) <- attr(mt,"term.labels")[ attr(mt,"order") ==1] mf <- match.call() mf$model <- NULL mf[[1]] <- as.name("model.frame") mf <- eval(mf, sys.frame(sys.parent())) if (method == "model.frame") return(mf) else if (method != "qr") ## change for aov?? warning(paste("method =", method, "is not supported. Using \"qr\".")) if (length(list(...))) warning(paste("Extra arguments", deparse(substitute(...)), "are just disregarded.")) y <- model.response(mf, "numeric") w <- model.weights(mf) if (is.empty.model(mt)) { z <- list(coefficients = numeric(0), residuals = y, fitted.values = 0 * y, weights = w, rank = 0, df.residual = length(y)) class(z) <- if (is.matrix(y)) c("mlm.null", "lm.null", "mlm", "lm") else c("lm.null", "lm") } else { x <- model.matrix(mt, mf, as.list(lt)) # forces in contr.helmert z <- if (is.null(w)) { aov.balanced.fit(x, y, mt) } else { stop("Weighted aov not implemented, use lm") #aov.balanced.wfit(x, y, mt, w) } class(z) <- c(if (is.matrix(y)) c("maov","mlm"), c("aov","lm")) } z$call <- match.call() z$terms <- mt if (model) z$model <- mf z } aov.balanced.fit <- function(x,y,mod.terms,wt=NULL){ x <- as.matrix(x); dimx <- dim(x) xnames <- dimnames(x)[[2]] y <- as.matrix(y); dimy <- dim(y) assn <- attr(x,"assign") df <- c(table(assn), dimy[1]- dimx[2]) D <- diag(1/diag(t(x) %*% x)) dimnames(D) <- list(xnames,xnames) coef <- as.numeric(D %*% t(x) %*% y) names(coef) <- xnames fits <- x %*% coef resid <- y - fits eff <- apply((t(x) * rep(coef,dimx[1]))^2,1,sum) ## Take each column of x (row of t(x)) times its ## coefficient, square the result and sum (over columns of x) eff <- c(sapply( split(eff,assn), sum), sum(resid^2)) ## Adds them up over the factors. if(attr(mod.terms,"intercept") == 1) { eff <- eff[-1] df <- df[-1] } name <- c(attr(mod.terms,"term.labels"), "Residuals") names(eff) <- name names(df) <- name list(coefficients= coef, residuals = resid, effects = eff, rank= dimx[2], fitted.values = fits, assign= assn, cov.unscaled = D, df.residual = dimy[1] - dimx[2], df = df ) } print.aov <- function (x, digits = max(3, .Options$digits - 3), ...) { lenx <- length(x$df) tabl <- cbind(x$df,x$effects, x$effects/x$df) tabl <- cbind(tabl,c(tabl[1:(lenx-1),3]/tabl[lenx,3],NA)) tabl <- cbind(tabl,1-pf(tabl[,4],tabl[,1],tabl[lenx,1])) dimnames(tabl) <- list(names(x$effects), c("df","SS","MS","F","P > F")) cat("\nAnalysis of Variance:\n\n") print.default(round(tabl, digits), na = "", print.gap = 2) cat("\n") invisible(x) } summary.aov <- function (z, correlation=F) { p <- z$rank p1 <- 1:p r <- resid(z) f <- fitted(z) n <- length(f) w <- weights(z) if (is.null(z$terms)) { stop("invalid 'aov' object: no terms component") } else { if (is.null(w)) { mss <- if (attr(z$terms, "intercept")) sum((f - mean(f))^2) else sum(f^2) rss <- sum(r^2) } } resvar <- rss/(n - p) se <- sqrt(resvar * diag(z$cov.unscaled)) est <- z$coefficients tval <- est/se ans <- z[c("call", "terms")] ans$residuals <- as.numeric(r) ans$coefficients <- cbind(est, se, tval, 2 * (1 - pt(abs(tval), n - p))) dimnames(ans$coefficients) <- list(names(z$coefficients), c("Estimate", "Std.Error", "t Value", "Pr(>|t|)")) ans$sigma <- sqrt(resvar) ans$df <- c(p, n - p) if (p != attr(z$terms, "intercept")) { df.int <- if (attr(z$terms, "intercept")) 1 else 0 ans$r.squared <- mss/(mss + rss) #0.14 : (n/(n-p)) ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/(n - p)) ans$fstatistic <- c((mss/(p - df.int))/(rss/(n - p)), p - df.int, n - p) #0.14: ans$fstatistic <- c((mss/(p-1))/(rss/(n-p)),p-1,n-p) names(ans$fstatistic) <- c("value", "numdf", "dendf") } ans$cov.unscaled <- z$cov.unscaled if (correlation) { invsqrt.diag <- diag(sqrt(1/diag(z$cov.unscaled))) ans$correlation <- invsqrt.diag %*% z$cov.unscaled %*% invsqrt.diag dimnames(ans$correlation) <- dimnames(z$cov.unscaled) } class(ans) <- "summary.aov" ans } print.summary.aov <- function (x, digits = max(3, .Options$digits - 3), roundfun = round, ...) { cat("\nCall:\n") cat(paste(deparse(x$call), sep = "\n", collapse = "\n"), "\n\n", sep = "") resid <- x$residuals df <- x$df rdf <- df[2] if (rdf > 5) { cat("Residuals:\n") if (length(dim(resid)) == 2) { rq <- apply(t(resid), 1, quantile) dimnames(rq) <- list(c("Min", "1Q", "Median", "3Q", "Max"), dimnames(resid)[[2]]) } else { rq <- quantile(as.numeric(resid)) names(rq) <- c("Min", "1Q", "Median", "3Q", "Max") } print(rq, digits = digits, ...) } else if (rdf > 0) { cat("Residuals:\n") print(resid, digits = digits, ...) } # if (nsingular <- df[3] - df[1]) # cat("\nCoefficients: (", nsingular, " not defined because of singularities)\n", # sep = "") # else cat("\nCoefficients:\n") print(roundfun(x$coefficients, digits = digits), quote = FALSE, ...) cat("\nResidual standard error:", format(signif(x$sigma, digits)), "on", rdf, "degrees of freedom\n") if (!is.null(x$fstatistic)) { cat("Multiple R-Squared:", format(signif(x$r.squared, digits))) cat(",\tAdjusted R-squared:", format(signif(x$adj.r.squared, digits)), "\n") cat("F-statistic:", format(signif(x$fstatistic[1], digits)), "on", x$fstatistic[2], "and", x$fstatistic[3], "degrees of freedom") cat(",\tp-value:", format(signif(1 - pf(x$fstatistic[1], x$fstatistic[2], x$fstatistic[3]), digits)), "\n") } correl <- x$correlation if (!is.null(correl)) { p <- dim(correl)[2] if (p > 1) { cat("\nCorrelation of Coefficients:\n") correl[!lower.tri(correl)] <- NA print(correl[-1, -NCOL(correl)], digits = digits, na = "") } } cat("\n") invisible(x) } #------------------------------------------------------------------------- ## Rd file: \name{aov.bal} \title{Function to compute analysis of variance for balanced designs} \usage{ aov.bal(formula, data=list(), subset, weights, na.action, method="qr", model=TRUE, keep.order=F, ...) } %- maybe also `usage' for other functions documented here. \alias{aov.bal} \alias{summary.aov} \alias{coefficients.lm} \alias{df.residual.lm} \alias{fitted.values.lm} \alias{residuals.lm} \alias{aov.balanced.fit} \alias{print.aov} \alias{print.summary.aov} \keyword{regression} %- Also NEED an `\alias' for EACH other function documented here. \arguments{ \item{formula}{A detailed description of the model to be fit. See details under \code{\link{lm}} } \item{data}{ data frame contining the data to be fit. Default is working directory.} \item{subset}{Description of what rows of the data frame to use in the fit. } \item{na.action}{ Currently not used } \item{method}{Not currently used, except that method="model.frame" will extract the model frame only and not fit the anova.} \item{model}{ If true, model frame is included in the output.} \item{keep.order}{If true, model should be evaluated in the order specified by the model statement. (Doesn't work.)} \item{\dots}{ other arguments (not presently used) } } \description{ \code{aov.bal} is used to fit analysis of variance models. Eventually it will handle multiple strata. The generic accessor functions \code{coefficients}, \code{fitted.values} and \code{residuals} can be used to extract various useful features of the value returned by \code{aov.bal}. Currently \code{aov.bal} uses only the Helmert contrasts and ignores the contrast functions set by options()$contrast. } \value{ \code{aov.bal} returns an object of class code\aov} which may be printed with the \code{print.aov} or \code{summary.aov}. \item{coefficients}{Coefficients under the Helmert contrasts.} \item{residuals} {Residuals} \item{effects } {effects ?=? coefficients?} \item{rank} \item{fitted.values} \item{assign} \item{var } \item{df.residual} \item{df} \item{call} \item{terms} } \references{ Heiberger, R. M., 1989, Computation for the Analysis of Designed Experiments, Wiley} \author{ Jim Robison-Cox } \seealso{ \code{\link{lm}},\code{\link{glm}},\code{\link{coefficients}},\code{\link{residuals}},\code{\link{fitted.values}} } \examples{ speed <- c(59, 69, 66, 75, 54, 65, 58, 70, 66, 77, 71, 71, 58, 65, 56, 62, 54, 60, 52, 64, 59, 73, 60, 70) typists <- gl(6,4,24) brands <- gl(4,1,24) type.aov <- aov.bal(speed ~ typists + brands) summary(type.aov) } \keyword{ regression }%-- one or more ... \keyword{models} -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._