doorz
2016-Feb-27 09:38 UTC
[R] Predicting correlated responses using canonical correlations in multivariate least squares
I have a considerable interest in improving the predictions from 10+ highly correlated predictors to two (2) highly correlated responses.

OLS does not allow me to take the correlation between the responses into account. Multivariate least squares can take the correlations between the predictors into account. I have tried canonical correlations (R function cancor). Canonical correlations nicely show the interrelations between the responses and the predictors. Predicting the responses using cancor is a different matter: there is no predict method that supports cancor. Simple back substitution leads to improved correlations but poor pairwise correspondence.

(The R code I use is below, for a 2 predictor, 2 response set of data. This is a book example (Frets, 1921 head size data); my data is similar but cannot be summarized in this post as easily.)

The 1997 paper by Breiman and Friedman gives a procedure that applies shrinkage to least squares estimates obtained from canonical variates and their origins (the responses). They call it Curds and Whey. See:

ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/multiResponse.pdf

There is a master's thesis (Kidd) which uses the above and promises an R package to handle the curds-and-whey procedure. That R package does not seem to have been made public, though. See here:

http://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=3183&context=etd

Since the Breiman and Friedman paper is almost 20 years old, I am wondering about their procedure. One would think it would have been widely adopted if it is as powerful as claimed, since correlated dependents are quite common. Is there a flaw in it that I do not know of? Is there anything better available today?

Any tips are most welcome.

Regards,
Alex van der Spek

#!/usr/bin/env Rscript
#CCA example try out
#Use data on head size of sons/brothers length and breadth
#Taken from the book by Everitt

#Clean slate
rm(list = ls())

## Panel function, put histograms on the diagonal
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE, breaks = 21)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "white", ...)
}

## Panel function, put correlation in top
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste(prefix, txt, sep = "")
    if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

#Bring up file selection box, set working directory and get base name
fn <- file.choose()
fs <- basename(fn)
fp <- dirname(fn)

#Read data in by csv
hs.r <- read.csv(fn)

#This is the entire data set, 25 points, two predictors (x) and two responses (y)
> head(hs.r, 25)
    x1  x2  y1  y2
1  191 155 179 145
2  195 149 201 152
3  181 148 185 149
4  183 153 188 149
5  176 144 171 142
6  208 157 192 152
7  189 150 190 149
8  197 159 189 152
9  188 152 197 159
10 192 150 187 151
11 179 158 186 148
12 183 147 174 147
13 174 150 185 152
14 190 159 195 157
15 188 151 187 158
16 163 137 161 130
17 195 155 183 158
18 186 153 173 148
19 181 145 182 146
20 175 140 165 137
21 192 154 185 152
22 174 143 178 147
23 176 139 176 143
24 197 167 200 158
25 190 163 187 150

#Make a scatter plot matrix
pairs(hs.r, lower.panel = panel.smooth, upper.panel = panel.cor,
      diag.panel = panel.hist)

#Sweep out summary stats
hs.s <- scale(hs.r, center = F, scale = T)
hs.c <- scale(hs.r, center = T, scale = F)
hs.a <- scale(hs.r, center = T, scale = T)

#Make a scatter plot matrix of the centered and scaled, standardized data
pairs(hs.a, lower.panel = panel.smooth, upper.panel = panel.cor,
      diag.panel = panel.hist)

hs.1 <- hs.a[, 1:2]
hs.2 <- hs.a[, 3:4]

cc <- cancor(hs.1, hs.2, xcenter = F, ycenter = F)

#Compute the decomposition
#u1 <- cc$xcoef[1,1] * hs.1[,'x1'] + cc$xcoef[1,2] * hs.1[,'x2']
#v1 <- cc$ycoef[1,1] * hs.2[,'y1'] + cc$ycoef[1,2] * hs.2[,'y2']
#u2 <- cc$xcoef[2,1] * hs.1[,'x1'] + cc$xcoef[2,2] * hs.1[,'x2']
#v2 <- cc$ycoef[2,1] * hs.2[,'y1'] + cc$ycoef[2,2] * hs.2[,'y2']

#Quicker and better
uu <- hs.1 %*% cc$xcoef
vv <- hs.2 %*% cc$ycoef

#Store in a data frame
uv <- data.frame(u1 = uu[,1], v1 = vv[,1], u2 = uu[,2], v2 = vv[,2])

#Make a scatter plot matrix
pairs(uv, lower.panel = panel.smooth, upper.panel = panel.cor,
      diag.panel = panel.hist)

#Make a linear model, the significant coefs will prove to be the eigenvalues
lm.uv <- lm(cbind(u1, u2) ~ v1 + v2 - 1, data = uv)
summary(lm.uv)

#Notation x * k * A == y * B, solve for y
A <- cc$xcoef
B <- cc$ycoef
r <- cc$cor

#Compute the back substitution
C <- A %*% diag(r) %*% solve(B)

yy <- hs.1 %*% C
colnames(yy) <- c('yy1', 'yy2')

#Show how this relates to the original responses, not bad
pairs(cbind(hs.1, hs.2, yy), lower.panel = panel.smooth,
      upper.panel = panel.cor, diag.panel = panel.hist)

#But pairwise the correspondence is poor.
> head(cbind(hs.2, yy))
             y1          y2         yy1          yy2
[1,] -0.4820596 -0.63189808  0.43234091  0.434382250
[2,]  1.7091204  0.41132988  0.30938540  0.259340111
[3,]  0.1155349 -0.03576782 -0.36892523 -0.368659286
[4,]  0.4143322 -0.03576782 -0.02725575 -0.005036034
[5,] -1.2788523 -1.07899577 -0.79475501 -0.798376574
[6,]  0.8127286  0.41132988  1.29559865  1.241261763
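For reference, the Curds and Whey recipe itself is short enough to sketch directly: fit OLS, rotate the fitted responses into the canonical coordinate system of the Y set, shrink each coordinate, and rotate back. The sketch below uses the population-form shrinkage factors d_i = c_i^2 / (c_i^2 + r(1 - c_i^2)) with r = p/n, which is my reading of Breiman and Friedman (1997); the function name curds_whey and the synthetic data are mine, so treat this as an illustration rather than a vetted implementation.

```r
# Sketch of Breiman & Friedman's Curds and Whey (1997), per my reading of
# the paper; function name and return format are my own choices.
curds_whey <- function(X, Y) {
  X <- scale(X, scale = FALSE)            # centre both blocks
  Y <- scale(Y, scale = FALSE)
  n <- nrow(X); p <- ncol(X)
  cc <- cancor(X, Y, xcenter = FALSE, ycenter = FALSE)
  c2 <- cc$cor^2                          # squared canonical correlations
  r  <- p / n
  d  <- c2 / (c2 + r * (1 - c2))          # shrinkage per canonical direction
  Tm <- t(cc$ycoef)                       # maps y into canonical coordinates
  B  <- solve(Tm) %*% diag(d) %*% Tm      # shrink in canonical space
  Yhat <- fitted(lm(Y ~ X - 1)) %*% t(B)  # shrunken OLS predictions
  list(fitted = Yhat, d = d)
}

# Example on synthetic data with correlated responses
set.seed(1)
X <- matrix(rnorm(200), 50, 4)
Y <- cbind(X %*% rnorm(4) + rnorm(50), X %*% rnorm(4) + rnorm(50))
cw <- curds_whey(X, Y)
```

A sanity check on the formula: as p/n goes to 0 the d_i approach 1, B approaches the identity, and the predictions collapse to plain OLS.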
Michael Friendly
2016-Feb-27 18:34 UTC
[R] Predicting correlated responses using canonical correlations in multivariate least squares
Hi Alex,

Thanks for the detailed explanation and the reproducible example, but it is still not clear exactly what you wish to accomplish. You know how to calculate the scores on the canonical variates. These could be considered 'predicted scores', but in canonical space. What's wrong with that? It seems, though, that what you want are predicted values for y1, y2, ... from the canonical variates for the x variables?

I don't think your question is sufficiently well posed, but you might look at the literature on *redundancy analysis*, which takes the symmetric (Y, X) canonical correlation problem and reformulates it into questions of how well the Ys (or Xs) are predicted.

Is the problem just that the predictors and the responses are both highly correlated within each set? If so, you should probably find that there is only one strong canonical correlation, as in your example, and be done with it. You might also find that an HE plot (library(heplots)) is illuminating.

The Breiman/Friedman paper you refer to can't be read on my system and, in any case, applying shrinkage to the problem is something to consider once you have your ducks in order.

best,
-Michael

On 2/27/2016 4:38 AM, doorz wrote:
> I have a considerable interest in trying to improve the predictions from
> 10+ highly correlated predictors to two (2) highly correlated responses.
>
> OLS does not allow me to take into account the correlation between the
> responses. Multivariate can take into account the correlations between
> the predictors.
>
> I have tried to use canonical correlations (R function cancor).
> Canonical correlations nicely show the interrelations between the
> responses and predictors. Predicting the responses using cancor is a
> different matter. There is no predict method that supports cancor.
> Simple backsubstition leads to improved correlations but poor pairwise
> correspondence.
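To make the redundancy-analysis suggestion concrete: the Stewart and Love (1968) redundancy index measures how much of the total variance of the Y set is reproduced through each canonical dimension. Below is a minimal base-R sketch; the function name redundancy_y and the synthetic data are mine, and the candisc package provides a packaged version of this statistic if I recall correctly.

```r
# Sketch of the Stewart & Love redundancy index: the proportion of total
# Y variance reproduced per canonical dimension. Illustration only.
redundancy_y <- function(X, Y) {
  Xs <- scale(X); Ys <- scale(Y)
  cc <- cancor(Xs, Ys, xcenter = FALSE, ycenter = FALSE)
  V  <- Ys %*% cc$ycoef                 # canonical variates of the Y set
  ld <- cor(Ys, V)                      # loadings of the Ys on those variates
  colMeans(ld^2) * cc$cor^2             # redundancy per canonical dimension
}

# Example on synthetic data
set.seed(42)
X <- matrix(rnorm(120), 40, 3)
Y <- cbind(X %*% rnorm(3) + rnorm(40), X %*% rnorm(3) + rnorm(40))
rd <- redundancy_y(X, Y)
sum(rd)   # total proportion of Y variance explained through the X set
```

The per-dimension values sum to at most 1, so a single dominant term is another way of seeing that only one canonical correlation matters.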
Michael Friendly
2016-Feb-27 19:04 UTC
[R] Predicting correlated responses using canonical correlations in multivariate least squares
On 2/27/2016 1:34 PM, Michael Friendly wrote:
> You might also find that an HE plot (library(heplots))
> is illuminating.

Follow-up: try the following with your example:

library(heplots)
hs.mod <- lm(cbind(y1, y2) ~ x1 + x2, data = hs.r)
heplot(hs.mod, fill = TRUE)

uv.mod <- lm(cbind(u1, u2) ~ v1 + v2, data = uv)
heplot(uv.mod, fill = TRUE, asp = 1)

In the first plot, x1 and x2 are highly correlated, but neither is individually significant in predicting (y1, y2) by Roy's test. The second plot is the transformation to canonical space, where (u1, u2) and (v1, v2) are uncorrelated and v1 is a highly significant predictor of u1.

-Michael
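A numeric counterpart to the first heplot: base R's summary.manova can report Roy's statistic per term. The sketch below re-enters the head-size data from the first post so it runs standalone; note these are sequential (Type I) tests, which may differ somewhat from the Type II tests I believe car/heplots use by default.

```r
# Per-term multivariate tests with Roy's statistic, base R only,
# using the 25-row head-size data posted earlier in the thread.
hs.r <- data.frame(
  x1 = c(191,195,181,183,176,208,189,197,188,192,179,183,174,190,188,
         163,195,186,181,175,192,174,176,197,190),
  x2 = c(155,149,148,153,144,157,150,159,152,150,158,147,150,159,151,
         137,155,153,145,140,154,143,139,167,163),
  y1 = c(179,201,185,188,171,192,190,189,197,187,186,174,185,195,187,
         161,183,173,182,165,185,178,176,200,187),
  y2 = c(145,152,149,149,142,152,149,152,159,151,148,147,152,157,158,
         130,158,148,146,137,152,147,143,158,150))
fit <- manova(cbind(y1, y2) ~ x1 + x2, data = hs.r)
summary(fit, test = "Roy")   # one row per term, sequential SS
```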