doorz
2016-Feb-27 09:38 UTC
[R] Predicting correlated responses using canonical correlations in multivariate least squares
I have a considerable interest in trying to improve the predictions from
10+ highly correlated predictors to two (2) highly correlated responses.
OLS does not allow me to take into account the correlation between the
responses. Multivariate can take into account the correlations between
the predictors.
I have tried to use canonical correlations (R function cancor).
Canonical correlations nicely show the interrelations between the
responses and predictors. Predicting the responses using cancor is a
different matter. There is no predict method that supports cancor.
Simple backsubstition leads to improved correlations but poor pairwise
correspondence. (R code I use is below, for a 2 predictor, 2 response
set of data, this is a book example (Frets, 1921 headsize data), my data
is similar but can't be summarized in this post as easily)
The 1997 paper by Breiman and Friedman gives a procedure using shrinkage
on least squares estimates obtained from canonical variates and their
origins (the responses). They call this Curds and Whey. See this post.
ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/multiResponse.pdf
There is a master thesis (Kidd) which uses the above and promises an R
package to handle the curds and whey procedures. That R package does not
seem to have been made public though. See here:
http://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=3183&context=etd
Since the Breimam and Friedman paper is almost 20 years old I am
wondering about their procedure. One would think that it would have been
adopted if it is as powerful as it is claimed to be since correlated
dependents are quite common. Is there a flaw in it that I do not know
of? Is there anything better available today?
Any tips are most welcome.
Regards,
Alex van der Spek
#! usr/env/bin R
#CCA example try out
#Use data on head size of sons/brothers length and breadth
#Taken from the book by Everitt
#Clean slate
rm(list=ls())
## Panel function, put histograms on the diagonal
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE, breaks=21)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="white", ...)
}
## Panel function, put correlation in top
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
#Bring up file selection box, set working directory and get base name
fn <- file.choose()
fs <- basename(fn)
fp <- dirname(fn)
#Read data in by csv
hs.r <- read.csv(fn)
#This is the entire data set, 25 points, two predictors (x) and two
responses (y)> head(hs.r, 25)
x1 x2 y1 y2
1 191 155 179 145
2 195 149 201 152
3 181 148 185 149
4 183 153 188 149
5 176 144 171 142
6 208 157 192 152
7 189 150 190 149
8 197 159 189 152
9 188 152 197 159
10 192 150 187 151
11 179 158 186 148
12 183 147 174 147
13 174 150 185 152
14 190 159 195 157
15 188 151 187 158
16 163 137 161 130
17 195 155 183 158
18 186 153 173 148
19 181 145 182 146
20 175 140 165 137
21 192 154 185 152
22 174 143 178 147
23 176 139 176 143
24 197 167 200 158
25 190 163 187 150
#Make a scatter plot matrix
pairs(hs.r ,lower.panel = panel.smooth, upper.panel = panel.cor,
diag.panel = panel.hist)
#Sweep out summary stats
hs.s <- scale(hs.r, center=F, scale=T)
hs.c <- scale(hs.r, center=T, scale=F)
hs.a <- scale(hs.r, center=T, scale=T)
#Make a scatter plot matrix of centered and scaled, standardized data
pairs(hs.a ,lower.panel = panel.smooth, upper.panel = panel.cor,
diag.panel = panel.hist)
hs.1 <- hs.a[,1:2]
hs.2 <- hs.a[,3:4]
cc <- cancor(hs.1, hs.2, xcenter=F, ycenter=F)
#Compute the decompositon
#u1 <- cc$xcoef[1,1] * hs.1[,'x1'] + cc$xcoef[1,2] *
hs.1[,'x2']
#v1 <- cc$ycoef[1,1] * hs.2[,'y1'] + cc$ycoef[1,2] *
hs.2[,'y2']
#u2 <- cc$xcoef[2,1] * hs.1[,'x1'] + cc$xcoef[2,2] *
hs.1[,'x2']
#v2 <- cc$ycoef[2,1] * hs.2[,'y1'] + cc$ycoef[2,2] *
hs.2[,'y2']
#Quicker and better
uu <- hs.1 %*% cc$xcoef
vv <- hs.2 %*% cc$ycoef
#Store in a dataframe
uv <- data.frame(u1=uu[,1], v1=vv[,1], u2=uu[,2], v2=vv[,2])
#Make a scatter plot matrix
pairs(uv ,lower.panel = panel.smooth, upper.panel = panel.cor,
diag.panel = panel.hist)
#Make a linear model, the significant coefs will prove to be the
eigenvalues
lm.uv <- lm(cbind(u1, u2) ~ v1 + v2 - 1, data = uv)
summary(lm.uv)
#Notation x * k * A == y * B, solve for y
A <- cc$xcoef
B <- cc$ycoef
r <- cc$cor
#Compute the back substitioon
C <- A %*% diag(r) %*% solve(B)
yy <- hs.1 %*% C
colnames(yy) <- c('yy1', 'yy2')
#Show how this relates to the original responses, not bad
pairs(cbind(hs.1, hs.2, yy), lower.panel = panel.smooth, upper.panel =
panel.cor, d
#But pairwise the correspondence is poor.
head(cbind(hs.2, yy)
> head(cbind(hs.2, yy))
y1 y2 yy1 yy2
[1,] -0.4820596 -0.63189808 0.43234091 0.434382250
[2,] 1.7091204 0.41132988 0.30938540 0.259340111
[3,] 0.1155349 -0.03576782 -0.36892523 -0.368659286
[4,] 0.4143322 -0.03576782 -0.02725575 -0.005036034
[5,] -1.2788523 -1.07899577 -0.79475501 -0.798376574
[6,] 0.8127286 0.41132988 1.29559865 1.241261763
Michael Friendly
2016-Feb-27 18:34 UTC
[R] Predicting correlated responses using canonical correlations in multivariate least squares
Hi Alex, Thanks for the detailed explanation and the reproducible example. But it is still not clear exactly what you wish to accomplish. You know how to calculate the scores on the canonical variates. These could be considered 'predicted scores', but in canonical space. What's wrong with that? But it seems that what you want are predicted values for y1, y2, ... from the canonical variates for the x variables ??? I don't think your question is sufficiently well-posed, but you might look at the literature on *redundancy analysis*, which takes the symmetric (Y, X) canonical correlation problem and re-formulates it into questions of how well the Ys (or Xs) are predicted. Is the problem just that the predictors and the responses are both highly correlated within each set? If so, you should probably find that there is only one strong canonical correlation, as in your example, and be done with it. You might also find that an HE plot (library (heplots)) is illuminating. The Breiman/Friedman paper you refer to can't be read on my system, and, in any case, applying shrinkage to the problem is something to consider once you have your ducks in order. best, -Michael On 2/27/2016 4:38 AM, doorz wrote:> I have a considerable interest in trying to improve the predictions from > 10+ highly correlated predictors to two (2) highly correlated responses. > > OLS does not allow me to take into account the correlation between the > responses. Multivariate can take into account the correlations between > the predictors. > > I have tried to use canonical correlations (R function cancor). > Canonical correlations nicely show the interrelations between the > responses and predictors. Predicting the responses using cancor is a > different matter. There is no predict method that supports cancor. > Simple backsubstition leads to improved correlations but poor pairwise > correspondence. (R code I use is below, for a 2 predictor, 2 response > set of data, this is a book example (Frets, 1921 headsize data), my data > is similar but can't be summarized in this post as easily) > > The 1997 paper by Breiman and Friedman gives a procedure using shrinkage > on least squares estimates obtained from canonical variates and their > origins (the responses). They call this Curds and Whey. See this post. > > ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/multiResponse.pdf > > > There is a master thesis (Kidd) which uses the above and promises an R > package to handle the curds and whey procedures. That R package does not > seem to have been made public though. See here: > > http://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=3183&context=etd > > Since the Breimam and Friedman paper is almost 20 years old I am > wondering about their procedure. One would think that it would have been > adopted if it is as powerful as it is claimed to be since correlated > dependents are quite common. Is there a flaw in it that I do not know > of? Is there anything better available today? > > Any tips are most welcome. > > Regards, > Alex van der Spek > > #! usr/env/bin R > #CCA example try out > #Use data on head size of sons/brothers length and breadth > #Taken from the book by Everitt > > #Clean slate > rm(list=ls()) > > ## Panel function, put histograms on the diagonal > panel.hist <- function(x, ...) > { > usr <- par("usr"); on.exit(par(usr)) > par(usr = c(usr[1:2], 0, 1.5) ) > h <- hist(x, plot = FALSE, breaks=21) > breaks <- h$breaks; nB <- length(breaks) > y <- h$counts; y <- y/max(y) > rect(breaks[-nB], 0, breaks[-1], y, col="white", ...) > } > > ## Panel function, put correlation in top > panel.cor <- function(x, y, digits=2, prefix="", cex.cor) > { > usr <- par("usr"); on.exit(par(usr)) > par(usr = c(0, 1, 0, 1)) > r <- abs(cor(x, y)) > txt <- format(c(r, 0.123456789), digits=digits)[1] > txt <- paste(prefix, txt, sep="") > if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) > text(0.5, 0.5, txt, cex = cex.cor * r) > } > > #Bring up file selection box, set working directory and get base name > fn <- file.choose() > fs <- basename(fn) > fp <- dirname(fn) > > #Read data in by csv > hs.r <- read.csv(fn) > > > #This is the entire data set, 25 points, two predictors (x) and two > responses (y) >> head(hs.r, 25) > x1 x2 y1 y2 > 1 191 155 179 145 > 2 195 149 201 152 > 3 181 148 185 149 > 4 183 153 188 149 > 5 176 144 171 142 > 6 208 157 192 152 > 7 189 150 190 149 > 8 197 159 189 152 > 9 188 152 197 159 > 10 192 150 187 151 > 11 179 158 186 148 > 12 183 147 174 147 > 13 174 150 185 152 > 14 190 159 195 157 > 15 188 151 187 158 > 16 163 137 161 130 > 17 195 155 183 158 > 18 186 153 173 148 > 19 181 145 182 146 > 20 175 140 165 137 > 21 192 154 185 152 > 22 174 143 178 147 > 23 176 139 176 143 > 24 197 167 200 158 > 25 190 163 187 150 > > > #Make a scatter plot matrix > pairs(hs.r ,lower.panel = panel.smooth, upper.panel = panel.cor, > diag.panel = panel.hist) > > #Sweep out summary stats > hs.s <- scale(hs.r, center=F, scale=T) > hs.c <- scale(hs.r, center=T, scale=F) > hs.a <- scale(hs.r, center=T, scale=T) > > #Make a scatter plot matrix of centered and scaled, standardized data > pairs(hs.a ,lower.panel = panel.smooth, upper.panel = panel.cor, > diag.panel = panel.hist) > hs.1 <- hs.a[,1:2] > hs.2 <- hs.a[,3:4] > > cc <- cancor(hs.1, hs.2, xcenter=F, ycenter=F) > > #Compute the decompositon > #u1 <- cc$xcoef[1,1] * hs.1[,'x1'] + cc$xcoef[1,2] * hs.1[,'x2'] > #v1 <- cc$ycoef[1,1] * hs.2[,'y1'] + cc$ycoef[1,2] * hs.2[,'y2'] > #u2 <- cc$xcoef[2,1] * hs.1[,'x1'] + cc$xcoef[2,2] * hs.1[,'x2'] > #v2 <- cc$ycoef[2,1] * hs.2[,'y1'] + cc$ycoef[2,2] * hs.2[,'y2'] > > #Quicker and better > uu <- hs.1 %*% cc$xcoef > vv <- hs.2 %*% cc$ycoef > > #Store in a dataframe > uv <- data.frame(u1=uu[,1], v1=vv[,1], u2=uu[,2], v2=vv[,2]) > > #Make a scatter plot matrix > pairs(uv ,lower.panel = panel.smooth, upper.panel = panel.cor, > diag.panel = panel.hist) > > #Make a linear model, the significant coefs will prove to be the > eigenvalues > lm.uv <- lm(cbind(u1, u2) ~ v1 + v2 - 1, data = uv) > summary(lm.uv) > > #Notation x * k * A == y * B, solve for y > A <- cc$xcoef > B <- cc$ycoef > r <- cc$cor > > #Compute the back substitioon > C <- A %*% diag(r) %*% solve(B) > > yy <- hs.1 %*% C > colnames(yy) <- c('yy1', 'yy2') > > #Show how this relates to the original responses, not bad > pairs(cbind(hs.1, hs.2, yy), lower.panel = panel.smooth, upper.panel > panel.cor, d > > #But pairwise the correspondence is poor. > head(cbind(hs.2, yy) > >> head(cbind(hs.2, yy)) > y1 y2 yy1 yy2 > [1,] -0.4820596 -0.63189808 0.43234091 0.434382250 > [2,] 1.7091204 0.41132988 0.30938540 0.259340111 > [3,] 0.1155349 -0.03576782 -0.36892523 -0.368659286 > [4,] 0.4143322 -0.03576782 -0.02725575 -0.005036034 > [5,] -1.2788523 -1.07899577 -0.79475501 -0.798376574 > [6,] 0.8127286 0.41132988 1.29559865 1.241261763 >
Michael Friendly
2016-Feb-27 19:04 UTC
[R] Predicting correlated responses using canonical correlations in multivariate least squares
On 2/27/2016 1:34 PM, Michael Friendly wrote:> You might also find that an HE plot (library (heplots)) > is illuminating.Follow-up: Try the following with your example library(heplots) hs.mod <- lm(cbind(y1, y2) ~ x1 + x2, data=hs.r) heplot(hs.mod, fill=TRUE) uv.mod <- lm(cbind(u1, u2) ~ v1 + v2, data = uv) heplot(uv.mod, fill=TRUE, asp=1) In the first plot, x1, x2 are highly correlated, but neither is individually significant in predicting y1, y2 by Roy's test. The second plot is the transformation to canonical space, where (u1, u2), (v1, v2) are uncorrelated and v1 is a highly significant predictor of u1. -Michael