thr3ads.net - R help - [R] Predicting correlated responses using canonical correlations in multivariate least squares [Feb 2016]

If this information is useful, please help other people find it:
Share via:

doorz

2016-Feb-27 09:38 UTC

[R] Predicting correlated responses using canonical correlations in multivariate least squares

I have a considerable interest in trying to improve the predictions from 
10+ highly correlated predictors to two (2) highly correlated responses.

OLS does not allow me to take into account the correlation between the 
responses. Multivariate can take into account the correlations between 
the predictors.

I have tried to use canonical correlations (R function cancor). 
Canonical correlations nicely show the interrelations between the 
responses and predictors. Predicting the responses using cancor is a 
different matter. There is no predict method that supports cancor. 
Simple backsubstition leads to improved correlations but poor pairwise 
correspondence. (R code I use is below, for a 2 predictor, 2 response 
set of data, this is a book example (Frets, 1921 headsize data), my data 
is similar but can't be summarized in this post as easily)

The 1997 paper by Breiman and Friedman gives a procedure using shrinkage 
on least squares estimates obtained from canonical variates and their 
origins (the responses). They call this Curds and Whey. See  this post.

ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/multiResponse.pdf

There is a master thesis (Kidd) which uses the above and promises an R 
package to handle the curds and whey procedures. That R package does not 
seem to have been made public though. See here:

http://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=3183&context=etd

Since the Breimam and Friedman paper is almost 20 years old I am 
wondering about their procedure. One would think that it would have been 
adopted if it is as powerful as it is claimed to be since correlated 
dependents are quite common. Is there a flaw in it that I do not know 
of? Is there anything better available today?

Any tips are most welcome.

Regards,
Alex van der Spek

#! usr/env/bin R
#CCA example try out
#Use data on head size of sons/brothers length and breadth
#Taken from the book by Everitt

#Clean slate
rm(list=ls())

## Panel function, put histograms on the diagonal
panel.hist <- function(x, ...)
{
     usr <- par("usr"); on.exit(par(usr))
     par(usr = c(usr[1:2], 0, 1.5) )
     h <- hist(x, plot = FALSE, breaks=21)
     breaks <- h$breaks; nB <- length(breaks)
     y <- h$counts; y <- y/max(y)
     rect(breaks[-nB], 0, breaks[-1], y, col="white", ...)
}

## Panel function, put correlation in top
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
     usr <- par("usr"); on.exit(par(usr))
     par(usr = c(0, 1, 0, 1))
     r <- abs(cor(x, y))
     txt <- format(c(r, 0.123456789), digits=digits)[1]
     txt <- paste(prefix, txt, sep="")
     if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
     text(0.5, 0.5, txt, cex = cex.cor * r)
}

#Bring up file selection box, set working directory and get base name
fn <- file.choose()
fs <- basename(fn)
fp <- dirname(fn)

#Read data in by csv
hs.r <- read.csv(fn)


#This is the entire data set, 25 points, two predictors (x) and two 
responses (y)> head(hs.r, 25)     x1  x2  y1  y2
1  191 155 179 145
2  195 149 201 152
3  181 148 185 149
4  183 153 188 149
5  176 144 171 142
6  208 157 192 152
7  189 150 190 149
8  197 159 189 152
9  188 152 197 159
10 192 150 187 151
11 179 158 186 148
12 183 147 174 147
13 174 150 185 152
14 190 159 195 157
15 188 151 187 158
16 163 137 161 130
17 195 155 183 158
18 186 153 173 148
19 181 145 182 146
20 175 140 165 137
21 192 154 185 152
22 174 143 178 147
23 176 139 176 143
24 197 167 200 158
25 190 163 187 150


#Make a scatter plot matrix
pairs(hs.r ,lower.panel = panel.smooth, upper.panel = panel.cor, 
diag.panel = panel.hist)

#Sweep out summary stats
hs.s <- scale(hs.r, center=F, scale=T)
hs.c <- scale(hs.r, center=T, scale=F)
hs.a <- scale(hs.r, center=T, scale=T)

#Make a scatter plot matrix of centered and scaled, standardized data
pairs(hs.a ,lower.panel = panel.smooth, upper.panel = panel.cor, 
diag.panel = panel.hist)
hs.1 <- hs.a[,1:2]
hs.2 <- hs.a[,3:4]

cc <- cancor(hs.1, hs.2, xcenter=F, ycenter=F)

#Compute the decompositon
#u1 <- cc$xcoef[1,1] * hs.1[,'x1'] + cc$xcoef[1,2] *
hs.1[,'x2']
#v1 <- cc$ycoef[1,1] * hs.2[,'y1'] + cc$ycoef[1,2] *
hs.2[,'y2']
#u2 <- cc$xcoef[2,1] * hs.1[,'x1'] + cc$xcoef[2,2] *
hs.1[,'x2']
#v2 <- cc$ycoef[2,1] * hs.2[,'y1'] + cc$ycoef[2,2] *
hs.2[,'y2']

#Quicker and better
uu <- hs.1 %*% cc$xcoef
vv <- hs.2 %*% cc$ycoef

#Store in a dataframe
uv <- data.frame(u1=uu[,1], v1=vv[,1], u2=uu[,2], v2=vv[,2])

#Make a scatter plot matrix
pairs(uv ,lower.panel = panel.smooth, upper.panel = panel.cor, 
diag.panel = panel.hist)

#Make a linear model, the significant coefs will prove to be the 
eigenvalues
lm.uv <- lm(cbind(u1, u2) ~ v1 + v2 - 1, data = uv)
summary(lm.uv)

#Notation  x * k * A  == y * B, solve for y
A <- cc$xcoef
B <- cc$ycoef
r <- cc$cor

#Compute the back substitioon
C <-  A %*% diag(r) %*% solve(B)

yy <- hs.1 %*% C
colnames(yy) <- c('yy1', 'yy2')

#Show how this relates to the original responses, not bad
pairs(cbind(hs.1, hs.2, yy), lower.panel = panel.smooth, upper.panel = 
panel.cor, d

#But pairwise the correspondence is poor.
head(cbind(hs.2, yy)
> head(cbind(hs.2, yy))              y1          y2         yy1          yy2
[1,] -0.4820596 -0.63189808  0.43234091  0.434382250
[2,]  1.7091204  0.41132988  0.30938540  0.259340111
[3,]  0.1155349 -0.03576782 -0.36892523 -0.368659286
[4,]  0.4143322 -0.03576782 -0.02725575 -0.005036034
[5,] -1.2788523 -1.07899577 -0.79475501 -0.798376574
[6,]  0.8127286  0.41132988  1.29559865  1.241261763

Michael Friendly

2016-Feb-27 18:34 UTC

head link

[R] Predicting correlated responses using canonical correlations in multivariate least squares

Hi Alex,

Thanks for the detailed explanation and the reproducible example.
But it is still not clear exactly what you wish to accomplish.

You know how to calculate the scores on the canonical variates.
These could be considered 'predicted scores', but in canonical
space.  What's wrong with that?
But it seems that what you want are predicted values
for y1, y2, ... from the canonical variates for the x variables ???

I don't think your question is sufficiently well-posed, but you
might look at the literature on *redundancy analysis*, which takes
the symmetric (Y, X) canonical correlation problem and re-formulates
it into questions of how well the Ys (or Xs) are predicted.

Is the problem just that the predictors and the responses
are both highly correlated within each set?  If so, you
should probably find that there is only one strong canonical
correlation, as in your example, and be done with it.
You might also find that an HE plot (library (heplots))
is illuminating.

The Breiman/Friedman paper you refer to can't be read on my
system, and, in any case, applying shrinkage to the problem
is something to consider once you have your ducks in order.

best,
-Michael



On 2/27/2016 4:38 AM, doorz wrote:> I have a considerable interest in trying to improve the predictions from
> 10+ highly correlated predictors to two (2) highly correlated responses.
>
> OLS does not allow me to take into account the correlation between the
> responses. Multivariate can take into account the correlations between
> the predictors.
>
> I have tried to use canonical correlations (R function cancor).
> Canonical correlations nicely show the interrelations between the
> responses and predictors. Predicting the responses using cancor is a
> different matter. There is no predict method that supports cancor.
> Simple backsubstition leads to improved correlations but poor pairwise
> correspondence. (R code I use is below, for a 2 predictor, 2 response
> set of data, this is a book example (Frets, 1921 headsize data), my data
> is similar but can't be summarized in this post as easily)
>
> The 1997 paper by Breiman and Friedman gives a procedure using shrinkage
> on least squares estimates obtained from canonical variates and their
> origins (the responses). They call this Curds and Whey. See  this post.
>
>
ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/multiResponse.pdf
>
>
> There is a master thesis (Kidd) which uses the above and promises an R
> package to handle the curds and whey procedures. That R package does not
> seem to have been made public though. See here:
>
>
http://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=3183&context=etd
>
> Since the Breimam and Friedman paper is almost 20 years old I am
> wondering about their procedure. One would think that it would have been
> adopted if it is as powerful as it is claimed to be since correlated
> dependents are quite common. Is there a flaw in it that I do not know
> of? Is there anything better available today?
>
> Any tips are most welcome.
>
> Regards,
> Alex van der Spek
>
> #! usr/env/bin R
> #CCA example try out
> #Use data on head size of sons/brothers length and breadth
> #Taken from the book by Everitt
>
> #Clean slate
> rm(list=ls())
>
> ## Panel function, put histograms on the diagonal
> panel.hist <- function(x, ...)
> {
>      usr <- par("usr"); on.exit(par(usr))
>      par(usr = c(usr[1:2], 0, 1.5) )
>      h <- hist(x, plot = FALSE, breaks=21)
>      breaks <- h$breaks; nB <- length(breaks)
>      y <- h$counts; y <- y/max(y)
>      rect(breaks[-nB], 0, breaks[-1], y, col="white", ...)
> }
>
> ## Panel function, put correlation in top
> panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
> {
>      usr <- par("usr"); on.exit(par(usr))
>      par(usr = c(0, 1, 0, 1))
>      r <- abs(cor(x, y))
>      txt <- format(c(r, 0.123456789), digits=digits)[1]
>      txt <- paste(prefix, txt, sep="")
>      if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
>      text(0.5, 0.5, txt, cex = cex.cor * r)
> }
>
> #Bring up file selection box, set working directory and get base name
> fn <- file.choose()
> fs <- basename(fn)
> fp <- dirname(fn)
>
> #Read data in by csv
> hs.r <- read.csv(fn)
>
>
> #This is the entire data set, 25 points, two predictors (x) and two
> responses (y)
>> head(hs.r, 25)
>      x1  x2  y1  y2
> 1  191 155 179 145
> 2  195 149 201 152
> 3  181 148 185 149
> 4  183 153 188 149
> 5  176 144 171 142
> 6  208 157 192 152
> 7  189 150 190 149
> 8  197 159 189 152
> 9  188 152 197 159
> 10 192 150 187 151
> 11 179 158 186 148
> 12 183 147 174 147
> 13 174 150 185 152
> 14 190 159 195 157
> 15 188 151 187 158
> 16 163 137 161 130
> 17 195 155 183 158
> 18 186 153 173 148
> 19 181 145 182 146
> 20 175 140 165 137
> 21 192 154 185 152
> 22 174 143 178 147
> 23 176 139 176 143
> 24 197 167 200 158
> 25 190 163 187 150
>
>
> #Make a scatter plot matrix
> pairs(hs.r ,lower.panel = panel.smooth, upper.panel = panel.cor,
> diag.panel = panel.hist)
>
> #Sweep out summary stats
> hs.s <- scale(hs.r, center=F, scale=T)
> hs.c <- scale(hs.r, center=T, scale=F)
> hs.a <- scale(hs.r, center=T, scale=T)
>
> #Make a scatter plot matrix of centered and scaled, standardized data
> pairs(hs.a ,lower.panel = panel.smooth, upper.panel = panel.cor,
> diag.panel = panel.hist)
> hs.1 <- hs.a[,1:2]
> hs.2 <- hs.a[,3:4]
>
> cc <- cancor(hs.1, hs.2, xcenter=F, ycenter=F)
>
> #Compute the decompositon
> #u1 <- cc$xcoef[1,1] * hs.1[,'x1'] + cc$xcoef[1,2] *
hs.1[,'x2']
> #v1 <- cc$ycoef[1,1] * hs.2[,'y1'] + cc$ycoef[1,2] *
hs.2[,'y2']
> #u2 <- cc$xcoef[2,1] * hs.1[,'x1'] + cc$xcoef[2,2] *
hs.1[,'x2']
> #v2 <- cc$ycoef[2,1] * hs.2[,'y1'] + cc$ycoef[2,2] *
hs.2[,'y2']
>
> #Quicker and better
> uu <- hs.1 %*% cc$xcoef
> vv <- hs.2 %*% cc$ycoef
>
> #Store in a dataframe
> uv <- data.frame(u1=uu[,1], v1=vv[,1], u2=uu[,2], v2=vv[,2])
>
> #Make a scatter plot matrix
> pairs(uv ,lower.panel = panel.smooth, upper.panel = panel.cor,
> diag.panel = panel.hist)
>
> #Make a linear model, the significant coefs will prove to be the
> eigenvalues
> lm.uv <- lm(cbind(u1, u2) ~ v1 + v2 - 1, data = uv)
> summary(lm.uv)
>
> #Notation  x * k * A  == y * B, solve for y
> A <- cc$xcoef
> B <- cc$ycoef
> r <- cc$cor
>
> #Compute the back substitioon
> C <-  A %*% diag(r) %*% solve(B)
>
> yy <- hs.1 %*% C
> colnames(yy) <- c('yy1', 'yy2')
>
> #Show how this relates to the original responses, not bad
> pairs(cbind(hs.1, hs.2, yy), lower.panel = panel.smooth, upper.panel >
panel.cor, d
>
> #But pairwise the correspondence is poor.
> head(cbind(hs.2, yy)
>
>> head(cbind(hs.2, yy))
>               y1          y2         yy1          yy2
> [1,] -0.4820596 -0.63189808  0.43234091  0.434382250
> [2,]  1.7091204  0.41132988  0.30938540  0.259340111
> [3,]  0.1155349 -0.03576782 -0.36892523 -0.368659286
> [4,]  0.4143322 -0.03576782 -0.02725575 -0.005036034
> [5,] -1.2788523 -1.07899577 -0.79475501 -0.798376574
> [6,]  0.8127286  0.41132988  1.29559865  1.241261763
>

Michael Friendly

2016-Feb-27 19:04 UTC

head link

[R] Predicting correlated responses using canonical correlations in multivariate least squares

On 2/27/2016 1:34 PM, Michael Friendly wrote:> You might also find that an HE plot (library (heplots))
> is illuminating.
Follow-up:  Try the following with your example

library(heplots)
hs.mod <- lm(cbind(y1, y2) ~ x1 + x2, data=hs.r)
heplot(hs.mod, fill=TRUE)

uv.mod <- lm(cbind(u1, u2) ~ v1 + v2, data = uv)
heplot(uv.mod, fill=TRUE, asp=1)

In the first plot, x1, x2 are highly correlated, but neither is 
individually significant in predicting y1, y2 by Roy's test.

The second plot is the transformation to canonical space,
where (u1, u2), (v1, v2) are uncorrelated and v1 is a highly
significant predictor of u1.

-Michael

R help - Feb 2016 - Predicting correlated responses using canonical correlations in multivariate least squares

[R] Predicting correlated responses using canonical correlations in multivariate least squares

[R] Predicting correlated responses using canonical correlations in multivariate least squares

[R] Predicting correlated responses using canonical correlations in multivariate least squares