Ana Marija
2019-Nov-14 18:50 UTC
[R] Remove highly correlated variables from a data frame or matrix
Hello,

I have a data frame like this (a matrix):

> head(calc.rho)
            rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995
rs56192520      0.903     0.268     0.327     0.327     0.327     0.582
rs3764410       0.928     0.276     0.336     0.336     0.336     0.598
rs145984817     0.975     0.309     0.371     0.371     0.371     0.638
rs1807401       0.975     0.309     0.371     0.371     0.371     0.638
rs1807402       0.975     0.309     0.371     0.371     0.371     0.638
rs35350506      0.975     0.309     0.371     0.371     0.371     0.638
> dim(calc.rho)
[1] 246 246

I would like to remove from this data all highly correlated variables, i.e. those with a correlation of more than 0.8.

I tried this:

> data <- calc.rho[, !apply(calc.rho, 2, function(x) any(abs(x) > 0.80))]
> dim(data)
[1] 246   0

But this removes everything.

Can you please advise?

Thanks,
Ana
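A likely reason the attempt above drops every column, assuming calc.rho is a square correlation matrix with 1s on its diagonal (the post does not show the diagonal): each column then contains its own self-correlation, which always exceeds 0.8. The sketch below uses a made-up 3 x 3 matrix (the names rho, off, flag_all, flag_off and kept are illustrative, not from the thread) and blanks the diagonal before testing.

```r
# Toy symmetric correlation matrix; names and values are made up.
rho <- matrix(c(1.00, 0.90, 0.30,
                0.90, 1.00, 0.20,
                0.30, 0.20, 1.00),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# The original test flags every column, because each one contains its own 1.
flag_all <- apply(rho, 2, function(x) any(abs(x) > 0.8))

# Blank out the diagonal so only correlations with *other* variables count.
off <- rho
diag(off) <- NA
flag_off <- apply(off, 2, function(x) any(abs(x) > 0.8, na.rm = TRUE))

# Keep the variables that are not highly correlated with any other variable.
kept <- rho[!flag_off, !flag_off, drop = FALSE]
```

Note that this still discards both members of a highly correlated pair (a and b here); keeping one representative per pair needs a different approach.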
Bert Gunter
2019-Nov-14 20:09 UTC
[R] Remove highly correlated variables from a data frame or matrix
Obvious advice:

DON'T DO THIS!

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
Ana Marija
2019-Nov-14 20:11 UTC
[R] Remove highly correlated variables from a data frame or matrix
I don't understand. I have to keep only pairs of variables with correlation less than 0.8 in order to proceed with some calculations.
Abby Spurdle
2019-Nov-14 20:29 UTC
[R] Remove highly correlated variables from a data frame or matrix
Sorry, but I don't understand your question.

When I first looked at this, I thought it was a correlation (or covariance) matrix, e.g.

> cor (quakes)
> cov (quakes)

However, your row and column variables are different, implying two different data sets. Also, some of the (correlation?) coefficients are the same, implying that some of the variables are the same, or very close.

Also, note that a matrix is not a data.frame.
Ana Marija
2019-Nov-14 20:42 UTC
[R] Remove highly correlated variables from a data frame or matrix
It can be converted between data frame and matrix. I am attaching the whole file here for examination.

I basically want to remove all entries for pairs whose value (the correlation, r2, calculated outside R, but it is a correlation) is above 0.8. So, for example, I would not keep rs883504, because it has r2>0.8 for all those rs...

            rs8069610 rs883504 rs8072394 rs4280293 rs4465638 rs12602378
rs56192520      0.582    0.903     0.582     0.582     0.811      0.302
rs3764410       0.598    0.928     0.598     0.598     0.836      0.311
rs145984817     0.638    0.975     0.638     0.638     0.879      0.344
rs1807401       0.638    0.975     0.638     0.638     0.879      0.344
rs1807402       0.638    0.975     0.638     0.638     0.879      0.344
rs35350506      0.638    0.975     0.638     0.638     0.879      0.344

Attachment: ro246_matrix.txt
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20191114/2577162a/attachment.txt>
Jim Lemon
2019-Nov-14 21:18 UTC
[R] Remove highly correlated variables from a data frame or matrix
Hi Ana,

Rather than addressing the question of why you want to do this, let's make the question easier to answer:

calc.rho<-matrix(c(0.903,0.268,0.327,0.327,0.327,0.582,
 0.928,0.276,0.336,0.336,0.336,0.598,
 0.975,0.309,0.371,0.371,0.371,0.638,
 0.975,0.309,0.371,0.371,0.371,0.638,
 0.975,0.309,0.371,0.371,0.371,0.638,
 0.975,0.309,0.371,0.371,0.371,0.638),ncol=6,byrow=TRUE)
rnames<-c("rs56192520","rs3764410","rs145984817","rs1807401",
 "rs1807402","rs35350506")
rownames(calc.rho)<-rnames
cnames<-c("rs9900318","rs8069906","rs9908521","rs9908336",
 "rs9908870","rs9895995")
colnames(calc.rho)<-cnames

Now if you just want a vector of the values less than 0.8, it's trivial:

calc.rho[calc.rho<0.8]

However, based on your previous questions, I suspect you want something else. Maybe the pairs of row/column names that correspond to the values less than 0.8. To ensure that you haven't tricked us by not including columns in which values range around 0.8, I'll do it this way:

# make the new variable name possible to decode
calc.lt.8<-calc.rho<0.8
varnames.lt.8<-data.frame(var1=NA,var2=NA)
for(row in 1:nrow(calc.rho)) {
 for(col in 1:ncol(calc.rho))
  if(calc.lt.8[row,col])
   varnames.lt.8<-rbind(varnames.lt.8,c(rnames[row],cnames[col]))
}
# now get rid of the first row of NA values
varnames.lt.8<-varnames.lt.8[-1,]

Clunky, but effective. You now have those variable pairs that you may want. Let us know in the next episode of this soap operation.

Jim
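The double loop above can also be written without explicit iteration. The following is a minimal sketch, not Jim's code: it uses base R's which() with arr.ind = TRUE on a made-up 2 x 2 stand-in for calc.rho (the names m, idx and pairs.lt.8 are illustrative). The pairs come out in column-major order rather than the row-by-row order of the loop.

```r
# Small stand-in for calc.rho; values are taken from the example rows,
# but the 2 x 2 shape is purely illustrative.
m <- matrix(c(0.903, 0.268,
              0.928, 0.276),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("rs56192520", "rs3764410"),
                            c("rs9900318", "rs8069906")))

# Row/column positions of all entries below the threshold.
idx <- which(m < 0.8, arr.ind = TRUE)

# One row per qualifying pair, with the value itself for reference.
pairs.lt.8 <- data.frame(var1 = rownames(m)[idx[, "row"]],
                         var2 = colnames(m)[idx[, "col"]],
                         rho  = m[idx],
                         stringsAsFactors = FALSE)
```

This avoids growing a data frame with rbind() inside a loop, which gets slow on a 246 x 246 matrix, and it needs no dummy NA row.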
Peter Langfelder
2019-Nov-15 01:37 UTC
[R] Remove highly correlated variables from a data frame or matrix
I suspect that you want to identify which variables are highly correlated, and then keep only "representative" variables, i.e., remove redundant ones. This is a bit of a risky procedure but I have done such things before as well sometimes to simplify large sets of highly related variables. If your threshold of 0.8 is approximate, you could simply use average linkage hierarchical clustering with dissimilarity = 1-correlation, cut the tree at the appropriate height (1-0.8=0.2), and from each cluster keep a single representative (e.g., the one with the highest mean correlation with other members of the cluster). Something along these lines (untested)

tree = hclust(1-calc.rho, method = "average")
clusts = cutree(tree, h = 0.2)
clustLevels = sort(unique(clusts))
representatives = unlist(lapply(clustLevels, function(cl)
{
  inClust = which(clusts==cl);
  rho1 = calc.rho[inClust, inClust, drop = FALSE];
  repr = inClust[ which.max(colSums(rho1)) ]
  repr
}))

the variable representatives now contains indices of the variables you want to retain, so you could subset the calc.rho matrix as

rho.retained = calc.rho[representatives, representatives]

I haven't tested the code and it may contain bugs, but something along these lines should get you where you want to be.

Oh, and depending on how strict you want to be with the remaining correlations, you could use complete linkage clustering (will retain more variables, some correlations will be above 0.8).

Peter
Ana Marija
2019-Nov-15 18:03 UTC
[R] Remove highly correlated variables from a data frame or matrix
Hi Peter,

Thank you for getting back to me and shedding light on this. I see your point. Doing Jim's method:

> keeprows<-apply(calc.rho,1,function(x) return(sum(x>0.8)<3))
> ro246.lt.8<-calc.rho[keeprows,keeprows]
> ro246.lt.8[ro246.lt.8 == 1] <- NA
> (mmax <- max(abs(ro246.lt.8), na.rm=TRUE))
[1] 0.566

Which is good in general: correlations in my matrix should not exceed 0.8. I need to run Mendelian Randomization on it later on, so I cannot have highly correlated SNPs there. But with Jim's method I am left with only 17 SNPs (out of 246), which means that both members of each highly correlated pair are removed, and it would be good to keep one of them.

I tried to run your code:

> tree = hclust(1-calc.rho, method = "average")
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
  missing value where TRUE/FALSE needed

Please advise.

Thanks,
Ana
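The error above is most likely because hclust() expects a "dist" object rather than a plain matrix; wrapping the dissimilarity in as.dist() is the usual fix. Below is a minimal sketch of Peter's procedure with that one change, run on a made-up 4 x 4 correlation matrix (rho and its values are illustrative, not Ana's data). It assumes the matrix is square and symmetric, with rows and columns in the same order, since as.dist() reads only the lower triangle.

```r
# Toy symmetric correlation matrix (made-up values): a and b are near-duplicates.
rho <- matrix(c(1.00, 0.95, 0.30, 0.25,
                0.95, 1.00, 0.35, 0.20,
                0.30, 0.35, 1.00, 0.10,
                0.25, 0.20, 0.10, 1.00),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], letters[1:4]))

# hclust() needs a "dist" object, so convert the dissimilarity matrix first.
tree <- hclust(as.dist(1 - rho), method = "average")

# Cut at height 0.2, i.e. an (approximate) correlation threshold of 0.8.
clusts <- cutree(tree, h = 0.2)

# From each cluster keep the member with the largest summed correlation
# to the other members (ties go to the first member).
representatives <- unlist(lapply(sort(unique(clusts)), function(cl) {
  inClust <- which(clusts == cl)
  rho1 <- rho[inClust, inClust, drop = FALSE]
  inClust[which.max(colSums(rho1))]
}))

rho.retained <- rho[representatives, representatives, drop = FALSE]
```

In this toy case a and b fall into one cluster and only a is kept, alongside c and d. With average linkage the 0.8 cut-off is only approximate, so it is worth checking the largest off-diagonal value of rho.retained before proceeding.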