Ana Marija
2019-Nov-15 18:03 UTC
[R] Remove highly correlated variables from a data frame or matrix
Hi Peter,

Thank you for getting back to me and shedding light on this. I see your point. Using Jim's method:

> keeprows<-apply(calc.rho,1,function(x) return(sum(x>0.8)<3))
> ro246.lt.8<-calc.rho[keeprows,keeprows]
> ro246.lt.8[ro246.lt.8 == 1] <- NA
> (mmax <- max(abs(ro246.lt.8), na.rm=TRUE))
[1] 0.566

This is good in general: correlations in my matrix should not exceed 0.8. I need to run Mendelian Randomization on it later, so I cannot have highly correlated SNPs in there. But with Jim's method I am left with only 17 SNPs (out of 246), which means that both members of each highly correlated pair are removed, and it would be good to keep one SNP from each such pair.

I tried your code:

> tree = hclust(1-calc.rho, method = "average")
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
  missing value where TRUE/FALSE needed

Please advise.

Thanks
Ana

On Thu, Nov 14, 2019 at 7:37 PM Peter Langfelder <peter.langfelder at gmail.com> wrote:
>
> I suspect that you want to identify which variables are highly
> correlated, and then keep only "representative" variables, i.e.,
> remove redundant ones. This is a bit of a risky procedure but I have
> done such things before as well sometimes to simplify large sets of
> highly related variables. If your threshold of 0.8 is approximate, you
> could simply use average linkage hierarchical clustering with
> dissimilarity = 1-correlation, cut the tree at the appropriate height
> (1-0.8=0.2), and from each cluster keep a single representative (e.g.,
> the one with the highest mean correlation with other members of the
> cluster). Something along these lines (untested)
>
> tree = hclust(1-calc.rho, method = "average")
> clusts = cutree(tree, h = 0.2)
> clustLevels = sort(unique(clusts))
> representatives = unlist(lapply(clustLevels, function(cl)
> {
>   inClust = which(clusts==cl);
>   rho1 = calc.rho[inClust, inClust, drop = FALSE];
>   repr = inClust[ which.max(colSums(rho1)) ]
>   repr
> }))
>
> the variable representatives now contains indices of the variables you
> want to retain, so you could subset the calc.rho matrix as
> rho.retained = calc.rho[representatives, representatives]
>
> I haven't tested the code and it may contain bugs, but something along
> these lines should get you where you want to be.
>
> Oh, and depending on how strict you want to be with the remaining
> correlations, you could use complete linkage clustering (will retain
> more variables, some correlations will be above 0.8).
>
> Peter
>
> On Thu, Nov 14, 2019 at 10:50 AM Ana Marija <sokovic.anamarija at gmail.com> wrote:
> >
> > Hello,
> >
> > I have a data frame like this (a matrix):
> > head(calc.rho)
> >             rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995
> > rs56192520      0.903     0.268     0.327     0.327     0.327     0.582
> > rs3764410       0.928     0.276     0.336     0.336     0.336     0.598
> > rs145984817     0.975     0.309     0.371     0.371     0.371     0.638
> > rs1807401       0.975     0.309     0.371     0.371     0.371     0.638
> > rs1807402       0.975     0.309     0.371     0.371     0.371     0.638
> > rs35350506      0.975     0.309     0.371     0.371     0.371     0.638
> >
> > > dim(calc.rho)
> > [1] 246 246
> >
> > I would like to remove from this data all highly correlated variables,
> > with correlation more than 0.8
> >
> > I tried this:
> >
> > > data<- calc.rho[,!apply(calc.rho,2,function(x) any(abs(x) > 0.80))]
> > > dim(data)
> > [1] 246 0
> >
> > Can you please advise,
> >
> > Thanks
> > Ana
> >
> > But this removes everything.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
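Editorial note on why the original subsetting returned zero columns. This is a minimal sketch on a made-up 4x4 matrix; the values and the names `rho`, `snp1` to `snp4`, `flagged` and `flagged_offdiag` are illustrative assumptions, not the LDmatrix data. A correlation matrix has 1 on its diagonal, so `any(abs(x) > 0.8)` is TRUE for every column. Masking the diagonal first flags only the columns that have a high off-diagonal value.

```r
# Toy correlation matrix (hypothetical values): snp1 and snp2 are highly correlated.
rho <- matrix(c(1.00, 0.95, 0.30, 0.20,
                0.95, 1.00, 0.25, 0.10,
                0.30, 0.25, 1.00, 0.40,
                0.20, 0.10, 0.40, 1.00),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("snp", 1:4), paste0("snp", 1:4)))

# Each column includes its own diagonal entry (1 > 0.8), so every column is flagged.
flagged <- apply(rho, 2, function(x) any(abs(x) > 0.8))

# Zero out the diagonal on a copy, then flag again: only snp1 and snp2 remain flagged.
rho_offdiag <- rho
diag(rho_offdiag) <- 0
flagged_offdiag <- apply(rho_offdiag, 2, function(x) any(abs(x) > 0.8))

# Dropping the flagged columns now leaves snp3 and snp4.
kept <- rho[, !flagged_offdiag, drop = FALSE]
print(colnames(kept))
```

Even with the diagonal masked, this kind of filter removes both members of a correlated pair, which is the behaviour the message above complains about.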
Ana Marija
2019-Nov-15 18:31 UTC
[R] Remove highly correlated variables from a data frame or matrix
If it is of any help, my correlation matrix (calc.rho) was computed under the LDmatrix tab at https://ldlink.nci.nih.gov/?tab=ldmatrix, and the dataset of 246 SNPs is below:

rs56192520 rs3764410 rs145984817 rs1807401 rs1807402 rs35350506 rs2089177 rs12325677 rs62064624 rs62064631 rs2349295 rs2174369 rs7218554 rs62064634 rs4360974 rs4527060 rs6502526 rs6502527 rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995 rs7211086 rs9905280 rs8073305 rs8072086 rs4312350 rs4313843 rs8069610 rs883504 rs8072394 rs4280293 rs4465638 rs12602378 rs9899059 rs6502530 rs4380085 rs6502532 rs4792798 rs4792799 rs4316813 rs148563931 rs74751226 rs8068857 rs8069441 rs77397878 rs75339756 rs4608391 rs79569548 rs4275914 rs11870422 rs8075751 rs11658904 rs138437542 rs80344434 rs7222311 rs7221842 rs7223686 rs78013597 rs74965036 rs78063986 rs118106233 rs117345712 rs113004656 rs9898995 rs4985718 rs9893911 rs79110942 rs7208929 rs12601453 rs4078062 rs75129280 rs76664572 rs78961289 rs146364798 rs76715413 rs4078534 rs79457460 rs74369938 rs76423171 rs74668400 rs75146120 rs1135237 rs9914671 rs117759512 rs4985696 rs16961340 rs17794159 rs4247118 rs78572469 rs12601193 rs2349646 rs2090018 rs12601424 rs4985701 rs8064550 rs2271521 rs2271520 rs11078374 rs4985702 rs1124961 rs11652674 rs3924340 rs112450164 rs7208973 rs9910857 rs78574480 rs8072184 rs12602196 rs6502563 rs3744135 rs148779543 rs77689691 rs41319048 rs117340532 rs78647096 rs77712968 rs16961396 rs80054920 rs7206981 rs4985740 rs3803762 rs77103270 rs7207485 rs77342773 rs3826304 rs3744126 rs7210879 rs7211576 rs117967362 rs75978745 rs6502564 rs9894565 rs36079048 rs8076621 rs7218795 rs3803761 rs12602675 rs7208065 rs4985705 rs8080386 rs8065832 rs2018781 rs1736221 rs1736220 rs1736217 rs1708620 rs1708619 rs1736216 rs76319098 rs1736215 rs1736214 rs1708617 rs12602831 rs12602871 rs1736213 rs1736212 rs76045368 rs34518797 rs11078378 rs8079562 rs8065774 rs8066090 rs41337846 rs1736209 rs1736208 rs12949822 rs76246042 rs12600635 rs55689224 rs1736207 rs1708626 rs1736206 rs9896078
rs16961474 rs1708627 rs1736205 rs1708628 rs7220577 rs2294155 rs1736204 rs1736203 rs1736202 rs12937908 rs1736200 rs1708623 rs1708624 rs9894884 rs9901894 rs9903294 rs2472689 rs1630656 rs111478970 rs3182911 rs7219012 rs9890657 rs12453455 rs12947291 rs150267386 rs16961493 rs11652745 rs9907107 rs8070574 rs4985759 rs3866959 rs7219248 rs6502568 rs7220275 rs12450037 rs7225876 rs9892352 rs4985760 rs6502569 rs1029830 rs2012954 rs1029832 rs2270180 rs8072402 rs7221553 rs145597919 rs150772017 rs2041393 rs6502578 rs11078382 rs9912109 rs12601631 rs11869054 rs11869079 rs9912599 rs7220057 rs9896970 rs34121330 rs34668117 rs67773570 rs242252 rs955893 rs28583584 rs9944423 rs7217764 rs11651957 rs73978990 rs8071007 rs56044345 rs17804843
Jim Lemon
2019-Nov-15 20:17 UTC
[R] Remove highly correlated variables from a data frame or matrix
While the remedy for your dissatisfaction with my previous solution should be obvious, I will make it explicit.

# that is rows containing at most one value > 0.8
# ignoring the diagonal
keeprows<-apply(ro246,1,function(x) return(sum(x>0.8)<2))
ro246.lt.8<-ro246[keeprows,keeprows]

Jim
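Editorial note: a small sketch of how this row filter behaves, run on a made-up 4x4 matrix rather than on `ro246` (the values and the names `rho`, `snp1` to `snp4` and `rho.lt.8` are assumptions for illustration). The count `sum(x > 0.8)` includes the diagonal entry, so with the `< 2` cutoff a row is kept only when it has no off-diagonal value above 0.8. The comparison is also signed, so a strong negative correlation would not be counted.

```r
# Toy correlation matrix (hypothetical values): snp1 and snp2 form one highly correlated pair.
rho <- matrix(c(1.00, 0.95, 0.30, 0.20,
                0.95, 1.00, 0.25, 0.10,
                0.30, 0.25, 1.00, 0.40,
                0.20, 0.10, 0.40, 1.00),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("snp", 1:4), paste0("snp", 1:4)))

# Count values above 0.8 per row; the diagonal always contributes one.
n_high <- apply(rho, 1, function(x) sum(x > 0.8))

# Keep rows whose only value above 0.8 is the diagonal itself.
keeprows <- n_high < 2
rho.lt.8 <- rho[keeprows, keeprows, drop = FALSE]
print(rownames(rho.lt.8))
```

On this toy input both snp1 and snp2 are dropped, so a filter of this form does not keep one representative per correlated group. That is the gap the clustering suggestion elsewhere in the thread is meant to close.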
Peter Langfelder
2019-Nov-16 02:01 UTC
[R] Remove highly correlated variables from a data frame or matrix
Try hclust(as.dist(1-calc.rho), method = "average").

Peter
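Editorial note: `hclust()` documents its first argument as a dissimilarity structure as produced by `dist`, and `as.dist()` converts a square matrix into one, which is why the one-line fix above works. The sketch below runs the clustering recipe from this thread end to end on a made-up 5x5 matrix; the values and the name `rho` are assumptions, and the remaining variable names follow the code posted earlier in the thread.

```r
# Toy correlation matrix (hypothetical values): snp1..snp3 are mutually highly correlated.
rho <- matrix(c(1.00, 0.95, 0.90, 0.30, 0.10,
                0.95, 1.00, 0.85, 0.25, 0.15,
                0.90, 0.85, 1.00, 0.20, 0.20,
                0.30, 0.25, 0.20, 1.00, 0.40,
                0.10, 0.15, 0.20, 0.40, 1.00),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("snp", 1:5), paste0("snp", 1:5)))

# Convert 1 - correlation into a "dist" object; hclust() needs this, not a plain matrix.
d <- as.dist(1 - rho)
tree <- hclust(d, method = "average")

# Cut at height 0.2, i.e. an (approximate) correlation threshold of 0.8.
clusts <- cutree(tree, h = 0.2)

# From each cluster keep the member with the largest summed within-cluster correlation.
representatives <- unlist(lapply(sort(unique(clusts)), function(cl) {
  inClust <- which(clusts == cl)
  rho1 <- rho[inClust, inClust, drop = FALSE]
  inClust[which.max(colSums(rho1))]
}))

rho.retained <- rho[representatives, representatives, drop = FALSE]
print(rownames(rho.retained))
```

Here snp1, snp2 and snp3 fall into one cluster and snp1 is kept as its representative, alongside snp4 and snp5.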
Ana Marija
2019-Nov-16 16:10 UTC
[R] Remove highly correlated variables from a data frame or matrix
Hi Peter,

Thank you so much!!! I will use complete linkage clustering because the Mendelian Randomization function (https://cran.r-project.org/web/packages/MendelianRandomization/vignettes/Vignette_MR.pdf) I plan to use allows for correlations, but not ones as high as 0.9 or more. I got 40 SNPs out of 246, so that is an improvement!

Regards,
Ana
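Editorial note: since complete linkage can leave some retained pairs above the threshold, it may be worth checking the largest remaining correlation before passing the SNPs on. The sketch below compares the two linkages on a made-up 4x4 matrix. The helper names `pick_representatives` and `max_offdiag` are hypothetical and the values are invented; none of this is the LDmatrix data.

```r
# Toy correlation matrix (hypothetical values): snp3 is close to snp1 but less so to snp2.
rho <- matrix(c(1.00, 0.95, 0.88, 0.20,
                0.95, 1.00, 0.74, 0.10,
                0.88, 0.74, 1.00, 0.30,
                0.20, 0.10, 0.30, 1.00),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("snp", 1:4), paste0("snp", 1:4)))

# Hypothetical helper: cluster on 1 - correlation, cut at h, keep one member per cluster.
pick_representatives <- function(rho, method, h = 0.2) {
  clusts <- cutree(hclust(as.dist(1 - rho), method = method), h = h)
  unlist(lapply(sort(unique(clusts)), function(cl) {
    inClust <- which(clusts == cl)
    inClust[which.max(colSums(rho[inClust, inClust, drop = FALSE]))]
  }))
}

# Hypothetical helper: largest absolute correlation off the diagonal.
max_offdiag <- function(m) {
  diag(m) <- NA
  max(abs(m), na.rm = TRUE)
}

keep_average  <- pick_representatives(rho, "average")
keep_complete <- pick_representatives(rho, "complete")

max_average  <- max_offdiag(rho[keep_average, keep_average])
max_complete <- max_offdiag(rho[keep_complete, keep_complete])
print(c(average = max_average, complete = max_complete))
```

On this toy input average linkage keeps two variables (snp1, snp4), while complete linkage keeps three (snp1, snp3, snp4) and leaves the snp1/snp3 pair at 0.88. Masking the diagonal with `diag(m) <- NA`, rather than masking every value equal to 1, keeps any off-diagonal pairs in perfect correlation visible in the check.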