Ana Marija
2019-Nov-14 20:42 UTC
[R] Remove highly correlated variables from a data frame or matrix
It can be converted between data frame and matrix. I am attaching here the whole file for examination.

I basically want to remove all entries for pairs which have value in between them (correlation calculated not in R, but it is correlation, r2), so for example I would not keep rs883504, because it has r2 > 0.8 for all those rs...

            rs8069610 rs883504 rs8072394 rs4280293 rs4465638 rs12602378
rs56192520      0.582    0.903     0.582     0.582     0.811      0.302
rs3764410       0.598    0.928     0.598     0.598     0.836      0.311
rs145984817     0.638    0.975     0.638     0.638     0.879      0.344
rs1807401       0.638    0.975     0.638     0.638     0.879      0.344
rs1807402       0.638    0.975     0.638     0.638     0.879      0.344
rs35350506      0.638    0.975     0.638     0.638     0.879      0.344

On Thu, Nov 14, 2019 at 2:29 PM Abby Spurdle <spurdle.a at gmail.com> wrote:
>
> Sorry, but I don't understand your question.
>
> When I first looked at this, I thought it was a correlation (or
> covariance) matrix.
> e.g.
>
> > cor (quakes)
> > cov (quakes)
>
> However, your row and column variables are different, implying two
> different data sets.
> Also, some of the (correlation?) coefficients are the same, implying
> that some of the variables are the same, or very close.
>
> Also, note that a matrix is not a data.frame.
>
> > I have a data frame like this (a matrix):
> > head(calc.rho)
> >             rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995
> > rs56192520      0.903     0.268     0.327     0.327     0.327     0.582
> > rs3764410       0.928     0.276     0.336     0.336     0.336     0.598
> > rs145984817     0.975     0.309     0.371     0.371     0.371     0.638
> > rs1807401       0.975     0.309     0.371     0.371     0.371     0.638
> > rs1807402       0.975     0.309     0.371     0.371     0.371     0.638
> > rs35350506      0.975     0.309     0.371     0.371     0.371     0.638
> > > dim(calc.rho)
> > [1] 246 246
> > I would like to remove from this data all highly correlated variables,
> > with correlation more than 0.8

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ro246_matrix.txt
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20191114/2577162a/attachment.txt>
Abby Spurdle
2019-Nov-14 20:56 UTC
[R] Remove highly correlated variables from a data frame or matrix
> I basically want to remove all entries for pairs which have value in
> between them (correlation calculated not in R, but it is correlation,
> r2)
> so for example I would not keep: rs883504 because it has r2>0.8 for
> all those rs...

I'm still not sure what "remove all entries" means.

In your example, rs883504 has all correlation coefficients > 0.8 in the data returned by head(). However, most of its correlation coefficients are < 0.8 if you include the entire matrix.

If you remove a variable that has at least one correlation coefficient > 0.8, you would remove all the variables. However, if you remove a variable that has all correlation coefficients > 0.8, you would (probably) remove no variables.
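
To make the difference between the two rules concrete, here is a minimal sketch on a toy matrix. The values and the names (rho, snp1, rowA, and so on) are invented for illustration and are not taken from the attached file; the object name calc.rho from the original post is deliberately not reused.

```r
# Toy stand-in for a matrix like calc.rho (hypothetical values and names).
rho <- matrix(
  c(0.90, 0.30, 0.85,
    0.95, 0.20, 0.40,
    0.10, 0.88, 0.30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("rowA", "rowB", "rowC"), c("snp1", "snp2", "snp3"))
)

threshold <- 0.8
high <- rho > threshold          # logical matrix: TRUE where r2 > 0.8

# Rule 1: flag a column if AT LEAST ONE coefficient exceeds the threshold.
any_high <- colnames(rho)[apply(high, 2, any)]

# Rule 2: flag a column only if ALL of its coefficients exceed the threshold.
all_high <- colnames(rho)[apply(high, 2, all)]

# Drop the flagged columns under each rule; drop = FALSE keeps a matrix
# even when zero or one column remains.
keep_if_any <- rho[, !apply(high, 2, any), drop = FALSE]
keep_if_all <- rho[, !apply(high, 2, all), drop = FALSE]

ncol(keep_if_any)  # columns left under rule 1
ncol(keep_if_all)  # columns left under rule 2
```

In this toy case every column has at least one value above 0.8 but none has all values above 0.8, so rule 1 drops everything and rule 2 drops nothing. Whether the real 246 x 246 matrix behaves the same way would need to be checked on the attached data.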