Rita Carreira
2011-Apr-19 19:10 UTC
[R] Subsetting a data frame by dropping correlated variables
Hello R Users!

I have a data frame that has many variables, some with missing observations, and some that are correlated with each other. I would like to subset the data by dropping one of the variables that is correlated with another variable that I will keep in the data frame. Alternatively, I could also drop both of the variables that are correlated with each other. Worry not! I am not deleting data; I am just finding a subset of the data that I can use to impute some missing observations.

I have tried the following statement:

    dfQuc <- dfQ[ , sapply(dfQ, function(x) cor(dfQ, use = "pairwise.complete.obs", method = "pearson") < 0.8)]

but it gives me the following error:

    Error in `[.data.frame`(dfQ, , sapply(dfQ, function(x) cor(dfQ, use = "pairwise.complete.obs", :
      undefined columns selected

Since I have several dozen data frames, it is impractical for me to manually inspect the correlation matrices and select which variables to drop, so I am trying to have R make the selection for me. Does anyone have any idea how to accomplish this?

Thank you very much!
Rita

=====================================
"If you think education is expensive, try ignorance." --Derek Bok
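A note on the error quoted above: it most likely comes from the shape of the object that sapply() returns, not from cor() itself. The anonymous function never uses its argument x, so every call returns the full correlation matrix, and sapply() stacks the flattened copies into one large logical matrix. The sketch below uses a small made-up data frame in place of dfQ (the original data was not posted), so treat it as an illustration and not as a diagnosis of the actual data.

```r
# Toy data frame standing in for dfQ (hypothetical; not the poster's data).
set.seed(1)
dfQ <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
dfQ$d <- dfQ$a + rnorm(20, sd = 0.01)  # d is almost a copy of a
dfQ$b[3] <- NA                          # one missing observation

# Same expression as in the post: x is ignored, so each of the 4 calls
# returns the whole 4 x 4 logical matrix, flattened to 16 values.
idx <- sapply(dfQ, function(x)
  cor(dfQ, use = "pairwise.complete.obs", method = "pearson") < 0.8)

dim(idx)  # 16 rows by 4 columns, far more values than dfQ has columns

# Using that matrix as a column selector is what fails in the post;
# tryCatch() captures the message instead of stopping the script.
res <- tryCatch(dfQ[, idx], error = function(e) conditionMessage(e))
```

A column selector for dfQ needs one entry per column (or a vector of column indices), which is why the correlation matrix has to be reduced to a per-column decision first.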
Juliet Hannah
2011-Apr-28 02:33 UTC
[R] Subsetting a data frame by dropping correlated variables
The 'findCorrelation' function in the caret package may be helpful.
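For anyone finding this thread later, here is a minimal sketch of how that suggestion might be applied. It assumes all columns are numeric and uses a made-up data frame in place of dfQ. findCorrelation() takes a correlation matrix and a cutoff and returns the indices of the columns it suggests removing. The base-R branch is only a rough fallback for when caret is not installed; it is an assumption of this sketch and not caret's algorithm.

```r
# Toy data standing in for dfQ (hypothetical; not the poster's data).
set.seed(1)
dfQ <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
dfQ$d <- dfQ$a + rnorm(50, sd = 0.01)  # d is nearly a copy of a
dfQ$b[c(3, 7)] <- NA                    # some missing observations

# Correlation matrix computed with the same options as in the original post.
cm <- cor(dfQ, use = "pairwise.complete.obs", method = "pearson")

if (requireNamespace("caret", quietly = TRUE)) {
  # Indices of columns suggested for removal at an absolute-correlation cutoff.
  drop_idx <- caret::findCorrelation(cm, cutoff = 0.8)
} else {
  # Fallback (not caret's method): drop the later column of every pair
  # whose absolute correlation exceeds the cutoff.
  high <- abs(cm) > 0.8 & upper.tri(cm)
  drop_idx <- unname(which(colSums(high, na.rm = TRUE) > 0))
}

# Guard the empty case: dfQ[, -integer(0)] would select no columns at all.
dfQuc <- if (length(drop_idx) > 0) dfQ[, -drop_idx, drop = FALSE] else dfQ
names(dfQuc)
```

To run this over several dozen data frames, the same steps could be wrapped in a function and applied with lapply() to a list of data frames. Note that pairwise-complete correlations can contain NA when two columns share too few observations, so it may be worth checking the matrix for NA before passing it on.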