Rita Carreira
2011-Apr-15 22:00 UTC
[R] Function for deleting variables with >=50% missing obs from a data frame
Hello R users!

I have several data frames where some of the variables have many missing observations. For example, Q1 in one of my data frames has over 66% of its observations missing. I have tried imputation with mice, but it does not work for all the data frames, and I get the following message or one similar to it:

 iter imp variable
  1   1  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11  Q12  Q13  Q14  Q15  Q19  Q36  Q47  Q52  Q79  Q80  Q94  Q97  Q104  Q108  Q122  Q131  Q134  P1  P2  P3  P4  P5  P6
Error in solve.default(xtx + diag(pen)) :
  system is computationally singular: reciprocal condition number = 1.83044e-16
In addition: Warning messages:
1: In sqrt((sum(residuals^2))/(sum(ry) - ncol(x) - 1)) : NaNs produced
...
7: In sqrt((sum(residuals^2))/(sum(ry) - ncol(x) - 1)) : NaNs produced

Note: warnings 2 to 6 suppressed by me.

I would like to try a different approach where I delete from the data frame the variables that have more than 50% missing observations (well, the actual percentage might change). I have already deleted the variables that were all missing, and for this I used the following code, which was kindly suggested by one of you:

    ## Data frame after removing any blank columns:
    dfQ <- dfQtemp[ , sapply(dfQtemp, function(x) !all(is.na(x)))]

Any ideas or suggestions for deleting variables with partially missing data?

Thanks and have a great weekend!

Rita
=====================================
"If you think education is expensive, try ignorance."--Derek Bok
Ben Bolker
2011-Apr-15 22:13 UTC
[R] Function for deleting variables with >=50% missing obs from a data frame
Rita Carreira <ritacarreira <at> hotmail.com> writes:

> I have several data frames where some of the variables have many
> missing observations. For example, Q1 in
> one of my data frames has over 66% of its observations missing.
> I have tried imputation with mice but it does
> not work for all the data frames and I get the following
> message or a similar message to this:

How about

    missing_prop <- sapply(orig_data, function(x) { mean(is.na(x)) })
    good_data <- orig_data[missing_prop < 0.5]

(untested)
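To make the suggestion above concrete, here is a minimal sketch on a toy data frame. The data and the column contents are made up for illustration; only the `sapply`/`is.na` idiom comes from the thread. It assumes the goal stated in the subject line, namely keeping only columns with less than 50% missing values.

```r
# Hypothetical toy data; not from the original poster's data frames.
orig_data <- data.frame(
  Q1 = c(NA, NA, NA, 4),      # 3 of 4 missing
  Q2 = c(1, NA, 3, 4),        # 1 of 4 missing
  Q3 = c("a", "b", NA, NA),   # 2 of 4 missing
  stringsAsFactors = FALSE
)

# Proportion of missing values per column: the mean of a logical
# vector is the fraction of TRUE entries.
missing_prop <- sapply(orig_data, function(x) mean(is.na(x)))

# Keep columns with strictly less than 50% missing.
# drop = FALSE keeps the result a data frame even if one column survives.
good_data <- orig_data[, missing_prop < 0.5, drop = FALSE]

print(missing_prop)
print(names(good_data))
```

Here only Q2 is kept: Q1 exceeds the cutoff and Q3 sits exactly at 50%, which the strict `<` excludes. Use `<=` instead if columns at exactly the cutoff should stay.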
Rita Carreira
2011-Apr-18 21:48 UTC
[R] Function for deleting variables with >=50% missing obs from a data frame
Thanks for the suggestion, Daryl! I did have to include the exclamation point before mean; otherwise it selected the columns with the most missing observations. But it was really nice to see this flexibility in R. So my fix was

    dfQ <- dfQtemp[ , sapply(dfQtemp, function(x) !mean(is.na(x)) > .6)]

Thanks again!

Rita

> Date: Fri, 15 Apr 2011 15:08:29 -0700
> From: darylm@uw.edu
> To: ritacarreira@hotmail.com
> Subject: Re: [R] Function for deleting variables with >=50% missing obs from a data frame
>
> you could simply modify
>
>     !all(is.na(x))
>
> to
>
>     mean(is.na(x)) > .6
>
> or some such, or invert the logic if I have it backwards.
>
> .6 was the fraction greater than which we omit the data.
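A note on why the fix in this reply works: in R, unary `!` binds less tightly than the comparison operators, so `!mean(is.na(x)) > .6` is parsed as `!(mean(is.na(x)) > .6)`, which is the same test as `mean(is.na(x)) <= .6`. The sketch below wraps that test in a small helper so the cutoff can be changed easily. The function name `drop_sparse_columns`, the argument `max_missing`, and the toy data are all hypothetical and not part of the thread.

```r
# Hypothetical helper: keep columns whose missing fraction is at most max_missing.
drop_sparse_columns <- function(df, max_missing = 0.6) {
  stopifnot(is.data.frame(df), max_missing >= 0, max_missing <= 1)
  # vapply guarantees exactly one logical value per column.
  keep <- vapply(df, function(x) mean(is.na(x)) <= max_missing, logical(1))
  df[, keep, drop = FALSE]
}

# Toy stand-in for dfQtemp; the values are made up for illustration.
dfQtemp <- data.frame(
  Q1 = c(NA, NA, 3, 4, 5),     # 2 of 5 missing
  Q2 = c(1, 2, NA, 4, 5),      # 1 of 5 missing
  Q3 = c(NA, NA, NA, NA, 1)    # 4 of 5 missing
)

dfQ <- drop_sparse_columns(dfQtemp, max_missing = 0.6)
print(names(dfQ))

# Precedence check: parsed as !(0.2 > 0.6), which is TRUE.
print(!0.2 > 0.6)
```

Writing the comparison as `<=` avoids relying on operator precedence and reads more directly as "keep columns with at most this fraction missing".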