D. Alain
2011-Feb-21 11:20 UTC
[R] Subset according to groups NA proportion within specific variables
Dear R-List,

I have a dataframe with one grouping variable (x) and three response variables (y, z, w).

    df <- data.frame(x = c(rep(1,3), rep(2,4), rep(3,5)),
                     y = rnorm(12),
                     z = c(3,4,5,NA,NA,NA,NA,1,2,1,2,1),
                     w = c(1,2,3,3,4,3,5,NA,5,NA,7,8))

    > df
    x           y  z  w
    1  0.29306106  3  1
    1  0.54797780  4  2
    1 -1.38365548  5  3
    2 -0.20407986 NA  3
    2 -0.87322574 NA  4
    2 -1.23356250 NA  3
    2  0.43929374 NA  5
    3  1.16405483  1 NA
    3  1.07083464  2  5
    3 -0.67463191  1 NA
    3 -0.66410552  2  7
    3 -0.02543358  1  8

Now I want to make a new dataframe df.sub comprising only the cases that belong to groups in which the proportion of NAs does not exceed 50% in any of the response variables y, z, w.

In the above example, this would be a dataframe with all cases of groups 1 and 3 (since there are 100% NAs in z for group 2):

    > df.sub
    x           y  z  w
    1  0.29306106  3  1
    1  0.54797780  4  2
    1 -1.38365548  5  3
    3  1.16405483  1 NA
    3  1.07083464  2  5
    3 -0.67463191  1 NA
    3 -0.66410552  2  7
    3 -0.02543358  1  8

Please excuse me if the problem has already been treated somewhere, but so far I was not able to find the right thread for my question in RSeek.

Can anyone help?

Thanks in advance!

D. Alain
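For reference, the NA proportions that the criterion refers to can be tabulated per group and variable. The following is a minimal base R sketch, not part of the original post; the name na.prop is chosen here for illustration, and since y is random its values will differ from the printout above.

```r
df <- data.frame(x = c(rep(1, 3), rep(2, 4), rep(3, 5)),
                 y = rnorm(12),
                 z = c(3, 4, 5, NA, NA, NA, NA, 1, 2, 1, 2, 1),
                 w = c(1, 2, 3, 3, 4, 3, 5, NA, 5, NA, 7, 8))

# is.na() on the response columns gives a logical matrix; the mean of a
# logical vector is the proportion of TRUE values, so aggregating with
# FUN = mean yields one NA proportion per group (row) and variable (column).
na.prop <- aggregate(is.na(df[c("y", "z", "w")]),
                     by = list(x = df$x), FUN = mean)
na.prop
```

In this table group 2 has proportion 1 in z and group 3 has proportion 0.4 in w, so only group 2 exceeds the 50% limit.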
Karl Ove Hufthammer
2011-Feb-21 12:05 UTC
[R] Subset according to groups NA proportion within specific variables
D. Alain wrote:

> Now I want to make a new dataframe df.sub comprising only cases pertaining
> to groups, where the overall proportion of NAs in either of the response
> variables y,z,w does not exceed 50%.

One simple example:

    library(plyr)
    # Append to each group the proportion of its rows that contain at least one NA
    na.prop = function(x) data.frame(x, missing = 1 - nrow(na.omit(x))/nrow(x))
    newdf = ddply(df, .(x), na.prop)

Now you can use 'subset' on 'newdf' to obtain the required rows.

(For very large data sets it may be better not to create an entire data frame in 'na.prop', duplicating the data in 'df', but instead to return just the proportion.)

-- 
Karl Ove Hufthammer
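The suggestion of returning just the proportion and then subsetting could look roughly like the sketch below. It swaps plyr for base R (tapply and complete.cases), and the names resp, prop.incomplete and keep are illustrative assumptions. Like the function above, it counts rows with at least one NA, which agrees with the per-variable criterion on this example but need not in general.

```r
df <- data.frame(x = c(rep(1, 3), rep(2, 4), rep(3, 5)),
                 y = rnorm(12),
                 z = c(3, 4, 5, NA, NA, NA, NA, 1, 2, 1, 2, 1),
                 w = c(1, 2, 3, 3, 4, 3, 5, NA, 5, NA, 7, 8))

resp <- c("y", "z", "w")  # response variables to check (illustrative name)

# One number per group: the share of rows with at least one NA
# among the response variables.
prop.incomplete <- tapply(!complete.cases(df[resp]), df$x, mean)

# Group labels whose share does not exceed 50%
keep <- as.numeric(names(prop.incomplete)[prop.incomplete <= 0.5])

# Keep all rows of the qualifying groups
df.sub <- subset(df, x %in% keep)
df.sub
```

Only a vector with one entry per group is created here, so the data in df are not duplicated before the final subset.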
Dennis Murphy
2011-Feb-21 12:14 UTC
[R] Subset according to groups NA proportion within specific variables
Hi:

Here's one way with package plyr:

    df <- data.frame(x = c(rep(1,3), rep(2,4), rep(3,5)),
                     y = rnorm(12),
                     z = c(3,4,5,NA,NA,NA,NA,1,2,1,2,1),
                     w = c(1,2,3,3,4,3,5,NA,5,NA,7,8))

    library(plyr)
    # Return the group's rows only if no response variable has more than
    # 50% NAs; otherwise return NULL, so the group does not appear in the result.
    fun <- function(d) {
        u <- apply(d[, -1], 2, function(y) sum(is.na(y)))/nrow(d)
        if(all(u <= 0.5)) return(d)
    }

    > ddply(df, 'x', fun)
      x           y z  w
    1 1 -1.22768415 3  1
    2 1  0.03108696 4  2
    3 1  0.90246871 5  3
    4 3 -0.47387908 1 NA
    5 3  1.59577665 2  5
    6 3 -0.80792438 1 NA
    7 3  0.20927614 2  7
    8 3 -0.46172477 1  8
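The same split-apply-combine idea can be sketched without plyr. This is an assumed base R equivalent, not the code posted above: the helper keep_group and its arguments vars and max.prop are hypothetical names introduced for illustration.

```r
df <- data.frame(x = c(rep(1, 3), rep(2, 4), rep(3, 5)),
                 y = rnorm(12),
                 z = c(3, 4, 5, NA, NA, NA, NA, 1, 2, 1, 2, 1),
                 w = c(1, 2, 3, 3, 4, 3, 5, NA, 5, NA, 7, 8))

# Hypothetical helper: TRUE if, within the group d, no variable in
# 'vars' has an NA proportion above 'max.prop'.
keep_group <- function(d, vars, max.prop = 0.5) {
  all(colMeans(is.na(d[vars])) <= max.prop)
}

pieces <- split(df, df$x)                       # one data frame per group
ok <- vapply(pieces, keep_group, logical(1),    # test each group
             vars = c("y", "z", "w"))
df.sub <- do.call(rbind, pieces[ok])            # recombine the kept groups
rownames(df.sub) <- NULL
df.sub
```

Naming the variables explicitly avoids relying on the grouping column being first, and the threshold can be changed through max.prop.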
Dimitris Rizopoulos
2011-Feb-21 12:23 UTC
[R] Subset according to groups NA proportion within specific variables
One way is the following:

    DF <- data.frame(x = c(rep(1,3), rep(2,4), rep(3,5)),
                     y = rnorm(12),
                     z = c(3,4,5,NA,NA,NA,NA,1,2,1,2,1),
                     w = c(1,2,3,3,4,3,5,NA,5,NA,7,8))

    # TRUE where a response value is missing
    na.ind <- sapply(DF[-1], is.na)
    # Replace each entry by the NA proportion of its group and variable,
    # then flag the proportions that do not exceed 50%
    na.ind <- ave(na.ind, rep(DF$x, 3), col(na.ind)) <= 0.5
    # Keep the rows whose group passes for all three variables
    DF[apply(na.ind, 1, all), ]

I hope it helps.

Best,
Dimitris

-- 
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
Web: http://www.erasmusmc.nl/biostatistiek/
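Whether the comparison with 0.5 is strict matters only at the boundary, since the request was that the proportion "does not exceed 50%". The sketch below uses a small made-up data set (not from the thread) in which each group has exactly 50% NAs in one variable; na.prop, strict and lenient are illustrative names.

```r
# Toy data: group 1 has 50% NAs in z, group 2 has 50% NAs in y
DF <- data.frame(x = c(1, 1, 2, 2, 2, 2),
                 y = c(1, 2, NA, NA, 3, 4),
                 z = c(NA, 5, 6, 7, 8, 9))

na.ind <- sapply(DF[-1], is.na)

# NA proportion per group and variable, repeated for every row of the group;
# multiplying by 1 turns the logical matrix into a numeric one.
na.prop <- ave(na.ind * 1, rep(DF$x, ncol(na.ind)), col(na.ind))

strict  <- DF[apply(na.prop <  0.5, 1, all), ]   # drops groups at exactly 50%
lenient <- DF[apply(na.prop <= 0.5, 1, all), ]   # keeps groups at exactly 50%

nrow(strict)
nrow(lenient)
```

With the strict comparison no rows remain, while the non-strict one keeps all six rows.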