Ian McPhail
2022-Jul-14 17:59 UTC
[R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)
Hello,

I am looking for some advice on how to select subsets of variables for imputing when using the mice package.

From Van Buuren's original mice paper, I see that selecting variables to be 'skipped' in an imputation can be written as:

ini <- mice(nhanes2, maxit = 0, print = FALSE)
pred <- ini$pred
pred[, "bmi"] <- 0
meth <- ini$meth
meth["bmi"] <- ""

With the last two lines specifying that the "bmi" variable gets skipped over and not imputed.

And I have come across other examples, but all that I have seen lay out a method of skipping variables where EVERY variable is named (as "bmi" is named above). I am wondering if there is a reasonably easy way to select out approximately 30 variables for imputation from a larger dataset with around 2500 variables, without having to name all 2450+ other variables.

Thank you,

Ian
Bert Gunter
2022-Jul-14 18:09 UTC
If I understand your query correctly, you can use negative indexing to omit variables. See ?'[' for details.

> dat <- data.frame(a = 1:3, b = letters[1:3], c = 4:6, d = letters[5:7])
> dat
  a b c d
1 1 a 4 e
2 2 b 5 f
3 3 c 6 g
> dat[, -c(2, 4)]
  a c
1 1 4
2 2 5
3 3 6

Of course you have to know the numerical index of the columns you wish to omit, but something of the sort seems unavoidable in any case.

Cheers,
Bert
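The numeric positions do not have to be known in advance: columns can also be kept or dropped by name. A minimal base R sketch, where `keep` is a hypothetical vector naming the few variables of interest (it is not part of the original post):

```r
# Toy data frame, same as in the example above
dat <- data.frame(a = 1:3, b = letters[1:3], c = 4:6, d = letters[5:7])

# Hypothetical vector naming the variables of interest
keep <- c("a", "c")

# Select the named columns directly ...
dat_keep <- dat[, keep, drop = FALSE]

# ... or derive the complementary names/positions instead of typing them out
drop_names <- setdiff(names(dat), keep)      # "b" "d"
drop_idx   <- which(!names(dat) %in% keep)   # 2 4

# Negative indexing with the derived positions selects the same columns
dat_drop <- dat[, -drop_idx, drop = FALSE]
```

One caution with the negative form: if `keep` happens to cover every column, `drop_idx` is empty and `dat[, -drop_idx]` returns zero columns, so selecting by name is the safer of the two.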
Rui Barradas
2022-Jul-14 18:49 UTC
Hello,

You can use the mice() argument predictorMatrix to tell mice() which variables/blocks are used when imputing which column. Rows are the variables being imputed and columns are the predictors, so if a variable's column is set to zeros, that variable is not used in the imputation of any other column or block.

library(mice)

predmat <- matrix(1L, ncol(nhanes2), ncol(nhanes2),
                  dimnames = list(names(nhanes2), names(nhanes2)))
diag(predmat) <- 0L
predmat[, "bmi"] <- 0L
predmat
#>     age bmi hyp chl
#> age   0   0   1   1
#> bmi   1   0   1   1
#> hyp   1   0   0   1
#> chl   1   0   1   0

Then use the argument where to skip the variables you do not want imputed. Note that this is not the same as what the rows of predmat control: the "bmi" row above only says which variables would be used to impute bmi, not whether bmi is imputed.

The default of where is the matrix is.na(nhanes2), so make a copy of this matrix, set column "bmi" to FALSE, and then call mice().

not_bmi <- is.na(nhanes2)
not_bmi[, "bmi"] <- FALSE

ini_all <- mice(nhanes2, print = FALSE)
ini_bmi <- mice(nhanes2, predictorMatrix = predmat, where = not_bmi, print = FALSE)

cmpl_all <- complete(ini_all)
head(cmpl_all)
#>     age  bmi hyp chl
#> 1 20-39 28.7  no 187
#> 2 40-59 22.7  no 187
#> 3 20-39 30.1  no 187
#> 4 60-99 27.5 yes 284
#> 5 20-39 20.4  no 113
#> 6 60-99 20.4  no 184

cmpl_bmi <- complete(ini_bmi)
head(cmpl_bmi)
#>     age  bmi hyp chl
#> 1 20-39   NA  no 187
#> 2 40-59 22.7  no 187
#> 3 20-39   NA  no 187
#> 4 60-99   NA yes 206
#> 5 20-39 20.4  no 113
#> 6 60-99   NA yes 184

Hope this helps,

Rui Barradas
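For the original question (about 30 variables to impute out of roughly 2500), the same pieces can be driven from a vector of names instead of naming each skipped variable. The following is only a sketch under stated assumptions: `to_impute` is a hypothetical character vector, a small toy data frame stands in for the real data, and the `mice()` calls are guarded so that the rest runs even without the package installed.

```r
# Toy data with missing values, standing in for the real ~2500-column dataset
set.seed(1)
dat <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20), x4 = rnorm(20))
dat$x1[c(2, 5)] <- NA
dat$x2[c(3, 7)] <- NA
dat$x3[c(1, 9)] <- NA

# Hypothetical: the only variables that should be imputed
to_impute <- c("x1", "x2")
skip <- setdiff(names(dat), to_impute)   # everything else, derived not typed

# 'where' starts from the default is.na(dat); all skipped columns become FALSE
where_mat <- is.na(dat)
where_mat[, skip] <- FALSE

if (requireNamespace("mice", quietly = TRUE)) {
  # Dry run (maxit = 0) to get the default settings, as in the original post
  ini  <- mice::mice(dat, maxit = 0, printFlag = FALSE)
  pred <- ini$predictorMatrix
  meth <- ini$method

  pred[, skip] <- 0    # skipped variables are not used as predictors
  meth[skip]   <- ""   # empty method: skipped variables are not imputed

  imp <- mice::mice(dat, method = meth, predictorMatrix = pred,
                    where = where_mat, printFlag = FALSE)

  # x3 is expected to keep its missing values; x1 and x2 should be filled in
  print(colSums(is.na(mice::complete(imp))))
}
```

Zeroing every skipped column is a modelling choice: the variables in `to_impute` are then predicted only from each other. Skipped variables that are fully observed could instead be left in as predictors by zeroing only the skipped columns that contain missing values.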