Ebert,Timothy Aaron
2022-Jul-14 18:11 UTC
[R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)
Maybe this is too simple, but could you use the select() function from dplyr?

Tim

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Bert Gunter
Sent: Thursday, July 14, 2022 2:10 PM
To: Ian McPhail <ivmcphail at gmail.com>
Cc: R-help <r-help at r-project.org>
Subject: Re: [R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)

[External Email]

If I understand your query correctly, you can use negative indexing to omit variables. See ?'[' for details.

> dat <- data.frame(a = 1:3, b = letters[1:3], c = 4:6, d = letters[5:7])
> dat
  a b c d
1 1 a 4 e
2 2 b 5 f
3 3 c 6 g
> dat[, -c(2, 4)]
  a c
1 1 4
2 2 5
3 3 6

Of course you have to know the numerical index of the columns you wish to omit, but something of the sort seems unavoidable in any case.

Cheers,
Bert

On Thu, Jul 14, 2022 at 11:00 AM Ian McPhail <ivmcphail at gmail.com> wrote:
>
> Hello,
>
> I am looking for some advice on how to select subsets of variables for
> imputing when using the mice package.
>
> From Van Buuren's original mice paper, I see that selecting variables
> to be 'skipped' in an imputation can be written as:
>
> ini <- mice(nhanes2, maxit = 0, print = FALSE)
> pred <- ini$pred
> pred[, "bmi"] <- 0
> meth <- ini$meth
> meth["bmi"] <- ""
>
> With the last two lines specifying that the "bmi" variable gets skipped
> over and not imputed.
>
> And I have come across other examples, but all that I have seen lay
> out a method of skipping variables where EVERY variable is named (as
> "bmi" is named above). I am wondering if there is a reasonably easy
> way to select out approximately 30 variables for imputation from a
> larger dataset with around 2500 variables, without having to name all
> 2450+ other variables.
>
> Thank you,
>
> Ian
>
> [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
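The skipping pattern quoted from the mice paper does not require naming the unwanted variables one by one; it can be driven from the short list of variables to keep. The following is a minimal base-R sketch of that idea, not a tested mice workflow. The helper skip_all_but() is hypothetical (it is not part of mice), and the meth and pred objects are hand-built stand-ins shaped like the ini$meth and ini$pred objects in the quoted code.

```r
# Hypothetical helper (not part of mice): blank out the imputation method
# and the predictor columns for every variable NOT named in `keep`.
skip_all_but <- function(meth, pred, keep) {
  stopifnot(all(keep %in% names(meth)))
  skip <- setdiff(names(meth), keep)  # everything we do not want imputed
  meth[skip] <- ""                    # "" marks a variable as skipped
  pred[, skip] <- 0                   # skipped variables are not used as predictors
  list(method = meth, predictorMatrix = pred)
}

# Stand-ins for the objects returned by a maxit = 0 dry run (assumed shapes):
vars <- c("age", "bmi", "hyp", "chl")
meth <- c(age = "", bmi = "pmm", hyp = "logreg", chl = "pmm")
pred <- 1 - diag(length(vars))        # 1 everywhere except the diagonal
dimnames(pred) <- list(vars, vars)

# Only the variables to impute are named; the rest are derived with setdiff()
setup <- skip_all_but(meth, pred, keep = c("bmi", "chl"))
setup$method
```

With real data the two pieces would presumably be passed on through the method and predictorMatrix arguments of mice(). Note that zeroing the predictor columns of every skipped variable, as the quoted code does for "bmi", also drops fully observed variables as predictors, which may or may not be what is wanted.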
@vi@e@gross m@iii@g oii gm@ii@com
2022-Jul-15 01:38 UTC
[R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)
Tim,

Your reply is reasonable if you want to read in EVERYTHING and then use the various nice features of the select() function in the dplyr package of the tidyverse, which let you exclude a bunch of columns based on names starting with, ending with, or containing various characters, or on not being of type integer, and so on.

But another approach is to skip creating some columns in the first place. Many reader functions that take in data from something like a .CSV file allow you to ignore some of the columns and thus, hopefully, cut down on some overhead.

I assume most of us have no real experience with the package called "mice", and who is willing to read to page 72 or so in this document?

https://cran.r-project.org/web/packages/mice/mice.pdf

Anywho, the mice() function this person wants to use has arguments meant to control what is brought in and stored in whatever internal format, as in not taking some rows. A cursory glance suggests no way to suppress columns other than not including them before calling the function, as it does not read the data from a file and expects either a data.frame or a matrix.

So your answer is valid. The questioner can use any method they wish to adjust the initial data.frame and create a partial copy to use. If they want a small subset of 2500+ columns (and who wouldn't), then it may be easiest to simply name them in base R or select(), as in:

    New.df <- Old.df[, c("col36", "col89", "hike")]

On the other hand, if they merely want to exclude lots of columns that have something in common, yes, select() allows things like:

    New.df <- select(Old.df, -ends_with(c("extra", "comment")))

The tidyverse keeps being rewritten, so some new ways may be replacing old, but there are variants like select_if() that allow arbitrary functions to decide which columns to include or exclude, such as based on what type they contain.

So the key is to trim the data before calling the function but leave in everything needed.
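For anyone who prefers to stay in base R, the two kinds of selection described here (keeping a few named columns, or dropping columns that share a name suffix) can be sketched without dplyr. The data frame and its column names below are made up purely for illustration.

```r
# Toy data frame; all column names are invented for this example
old_df <- data.frame(
  col36        = 1:3,
  col89        = c(2.5, NA, 4),
  hike         = c("a", "b", NA),
  notes_extra  = letters[1:3],
  free_comment = LETTERS[1:3]
)

# Keep a handful of columns by naming them
new_df <- old_df[, c("col36", "col89", "hike")]

# Or drop every column whose name ends in "extra" or "comment"
unwanted   <- grepl("(extra|comment)$", names(old_df))
trimmed_df <- old_df[, !unwanted, drop = FALSE]

names(trimmed_df)
```

The drop = FALSE argument keeps the result a data.frame even if only one column survives the filter.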
Only the one asking the question knows what all the columns mean and what rhyme or reason decides which to keep or exclude. A more specific question may get a more specific answer.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.