Hi,

I am a long-time SPSS user but new to R, so please bear with me if my questions seem too basic for you guys.

I am trying to figure out how to analyze survey data using logistic regression with multiple imputation.

I have survey data with about 200,000 cases, and I am trying to estimate odds ratios for a dependent variable using 6 categorical independent variables (dummy-coded). Approximately 10% of the cases (~20,000) have missing data in one or more of the independent variables. The percentage missing ranges from 0.01% to 10% across the independent variables.

My current thinking is to conduct a logistic regression with multiple imputation, but I don't know how to do it in R. I searched the web but couldn't find instructions or examples on how to do this. Since SPSS is hopeless with missing data, I have to learn to do this in R. I am new to R, so I would really appreciate it if someone could show me some examples or tell me where to find resources.

Thank you!

Daniel
Hi Daniel,

First, newer versions of SPSS have dramatically improved their handling of missing data. I believe it's an additional module, and in SPSS-world, each additional module = $$$.

Analyzing missing data is a 3-step process. First, you impute, creating multiple datasets; then you analyze each dataset in the conventional way; then you combine the results.

There are two packages (that I know of) for imputation: mi and mice. rseek.org will find them for you.

Hope that helps,

Jeremy

On 29 June 2010 22:14, Daniel Chen <news at pushih.com> wrote:
> My current thinking is to conduct a logistic regression with multiple
> imputation, but I don't know how to do it in R. I searched the web but
> couldn't find instructions or examples on how to do this. Since SPSS is
> hopeless with missing data, I have to learn to do this in R. I am new to R,
> so I would really appreciate if someone can show me some examples or tell me
> where to find resources.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Jeremy Miles
Psychology Research Methods Wiki: www.researchmethodsinpsychology.com
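For the third step (combining the results), the pooling is usually done with Rubin's rules. Below is a minimal base-R sketch of that step only, assuming each of the m analyses has already produced a coefficient vector and a vector of squared standard errors. The helper name `pool_rubin` and the toy numbers are made up for illustration; they do not come from mi, mice, or any other package.

```r
# Pool m sets of estimates with Rubin's rules.
# Illustrative helper (hypothetical name), not a function from any package.
#   est_list: list of m coefficient vectors
#   var_list: list of m vectors of squared standard errors
pool_rubin <- function(est_list, var_list) {
  m    <- length(est_list)
  est  <- do.call(rbind, est_list)   # m x p matrix of estimates
  vars <- do.call(rbind, var_list)   # m x p matrix of within-imputation variances
  qbar <- colMeans(est)              # pooled point estimate
  ubar <- colMeans(vars)             # average within-imputation variance
  b    <- apply(est, 2, var)         # between-imputation variance
  tvar <- ubar + (1 + 1 / m) * b     # total variance
  df   <- (m - 1) * (1 + ubar / ((1 + 1 / m) * b))^2  # Rubin's degrees of freedom
  data.frame(est = qbar, se = sqrt(tvar), df = df)
}

# Toy input: 3 imputations, 2 coefficients (made-up numbers, not real results)
ests <- list(c(a = 1.0, b = -0.5), c(a = 1.2, b = -0.4), c(a = 1.1, b = -0.6))
vars <- list(c(0.04, 0.09), c(0.05, 0.08), c(0.03, 0.10))
pooled <- pool_rubin(ests, vars)
print(pooled)
```

In practice the imputation packages do this pooling for you; the sketch is only meant to show what "combine the results" amounts to.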
On 6/30/2010 1:14 AM, Daniel Chen wrote:
> My current thinking is to conduct a logistic regression with multiple
> imputation, but I don't know how to do it in R. I searched the web but
> couldn't find instructions or examples on how to do this. Since SPSS is
> hopeless with missing data, I have to learn to do this in R. I am new to R,
> so I would really appreciate if someone can show me some examples or tell me
> where to find resources.

Here is an example that uses the Amelia package to generate the imputations and the mitools and mix packages to make the pooled inferences.
titanic <- read.table("http://lib.stat.cmu.edu/S/Harrell/data/ascii/titanic.txt", sep=',', header=TRUE)

# Introduce some missing values for the example
set.seed(4321)
titanic$sex[sample(nrow(titanic), 10)] <- NA
titanic$pclass[sample(nrow(titanic), 10)] <- NA
titanic$survived[sample(nrow(titanic), 10)] <- NA

library(Amelia)   # generate multiple imputations
library(mitools)  # for MIextract()
library(mix)      # for mi.inference()

# Step 1: create 10 imputed datasets, treating the categorical variables as nominal
titanic.amelia <- amelia(subset(titanic, select=c('survived','pclass','sex','age')),
                         m=10, noms=c('survived','pclass','sex'), emburn=c(500,500))

# Step 2: fit the same logistic regression to each imputed dataset
allimplogreg <- lapply(titanic.amelia$imputations,
                       function(x){glm(survived ~ pclass + sex + age, family=binomial, data = x)})

# Step 3: extract coefficients and standard errors, then pool them
mice.betas.glm <- MIextract(allimplogreg, fun=function(x){coef(x)})
mice.se.glm <- MIextract(allimplogreg, fun=function(x){sqrt(diag(vcov(x)))})

as.data.frame(mi.inference(mice.betas.glm, mice.se.glm))

# Or using only mitools for pooled inference
betas <- MIextract(allimplogreg, fun=coef)
vars <- MIextract(allimplogreg, fun=vcov)
summary(MIcombine(betas,vars))

--
Chuck Cleland, Ph.D.
NDRI, Inc. (www.ndri.org)
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 512-0171 (M, W, F)
fax: (917) 438-0894
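Since the original question asks for odds ratios: pooled logistic-regression coefficients are on the log-odds scale, so the estimates and their interval limits need to be exponentiated. Here is a rough base-R sketch, assuming you already have pooled estimates, standard errors, and degrees of freedom (to my understanding, mi.inference() returns these as est, std.err, and df, but check its help page). The helper name `odds_ratio_table` and the input numbers are hypothetical, chosen only to show the arithmetic.

```r
# Turn pooled log-odds estimates into odds ratios with t-based confidence limits.
# Illustrative helper (hypothetical name), not part of mix or mitools.
odds_ratio_table <- function(est, se, df, level = 0.95) {
  tcrit <- qt(1 - (1 - level) / 2, df)   # critical value, one per coefficient
  data.frame(OR    = exp(est),
             lower = exp(est - tcrit * se),
             upper = exp(est + tcrit * se))
}

# Made-up pooled values for two coefficients (not output from the Titanic example)
or_tab <- odds_ratio_table(est = c(sexmale = -2.5, age = -0.03),
                           se  = c(0.2, 0.01),
                           df  = c(150, 80))
print(or_tab)
```

The interval is computed on the log-odds scale and then exponentiated, which keeps both limits positive.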
On Jun 30, 2010, at 1:14 AM, Daniel Chen wrote:
> My current thinking is to conduct a logistic regression with multiple
> imputation, but I don't know how to do it in R. I searched the web but
> couldn't find instructions or examples on how to do this. Since SPSS is
> hopeless with missing data, I have to learn to do this in R. I am new to R,
> so I would really appreciate if someone can show me some examples or tell me
> where to find resources.

The rms/Hmisc duo of packages has several functions supporting multiple imputation. aregImpute() is nicely integrated with Harrell's other utility functions and is extensively documented in his excellent text, "Regression Modeling Strategies". He also provides quite a bit of free online documentation at his Vanderbilt website. The help page for aregImpute is a small chapter in itself, with multiple worked examples.

install.packages(c("rms", "Hmisc"))
require(rms)  # rms depends on Hmisc, which will load automagically
?aregImpute

-- David Winsemius
David Winsemius, MD
West Hartford, CT