Gary Collins
2008-Aug-21 19:35 UTC
[R] replacing missing values in a dataframe with reference values.
Any thoughts on the following I'd be most grateful - I'm sure there is an easy and quick way to do this but I'm having a mental block this evening. Essentially, I'm trying to replace missing data in my dataset with reference values based on age and sex. So an example dataset is set.seed(1) X = data.frame(age=rnorm(10, 50, 10), sex=rbinom(10, 1, 0.5), A=rnorm(10), B=rnorm(10)) X$agegroup = cut(X$age, breaks=c(20, 30, 40, 50, 60, 70, 80), labels = c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79")) # make some missing X$A[c(1, 5, 6)]=NA X$B[c(2, 3, 8)]=NA and my reference dataset is refval = data.frame(agegroup = rep(c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79"), 2), sex = c(rep(0,6), rep(1,6)),A = c(seq(0, 1, length=6), seq(4,5,length=6)), B = c(seq(1, 2, length=6), seq(3,4,length=6))) My ugly "solution" is for(i in 1:nrow(X)){ if(is.na(X$A[i])){ X$A[i] = refval$A[refval$sex == X$sex[i] & refval$agegroup == X$agegroup[i]] } if(is.na(X$B[i])){ X$B[i] = refval$B[refval$sex == X$sex[i] & refval$agegroup == X$agegroup[i]] } } Which not only is this ugly, but is slow, in that my actual dataset is over 1 million rows with 5 or 6 variables to check and replace missing values. Thanks in advance for any help. Gary ------------------------------------ Dr Gary S Collins Medical Statistician Centre for Statistics in Medicine Wolfson College Annexe University of Oxford Linton Road Oxford, OX2 6UD Tel: +44 (0)1865 284418 Fax: +44 (0)1865 284424 www.csm-oxford.org.uk
Jim Regetz
2008-Aug-22 00:29 UTC
[R] replacing missing values in a dataframe with reference values.
Gary Collins wrote:> Any thoughts on the following I'd be most grateful - I'm sure there is > an easy and quick way to do this but I'm having a mental block this > evening. Essentially, I'm trying to replace missing data in my dataset > with reference values based on age and sex. > > So an example dataset is > set.seed(1) > X = data.frame(age=rnorm(10, 50, 10), sex=rbinom(10, 1, 0.5), > A=rnorm(10), B=rnorm(10)) > > X$agegroup = cut(X$age, breaks=c(20, 30, 40, 50, 60, 70, 80), labels = > c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79")) > > # make some missing > X$A[c(1, 5, 6)]=NA > X$B[c(2, 3, 8)]=NA > > and my reference dataset is > refval = data.frame(agegroup = rep(c("20-29", "30-39", "40-49", "50-59", > "60-69", "70-79"), 2), sex = c(rep(0,6), rep(1,6)),A = c(seq(0, 1, > length=6), seq(4,5,length=6)), B = c(seq(1, 2, length=6), > seq(3,4,length=6)))Especially if you're certain every combination of agegroup and sex in X has a matching entry in the refval data frame, I would just create a concatenated key from your two reference variables, and then use that to do the lookups using rownames: rownames(refval) <- paste(refval$sex, refval$agegroup, sep=".") X.key <- paste(X$sex, X$agegroup, sep=".") X$A[is.na(X$A)] <- refval[X.key[is.na(X$A)], "A"] X$B[is.na(X$B)] <- refval[X.key[is.na(X$B)], "B"] etc. HTH, Jim> My ugly "solution" is > > for(i in 1:nrow(X)){ > if(is.na(X$A[i])){ > X$A[i] = refval$A[refval$sex == X$sex[i] & refval$agegroup == > X$agegroup[i]] > } > if(is.na(X$B[i])){ > X$B[i] = refval$B[refval$sex == X$sex[i] & refval$agegroup == > X$agegroup[i]] > } > } > > Which not only is this ugly, but is slow, in that my actual dataset is > over 1 million rows with 5 or 6 variables to check and replace missing > values. > > Thanks in advance for any help. > > Gary > ------------------------------------ > Dr Gary S Collins > Medical Statistician > Centre for Statistics in Medicine > Wolfson College Annexe > University of Oxford > Linton Road > Oxford, OX2 6UD > > Tel: +44 (0)1865 284418 > Fax: +44 (0)1865 284424 > www.csm-oxford.org.uk > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Apparently Analagous Threads
- passing formula arguments cv.glm
- How can I assign an argument to transfer whether by ref or by value?
- OLS variables
- [PATCH] nv50/ir: use unordered_set instead of list to keep our instructions in uses
- Reading in tab (and space) delimited data within a script XXXX