Coen van Hasselt
2010-Nov-19 15:34 UTC
[R] how to improve this inefficient R code for imputing missing values
Hello all,

I have a big data.frame with multiple studies, subjects, and timepoints per subject, i.e.

    STUDY[,1]  SUBJECT[,2]  ......  WT[,16]  HT[,17]  TEMP[,18]  BSA[,19]
    1          1                    50       170      37         1.90
    1          1                    NA       NA       NA         NA
    1          1                    52       170      38         1.94

In this dataset, three types of missing (demographic) values exist:

1) The first value for a subject is missing,
   i.e. study 1, subject 1: mis X1 X2 X3.
   Here I want to carry the first non-missing value backwards to the missing value.

2) The last value for a subject is missing,
   i.e. study 1, subject 1: X1 X2 X3 mis.
   Here I want to carry the last non-missing value forwards to the missing value.

3) Some "intermediate" value for a subject is missing (like the example data.frame above),
   i.e. study 1, subject 1: X1 mis X2 X3.
   Here I want to impute the missing value with the mean of X1 and X2.

The missing values always span the same subset of columns in the data frame, i.e. the columns WT, HT, TEMP and BSA (m[,16:19]) are always missing together.

I have written some R code that tries to do this, but it is incredibly slow due to the many for-loops and the big dataset I have (and it might not even be completely correct yet).

QUESTION:
I would greatly appreciate it if somebody could give me some guidance/hints on the direction I should take to code the above more efficiently than the horribly inefficient code pasted below.

Thank you in advance and best regards,

Coen

    for (s in unique(m$Study)) {  # for each study
      # for each subject with a missing value
      # (if $Wt is missing, all 4 columns 16:19 are missing)
      for (i in unique(m$Subject[m$Study == s & is.na(m$Wt)])) {
        # rows of this subject with NO missing values
        vals <- which(m$Study == s & m$Subject == i & !is.na(m$Wt))
        # for each row that is missing for subject "i" in study "s"
        for (w in which(m$Study == s & m$Subject == i & is.na(m$Wt))) {
          if (w < min(vals)) {
            # FIRST VALUES MISSING: carry backwards
            m[w, 16:19] <- m[min(vals), 16:19]
          } else if (w > max(vals)) {
            # LAST VALUES MISSING: carry forwards
            m[w, 16:19] <- m[max(vals), 16:19]
          } else {
            # INTERMEDIATE VALUES MISSING: impute with the mean of the neighbours
            maxV <- min(vals[vals > w])
            minV <- max(vals[vals < w])
            # colMeans() gives the per-column mean of the two neighbouring rows
            m[w, 16:19] <- colMeans(m[c(maxV, minV), 16:19], na.rm = TRUE)
          }
        }
      }
    }
Petr PIKAL
2010-Nov-19 16:19 UTC
[R] Re: how to improve this inefficient R code for imputing missing values
Hi

r-help-bounces at r-project.org wrote on 19.11.2010 16:34:04:

Without going too deeply into your code, try the na.locf function from the zoo package. I would split your data into a list according to study and subject, then use na.locf according to your missing value type:

    > x <- c(NA, 1:5)
    > y <- rev(x)
    > x
    [1] NA  1  2  3  4  5
    > y
    [1]  5  4  3  2  1 NA
    > z <- c(y, x)
    > z
     [1]  5  4  3  2  1 NA NA  1  2  3  4  5
    > lll <- list(x, y, z)
    > lll
    [[1]]
    [1] NA  1  2  3  4  5

    [[2]]
    [1]  5  4  3  2  1 NA

    [[3]]
     [1]  5  4  3  2  1 NA NA  1  2  3  4  5

    library(zoo)
    lapply(lll[unlist(lapply(lll, function(x) all(which(is.na(x))==1)))], na.locf, fromLast=TRUE)
    lapply(lll[unlist(lapply(lll, function(x) all(which(is.na(x))==length(x))))], na.locf)

The third type you will probably have to handle without na.locf, but I do not have a clear idea of exactly how. After that you will get three lists, and you can put them together again.

However, I am not sure this is the best way to do what you want.

Regards
Petr

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
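For the intermediate case that the reply leaves open, one possible approach is sketched below in base R. It is a rough sketch under stated assumptions, not a tested solution for the real dataset. The idea is to fill each column forwards and backwards within each study/subject group and to average the two fills wherever a value is missing. Leading gaps then take the next observed value, trailing gaps take the last observed value, and interior gaps take the mean of the two neighbouring observations. The helper names (fill_forward, impute_vec), the toy data frame and its column names, and the assumption that rows are already sorted by time within each subject are illustrative and do not come from the original posts.

```r
# Hypothetical helper: carry the last observed value forward.
# Leading NAs stay NA because there is nothing to carry yet.
fill_forward <- function(x) {
  # position of the most recent non-missing element (0 if none so far)
  idx <- cummax(seq_along(x) * (!is.na(x)))
  c(NA, x)[idx + 1]
}

# Hypothetical helper: impute one column for one subject.
impute_vec <- function(x) {
  fwd  <- fill_forward(x)             # previous observed value
  bwd  <- rev(fill_forward(rev(x)))   # next observed value
  miss <- is.na(x)
  out  <- x
  # mean of previous and next observation; na.rm = TRUE makes leading and
  # trailing gaps fall back to the single neighbour that exists
  out[miss] <- rowMeans(cbind(fwd, bwd)[miss, , drop = FALSE], na.rm = TRUE)
  out[is.nan(out)] <- NA              # subject with no observed value at all
  out
}

# Toy data (assumed layout; rows sorted by time within each subject)
m <- data.frame(
  Study   = c(1, 1, 1, 1, 1, 1, 1),
  Subject = c(1, 1, 1, 2, 2, 2, 2),
  Wt      = c(50, NA, 52, NA, 60, 62, NA),
  Ht      = c(170, NA, 170, NA, 180, 180, NA)
)

cols <- c("Wt", "Ht")                               # columns to impute
key  <- interaction(m$Study, m$Subject, drop = TRUE) # one group per subject
m[cols] <- lapply(m[cols], function(v) ave(v, key, FUN = impute_vec))

m$Wt
# [1] 50 51 52 60 60 62 62
```

This replaces the three nested loops with one vectorised pass per column and group. When several consecutive rows are missing, every row in the gap receives the same neighbour mean, which follows the rule stated in the question; linear interpolation (for example stats::approx with rule = 2) would instead produce a ramp across the gap.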