thr3ads.net - R help - [R] how to improve this inefficient R code for imputing missing values [Nov 2010]

Coen van Hasselt

2010-Nov-19 15:34 UTC

[R] how to improve this inefficient R code for imputing missing values

Hello all,

I have a big data.frame with multiple studies, subjects and timepoints
per subject, e.g.

STUDY[,1] SUBJECT[,2] ...... WT[,16] HT[,17] TEMP[,18] BSA[,19]
1         1                  50      170     37        1.90
1         1                  NA      NA      NA        NA
1         1                  52      170     38        1.94


In this dataset, three types of missing (demographic) values exist:

1) first value for a subject is missing:
i.e. study 1, subject 1: mis X1 X2 X3.
Here I want to carry the first non-missing value backwards to the missing value.

2) last value for a subject is missing:
i.e. study 1, subject 1: X1 X2 X3 mis.
Here I want to carry the last non-missing value forwards to the missing value.

3) some "intermediate" value for a subject is missing (like the example
data.frame above):
i.e. study 1, subject 1: X1 mis X2 X3.
Here I want to impute the missing value with the mean of X1 and X2.

The missing values always span the same subset of columns in the data
frame, i.e. the columns WT, HT, TEMP and BSA (m[,16:19]) are always
missing together.

I have written some R code that tries to do this, but it is incredibly
slow due to the many for-loops and the big dataset I have (and might
not even be completely correct yet).

QUESTION:
I would greatly appreciate it if somebody could give me some
guidance/hints on roughly what direction I should take to code the
above a little more efficiently than the horribly inefficient code
pasted below.

Thank you in advance and best regards,

Coen


for(s in unique(m$Study)){  # for each study
  # for each subject with a missing value
  # (if $Wt is missing, all 4 columns 16:19 are missing)
  for(i in unique(m$Subject[m$Study==s & is.na(m$Wt)])){
    # rows with NO missing values
    vals<-which(m$Study==s & m$Subject==i & !is.na(m$Wt))
    # for each row that is missing for subject "i" and study "s"
    for(w in which(m$Study==s & m$Subject==i & is.na(m$Wt))){
      if(w < min(vals)){          # FIRST VALUES MISSING: carry backwards
        m[w,16:19]<-m[min(vals),16:19]
      } else if(w > max(vals)){   # LAST VALUES MISSING: carry forwards
        m[w,16:19]<-m[max(vals),16:19]
      } else {                    # INTERMEDIATE VALUES MISSING: impute with mean
        maxV<-min(vals[vals>w])
        minV<-max(vals[vals<w])
        # colMeans() gives one mean per column; mean() on a data frame
        # does not do this in current versions of R
        m[w,16:19]<-colMeans(m[c(maxV,minV),16:19],na.rm=TRUE)
      }
    }
  }
}

Petr PIKAL

2010-Nov-19 16:19 UTC

[R] Re: how to improve this inefficient R code for imputing missing values

Hi

r-help-bounces at r-project.org wrote on 19.11.2010 16:34:04:

Without going too deeply into your code, try the na.locf function from
the zoo package.

I would split your data into a list according to study and subject, and
use na.locf according to your missing value types:
> x<-c(NA, 1:5)
> y<-rev(x)
> x
[1] NA  1  2  3  4  5
> y
[1]  5  4  3  2  1 NA
> z<-c(y,x)
> z
 [1]  5  4  3  2  1 NA NA  1  2  3  4  5
> lll<-list(x,y,z)
> lll
[[1]]
[1] NA  1  2  3  4  5

[[2]]
[1]  5  4  3  2  1 NA

[[3]]
 [1]  5  4  3  2  1 NA NA  1  2  3  4  5


library(zoo)
lapply(lll[unlist(lapply(lll, function(x) all(which(is.na(x))==1)))], 
na.locf, fromLast=TRUE)
lapply(lll[unlist(lapply(lll, function(x) 
all(which(is.na(x))==length(x))))], na.locf)

The third case you will probably have to handle without na.locf, but I
do not have a clear idea of exactly how.
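
One option for the intermediate case, not proposed in the thread itself, is linear interpolation with base R's approx(). For a single missing row between two observed rows, the interpolated value is the same as the mean of the two neighbours. The sketch below uses a made-up vector and assumes the values are already in time order:

```r
# Illustrative values only (one interior gap, as in the example data.frame)
wt <- c(50, NA, 52)
ok <- !is.na(wt)

# Interpolate linearly at every position; rule = 2 repeats the nearest
# observed value for positions outside the observed range
filled <- approx(x = which(ok), y = wt[ok],
                 xout = seq_along(wt), rule = 2)$y
filled
# [1] 50 51 52
```

Two caveats: for a run of several consecutive NAs this gives a ramp between the neighbours rather than one constant mean, and approx() needs at least two observed values to interpolate. The zoo package also has na.approx(), which may be worth a look.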

After that you will have three lists, which you can put together again.
However, I am not sure this is the best way to do what you want.
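
As an alternative to splitting into lists and reassembling them, the three rules could be applied per study and subject in one pass with base R's ave(). This is only a sketch: impute_vec is a hypothetical helper, not something from the thread or from zoo, the data are toy values, the column names are taken from the loop in the original post, and rows are assumed to be in time order within each subject.

```r
# Hypothetical helper: fill NAs in one numeric vector.
# Leading NAs take the first observed value, trailing NAs the last one,
# interior NAs the mean of the nearest observed value on each side.
impute_vec <- function(x) {
  na <- is.na(x)
  ok <- which(!na)                       # positions of observed values
  if (length(ok) == 0L) return(x)        # nothing to impute from
  lo <- findInterval(seq_along(x), ok)   # observed positions at or before each row
  prev <- ok[pmax(lo, 1L)]               # nearest observed row before (or first)
  nxt  <- ok[pmin(lo + na, length(ok))]  # nearest observed row after (or last)
  x[na] <- ((x[prev] + x[nxt]) / 2)[na]
  x
}

# Toy data: subject 1 has an interior gap, subject 2 has gaps at both ends
m <- data.frame(
  Study   = c(1, 1, 1, 1, 1, 1),
  Subject = c(1, 1, 1, 2, 2, 2),
  Wt      = c(50, NA, 52, NA, 70, NA),
  Ht      = c(170, NA, 170, NA, 180, NA)
)

# Apply the helper to each column within each Study/Subject group
cols <- c("Wt", "Ht")
m[cols] <- lapply(m[cols], function(v)
  ave(v, m$Study, m$Subject, FUN = impute_vec))
```

With the real data, cols would be the four columns 16:19. Because the work is done on whole vectors per group, this avoids the row-by-row data frame assignments in the original loop.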

Regards
Petr


