Jonathan Baron
2004-Nov-30 16:52 UTC
[R] impute missing values in correlated variables: transcan?
I would like to impute missing data in a set of correlated variables (columns of a matrix). It looks like transcan() from Hmisc is roughly what I want. It says, "transcan automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables." And, "By default, transcan imputes NAs with "best guess" expected values of transformed variables, back transformed to the original scale."

But I can't get it to work. I say

    m1 <- matrix(1:20+rnorm(20),5,) # four correlated variables
    colnames(m1) <- paste("R",1:4,sep="")
    m1[c(2,19)] <- NA # simulate some missing data
    library(Hmisc)
    transcan(m1,data=m1)

and I get

    Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
      fewer than 6 non-missing observations with knots omitted

I've tried a few other things, but I think it is time to ask for help.

The specific problem is a real one. Our graduate admissions committee (4 members) rates applications, and we average the ratings to get an overall rating for each applicant. Sometimes one of the committee members is absent, or late; hence the missing data. The members differ in the way they use the rating scale, in both slope and intercept (if you regress each on the mean). Many decisions end up depending on the second decimal place of the averages, so we want to do better than just averaging the non-missing ratings.

Maybe I'm just not seeing something really simple. In fact, the problem is simpler than transcan assumes, since we are willing to assume linearity of the regression of each variable on the other variables. Other members proposed solutions that assumed this, but they did not take into account the fact that missing data at the high or low end of each variable (each member's ratings) would change its mean.

Jon
--
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: sas.upenn.edu/~baron
R search page: finzi.psych.upenn.edu
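For readers following the linearity assumption described in this post, here is a minimal base-R sketch of regression imputation: each rater's missing ratings are predicted from a linear regression on the other raters. It is an illustration under stated assumptions, not the poster's method. The simulated data, the number of applicants, and the helper name impute_linear are all hypothetical.

```r
set.seed(1)
n <- 40                                  # hypothetical number of applicants
quality <- rnorm(n)                      # simulated latent applicant quality

# Four raters who use the scale with different intercepts and slopes
# (all coefficients are made up for illustration)
ratings <- cbind(R1 = 3.0 + 1.0 * quality,
                 R2 = 2.5 + 0.6 * quality,
                 R3 = 3.5 + 1.4 * quality,
                 R4 = 3.0 + 0.9 * quality) + rnorm(4 * n, sd = 0.2)

ratings[c(2, 17), "R1"] <- NA            # simulate an absent rater
ratings[c(5, 30), "R3"] <- NA

# Hypothetical helper: fill each column's NAs with predictions from a
# linear regression of that column on all the other columns.
impute_linear <- function(m) {
  d   <- as.data.frame(m)
  out <- d
  for (v in names(d)) {
    miss <- is.na(d[[v]])
    if (!any(miss)) next
    # lm() drops incomplete rows, so the fit uses complete cases only
    fit <- lm(reformulate(setdiff(names(d), v), response = v), data = d)
    # Rows that are also missing a predictor get NA and stay missing
    out[[v]][miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
  }
  as.matrix(out)
}

completed <- impute_linear(ratings)
overall   <- rowMeans(completed)         # overall rating per applicant
```

Because the prediction uses each rater's own fitted intercept and slope, an absent rater who tends to score high or low is filled in on that rater's scale rather than on the scale of the remaining raters. This is single imputation, so it does not reflect the uncertainty in the filled-in values.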
roger koenker
2004-Nov-30 17:23 UTC
[R] impute missing values in correlated variables: transcan?
At the risk of stirring up a hornet's nest, I'd suggest that means are dangerous in such applications. A nice paper on combining ratings is:

Gilbert Bassett and Joseph Persky, Rating Skating, JASA, 1994, 1075-1079.

Roger Koenker
Department of Economics
University of Illinois
Champaign, IL 61820
url: econ.uiuc.edu/~roger
email: rkoenker at uiuc.edu
vox: 217-333-4558
fax: 217-244-6678

On Nov 30, 2004, at 10:52 AM, Jonathan Baron wrote:
> [...]
> Our graduate admissions committee (4 members) rates applications,
> and we average the ratings to get an overall rating for each
> applicant.
> [...]
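One way to see the caution about means is to compare row means with row medians on a toy example. The sketch below is not the procedure from the cited paper; the ratings and applicant labels are made up purely for illustration.

```r
# Made-up ratings for three applicants (rows) from four raters (columns)
r <- rbind(A = c(4.0, 4.1, 3.9, 4.0),
           B = c(4.0, 4.1, 3.9, 1.0),    # one rater far from the rest
           C = c(3.5, NA,  3.6, 3.4))    # one rating missing

res <- data.frame(mean   = rowMeans(r, na.rm = TRUE),
                  median = apply(r, 1, median, na.rm = TRUE))
res
```

For applicant B the single outlying rating pulls the mean down to 3.25, while the median stays at 3.95.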
Frank E Harrell Jr
2004-Nov-30 19:21 UTC
[R] impute missing values in correlated variables: transcan?
Jonathan Baron wrote:
> [...]
> But I can't get it to work. I say
>
> m1 <- matrix(1:20+rnorm(20),5,) # four correlated variables
> colnames(m1) <- paste("R",1:4,sep="")
> m1[c(2,19)] <- NA # simulate some missing data
> library(Hmisc)
> transcan(m1,data=m1)
>
> and I get
>
> Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
> fewer than 6 non-missing observations with knots omitted

Jonathan - you would need many more observations to be able to fit flexible additive models as transcan does. Also note that single imputation has problems and you may want to consider multiple imputation as done by the Hmisc aregImpute function, if you had more data.

Frank

> [...]

--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University
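As a rough sketch of the multiple-imputation suggestion, the following runs aregImpute on a larger simulated data set, completes the data once per imputation, and averages the resulting overall ratings. The data, sample size, and variable names are assumptions for illustration. Setting nk = 0 is intended to force linear fits, in line with the linearity assumption in the original question. The block is skipped if Hmisc is not installed.

```r
if (requireNamespace("Hmisc", quietly = TRUE)) {
  library(Hmisc)
  set.seed(2)
  n <- 60                                # hypothetical number of applicants
  quality <- rnorm(n)                    # simulated latent applicant quality
  d <- data.frame(R1 = 3.0 + 1.0 * quality + rnorm(n, sd = 0.2),
                  R2 = 2.5 + 0.6 * quality + rnorm(n, sd = 0.2),
                  R3 = 3.5 + 1.4 * quality + rnorm(n, sd = 0.2),
                  R4 = 3.0 + 0.9 * quality + rnorm(n, sd = 0.2))
  d$R1[c(2, 17, 33)] <- NA               # simulate absent raters
  d$R3[c(5, 30, 41)] <- NA

  n_imp <- 5                             # number of imputations (arbitrary)
  imp <- aregImpute(~ R1 + R2 + R3 + R4, data = d,
                    n.impute = n_imp, nk = 0, pr = FALSE)

  # One column of overall ratings per completed data set
  overall <- sapply(seq_len(n_imp), function(i) {
    filled <- impute.transcan(imp, imputation = i, data = d,
                              list.out = TRUE, pr = FALSE, check = FALSE)
    completed <- d
    completed[names(filled)] <- lapply(filled,
                                       function(x) as.numeric(unclass(x)))
    rowMeans(completed)
  })

  pooled <- rowMeans(overall)            # average rating across imputations
  spread <- apply(overall, 1, sd)        # between-imputation variability
}
```

The between-imputation spread is zero for applicants with complete ratings and positive where a rating was imputed, which gives a sense of how much the second decimal place of an average can be trusted.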