Hi,

Vince and I have noticed a problem with non-syntactic names in data
frames and some modeling code (but not all modeling code).

The following, while almost surely as documented, could be a bit more
helpful:

   m = matrix(rnorm(100), nc=10)
   colnames(m) = paste(1:10, letters[1:10], sep="_")

   d = data.frame(m, check.names=FALSE)

   f = formula(`1_a` ~ ., data=d)

   tm = terms(f, data=d)

   ##failure here, as somehow back-ticks have become part of the name
   ##not a quoting mechanism
   d[attr(tm, "term.labels")]

The "variables" attribute of the terms object keeps the back-ticks as
quoting only, so modeling code that uses that attribute seems fine, but
code that uses the term.labels fails. In particular, of the functions we
tested, glm, lda and randomForest seem to work fine, while nnet and rpart
can't handle non-syntactic names in formulae as such.

In particular, rpart contains this code:

   lapply(m[attr(Terms, "term.labels")], tfun)

which fails for the reasons given.

One way to get around this might be to modify the do_termsform code.
Right now we have:

   PROTECT(varnames = allocVector(STRSXP, nvar));
   for (v = CDR(varlist), i = 0; v != R_NilValue; v = CDR(v))
       SET_STRING_ELT(varnames, i++,
                      STRING_ELT(deparse1line(CAR(v), 0), 0));

and then for term.labels we copy over the varnames (with :, as needed).
Perhaps we need to save the unquoted names somewhere?

Or is there some other approach that will get us there? Certainly
cleaning up the names via

   cleanTick = function(x) gsub("`", "", x)

works, but it seems a bit ugly, and it might be better if the modeling
code was modified.

best wishes

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
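A minimal sketch of two R-level workarounds for the failure described
above, assuming every term is a plain column name (no interactions and no
calls such as log(x)). It rebuilds the example data, then recovers usable
column names either by stripping back-ticks, as cleanTick does, or by
asking all.vars() for the variable names, which are returned as plain
strings. The names cols1 and cols2 are illustrative only and do not come
from the original message.

```r
# Rebuild the example data with non-syntactic column names.
m <- matrix(rnorm(100), ncol = 10)
colnames(m) <- paste(1:10, letters[1:10], sep = "_")
d <- data.frame(m, check.names = FALSE)

tm <- terms(`1_a` ~ ., data = d)
labs <- attr(tm, "term.labels")

# Workaround 1: strip back-ticks from the labels (the cleanTick idea).
# Assumption: no column name legitimately contains a back-tick.
cleanTick <- function(x) gsub("`", "", x, fixed = TRUE)
cols1 <- cleanTick(labs)

# Workaround 2: take the predictor names from the formula itself.
# all.vars() returns names as plain strings, without quoting.
# Assumption: each term is a single variable, so names and terms line up.
cols2 <- all.vars(delete.response(tm))

# Either vector can now be used to index the data frame.
sub1 <- d[cols1]
sub2 <- d[cols2]
```

Neither workaround addresses the underlying question of what term.labels
should contain; they only let code like the rpart line quoted above index
the data frame.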
Robert Gentleman wrote:
> [...]
> One way to get around this, might be to modify the do_termsform code,
> right now we have:
>    PROTECT(varnames = allocVector(STRSXP, nvar));
>    for (v = CDR(varlist), i = 0; v != R_NilValue; v = CDR(v))
>        SET_STRING_ELT(varnames, i++, STRING_ELT(deparse1line(CAR(v),
> 0), 0));
>
> and then for term.labels, we copy over the varnames (with :, as
> needed) and perhaps we need to save the unquoted names somewhere?
> [...]

Hmm, .Internal(deparse(....)) has a backtick option (for related
reasons, IIRC). Could this be used instead of deparse1line?

(There's an inbuilt contradiction in having special terms like "(Intercept)" and at the same time allowing arbitrary non-syntactical names, but I suppose that people who actually name their variables `(Intercept)` deserve whatever they get.)
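
To illustrate the backtick option mentioned above, here is a small sketch
using the R-level deparse() function. It only shows how the option changes
the string produced for a non-syntactic symbol. It does not show the
C-level change to do_termsform, and whether that change is the right fix
is left open. The data frame and variable names are made up for the
example.

```r
# A symbol with a non-syntactic name.
nm <- as.name("1_a")

# With backtick = TRUE the quoting becomes part of the string;
# with backtick = FALSE the bare name is returned.
ticked <- deparse(nm, backtick = TRUE)
plain  <- deparse(nm, backtick = FALSE)

# A toy data frame (hypothetical) with a matching column name.
d <- data.frame(`1_a` = 1:3, `2_b` = 4:6, check.names = FALSE)

# Indexing by the bare name works ...
ok <- d[plain]

# ... while the back-ticked string is not a column name, so indexing fails.
res <- tryCatch(d[ticked], error = function(e) e)
```

Here ok holds the `1_a` column, while res holds the error condition raised
by the failed lookup, which is the same kind of failure as the term.labels
indexing in the original report.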