I would like to be able to use gsub or gsubfn to process a formula and to translate the variables but to ignore expressions in the formula. Supposing that the R formula has already been transformed into a character string and that the transformation is to convert variable names to upper case and to append z to the names, an example would be to convert

    y1 + y2 ~ a*(b + c) + d + f * (h == 3) + (sex == 'male')*i

to

    Y1z + Y2z ~ Az*(Bz + Cz) + Dz + Fz * (h == 3) + (sex == 'male')*Iz

Any expression that is not just a simple variable name would be left alone.

Does anyone want to try their hand at creating a regex that would accomplish this?

Thanks
Frank
--
Frank E Harrell Jr Professor and Chairman School of Medicine
Department of Biostatistics Vanderbilt University
This might be hard. How to tell f is to be changed while h is NOT ...

Thanks,
Guanrao

________________________________
From: Frank Harrell <f.harrell@vanderbilt.edu>
To: RHELP <R-help@stat.math.ethz.ch>
Sent: Wednesday, August 14, 2013 11:13 PM
Subject: [R] regex challenge
I think substitute() or bquote() will do a better job here than gsub() because they work on the parsed formula rather than on the raw string. The terms() function will interpret the formula-specific operators like "+" and ":" to come up with a list of the 'variables' (or 'terms') in the formula.

E.g., with the 'f' given below we get

    > f(y1 + y2 ~ a*(b + c) + d + f * (h == 3) + (sex == 'male')*i)
    Y1z + Y2z ~ Az * (Bz + Cz) + Dz + Fz * (h == 3) + (sex == "male") * Iz

Is that what you wanted? If you only wanted to keep intact the expressions of the form var==value (calls to `==`) but transform things like log(a) to log(Az) you could extend this code to do that as well.

    f <- function(formula) {
        trms <- terms(formula)
        variables <- as.list(attr(trms, "variables"))[-1]
        # the 'variables' attribute is stored as a call to list(),
        # so we changed the call to a list and removed the first element
        # to get the variables themselves.
        if (attr(trms, "response") == 1) {
            # terms does not pull apart the left-hand side (the response)
            # of the formula, so we assume each non-function is to be renamed.
            responseVars <- lapply(all.vars(variables[[1]]), as.name)
            variables <- variables[-1]
        } else {
            responseVars <- list()
        }
        # omit non-name variables from list of ones to change.
        # This is where you could expand calls to certain functions.
        variables <- variables[vapply(variables, is.name, TRUE)]
        variables <- c(responseVars, variables) # all are names now
        names(variables) <- vapply(variables, as.character, "")
        newVars <- lapply(variables, function(v) as.name(paste0(toupper(v), "z")))
        formula(do.call("substitute", list(formula, newVars)),
                env=environment(formula))
    }

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
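A small sketch of the step Bill describes, for anyone who wants to see what terms() hands back before any renaming happens. The object names (fmla, vars, simpleVars) are made up for illustration and are not part of Bill's function; only base R is assumed.

```r
# Example formula from the original question.
fmla <- y1 + y2 ~ a*(b + c) + d + f * (h == 3) + (sex == 'male')*i

# The "variables" attribute is a call to list(); turning it into a list
# and dropping the first element (the name `list`) leaves the variables.
vars <- as.list(attr(terms(fmla), "variables"))[-1]

# The response (y1 + y2) and the comparisons are calls, not names, so
# is.name() separates the plain variables from everything else.
isName <- vapply(vars, is.name, TRUE)
simpleVars <- vapply(vars[isName], as.character, "")

print(vars)        # every variable, including calls such as h == 3
print(simpleVars)  # only the plain names on the right-hand side
```

Because the response is a single call here, it does not show up in simpleVars, which is why f() treats the response separately with all.vars().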
Here is a first stab:

    library(gsubfn)

    test <- "y1 + y2 ~ a*(b + c) + d + f * (h == 3) + (sex == 'male')*i"

    gsubfn("([a-zA-Z][a-zA-Z0-9]*)((?=\\s*[-+~)*])|\\s*$)",
           function(x, ...) paste0(toupper(x), 'z'),
           test, perl=TRUE)

--
Gregory (Greg) L. Snow Ph.D.
538280@gmail.com
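The same lookahead idea can be tried without gsubfn. The following is only a sketch: it uses base gsub() with perl=TRUE and the \U ... \E case conversion in the replacement, and the names pat and renameByRegex are invented here, so it is similar to Greg's call rather than identical. The second call shows the kind of input where a purely textual rule cannot know it is inside an expression.

```r
test <- "y1 + y2 ~ a*(b + c) + d + f * (h == 3) + (sex == 'male')*i"

# An identifier counts as a "variable" only if it is followed (after
# optional white space) by one of - + ~ ) * or by the end of the string.
pat <- "([a-zA-Z][a-zA-Z0-9]*)(?=\\s*[-+~)*]|\\s*$)"

# \\U upper-cases the captured name, \\E stops the conversion so the
# appended z stays lower case (both are available only with perl = TRUE).
renameByRegex <- function(s) gsub(pat, "\\U\\1\\Ez", s, perl = TRUE)

renameByRegex(test)
# "Y1z + Y2z ~ Az*(Bz + Cz) + Dz + Fz * (h == 3) + (sex == 'male')*Iz"

# Inside a comparison the rule misfires: a is followed by "-", so it is
# renamed even though it is part of the expression a - b > 0.
renameByRegex("y ~ x + (a - b > 0)")
# "Yz ~ Xz + (Az - b > 0)"
```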
I really appreciate the excellent ideas from Bill Dunlap and Greg Snow. Both suggestions almost work perfectly. Greg's recognizes expressions such as sex=='female' but not ones such as age > 21, age < 21, a - b > 0, and possibly other legal R expressions. Bill's idea is similar to what Duncan Murdoch suggested to me. Bill's doesn't catch the case when a variable appears both in an expression and as a regular variable (sex in the example below):

    f <- function(formula) {
        trms <- terms(formula)
        variables <- as.list(attr(trms, "variables"))[-1]
        ## the 'variables' attribute is stored as a call to list(),
        ## so we changed the call to a list and removed the first element
        ## to get the variables themselves.
        if (attr(trms, "response") == 1) {
            ## terms does not pull apart the left-hand side (the response)
            ## of the formula, so we assume each non-function is to be renamed.
            responseVars <- lapply(all.vars(variables[[1]]), as.name)
            variables <- variables[-1]
        } else {
            responseVars <- list()
        }
        ## omit non-name variables from list of ones to change.
        ## This is where you could expand calls to certain functions.
        variables <- variables[vapply(variables, is.name, TRUE)]
        variables <- c(responseVars, variables) # all are names now
        names(variables) <- vapply(variables, as.character, "")
        newVars <- lapply(variables, function(v) as.name(paste0(toupper(v), "z")))
        formula(do.call("substitute", list(formula, newVars)),
                env=environment(formula))
    }

    a <- cat + (age + Heading("Females") * (sex == "Female") * sbp) *
        Heading() * g + (age + sbp) * Heading() * trio ~ Heading() *
        country * Heading() * sex
    f(a)

Output:

    CATz + (AGEz + Heading("Females") * (SEXz == "Female") * SBPz) *
        Heading() * Gz + (AGEz + SBPz) * Heading() * TRIOz ~ Heading() *
        COUNTRYz * Heading() * SEXz

The method also doesn't work if I replace sex == 'Female' with x3 > 4, converting to X3z > 4. I'm not clear on how to code what kind of expressions to ignore.

Thanks!
Frank
Bill, that is very impressive. The only problem I'm having is that I want the paste0(toupper(...)) to be a general function that returns a character string that is a legal part of a formula object but that can't be converted to a 'name'.

Frank

-------------------------------
Oops, I left "(" out of the list of operators.

    ff <- function(expr) {
        if (is.call(expr) && is.name(expr[[1]]) &&
            is.element(as.character(expr[[1]]),
                       c("~","+","-","*","/","%in%","("))) {
            for(i in seq_along(expr)[-1]) {
                expr[[i]] <- Recall(expr[[i]])
            }
        } else if (is.name(expr)) {
            expr <- as.name(paste0(toupper(as.character(expr)), "z"))
        }
        expr
    }

    > ff(a)
    CATz + (AGEz + Heading("Females") * (sex == "Female") * SBPz) *
        Heading() * Gz + (AGEz + SBPz) * Heading() * TRIOz ~ Heading() *
        COUNTRYz * Heading() * SEXz

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf
> Of William Dunlap
> Sent: Thursday, August 15, 2013 6:03 PM
> To: Frank Harrell; RHELP
> Subject: Re: [R] regex challenge
>
> Try this one
>
> ff <- function (expr)
> {
>     if (is.call(expr) && is.name(expr[[1]]) &&
>         is.element(as.character(expr[[1]]),
>                    c("~", "+", "-", "*", "/", ":", "%in%"))) {
>         # the above list should cover the standard formula operators.
>         for (i in seq_along(expr)[-1]) {
>             expr[[i]] <- Recall(expr[[i]])
>         }
>     }
>     else if (is.name(expr)) {
>         # the conversion itself
>         expr <- as.name(paste0(toupper(as.character(expr)), "z"))
>     }
>     expr
> }
>
> > ff(a)
> CATz + (age + Heading("Females") * (sex == "Female") * sbp) *
>     Heading() * Gz + (age + sbp) * Heading() * TRIOz ~ Heading() *
>     COUNTRYz * Heading() * SEXz
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
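On the question of which expressions get ignored: ff decides by looking at the head of each call, so it may help to look at those heads directly. This is just an illustrative sketch; callHead is a made-up helper, not something from Bill's code.

```r
# Return the function name at the head of a call, or NA for non-calls
# (plain names, constants) and for calls whose head is not a simple name.
callHead <- function(expr) {
    if (is.call(expr) && is.name(expr[[1]])) {
        as.character(expr[[1]])
    } else {
        NA_character_
    }
}

e <- quote(f * (h == 3))

callHead(e)               # "*"  : a formula operator, so ff recurses
callHead(e[[3]])          # "("  : recursed into once "(" is in the list
callHead(e[[3]][[2]])     # "==" : not in the list, left untouched
callHead(quote(x3 > 4))   # ">"  : likewise left untouched
callHead(quote(h))        # NA   : a bare name, which is what gets renamed
```

So nothing has to be listed as "ignore this": any call whose head is not one of the formula operators is simply never descended into, and the names inside it stay as they are.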
Thanks Bill. The problem is that one of the results of convertName might be 'Heading("Age in Years")*age' (this is for the tables package), and as.name converts this to `Heading("...")*age`. The backticks cause the final formula to have a mixture of regular elements and ` ` quoted expression elements, making the formula invalid.

Best,
Frank

-------------------------------------------------------------------
The following makes the name converter function an argument to ff (and restores the colon operator to the list of formula operators), but I'm not sure what you need the converter to do.

    ff <- function(expr, convertName = function(name) paste0(toupper(name), "z")) {
        if (is.call(expr) && is.name(expr[[1]]) &&
            is.element(as.character(expr[[1]]),
                       c("~","+","-","*","/","%in%","(", ":"))) {
            for(i in seq_along(expr)[-1]) {
                expr[[i]] <- Recall(expr[[i]], convertName = convertName)
            }
        } else if (is.name(expr)) {
            expr <- as.name(convertName(expr))
        }
        expr
    }

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf
> Of Frank Harrell
> Sent: Thursday, August 15, 2013 7:47 PM
> To: RHELP
> Subject: Re: [R] regex challenge
>
> Bill, that is very impressive. The only problem I'm having is that I want
> the paste0(toupper(...)) to be a general function that returns a
> character string that is a legal part of a formula object but that can't
> be converted to a 'name'.
>
> Frank
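One possible way around the backtick problem, offered only as a sketch and not as something proposed in the thread: parse the string returned by the converter into a language object instead of wrapping it with as.name(). The names ff2 and lab below are hypothetical, and the Heading() calls are never evaluated here, so the tables package is not needed to run it.

```r
ff2 <- function(expr, convertName = function(name) paste0(toupper(name), "z")) {
    formulaOps <- c("~", "+", "-", "*", "/", ":", "%in%", "(")
    if (is.call(expr) && is.name(expr[[1]]) &&
        as.character(expr[[1]]) %in% formulaOps) {
        # Recurse only through formula operators, as in ff.
        for (i in seq_along(expr)[-1]) {
            expr[[i]] <- ff2(expr[[i]], convertName = convertName)
        }
    } else if (is.name(expr)) {
        # Assumption: convertName returns a single string holding one
        # syntactically valid R expression. Parsing it yields a name or
        # a call, so no backtick-quoted pseudo-name is created.
        txt <- convertName(as.character(expr))
        expr <- parse(text = txt, keep.source = FALSE)[[1]]
    }
    expr
}

# Hypothetical converter in the spirit of the Heading() example.
lab <- function(name) sprintf('Heading("%s") * %s', toupper(name), name)

g <- ff2(y ~ age + (sex == "Female") * sbp, lab)
print(g)
all.vars(g)   # sex is still there, untouched inside the comparison
```

If the converter can return text containing lower-precedence operators such as "+", having it wrap its result in parentheses is probably the safer choice, since the replacement is spliced in as a single operand.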
Bill I found a workaround:

    f <- ff(formula, lab)
    f <- as.formula(gsub("`", "", as.character(deparse(f))))

Thanks for your elegant solution.

Frank
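One caveat with the workaround: deparse() can return more than one string when the formula is long, so it may be safer to collapse the pieces before handing them to as.formula(). A minimal sketch, with stripBackticks as an invented name:

```r
stripBackticks <- function(f, env = parent.frame()) {
    # Join the deparsed pieces into one string, then drop every backtick.
    txt <- paste(deparse(f, width.cutoff = 500L), collapse = " ")
    as.formula(gsub("`", "", txt, fixed = TRUE), env = env)
}

# A formula containing a backtick-quoted pseudo-name of the kind that
# as.name() produces from 'Heading("Age in Years") * age'.
f <- y ~ `Heading("Age in Years") * age` + sbp
g <- stripBackticks(f)
print(g)
all.vars(g)
```

Note that this removes all backticks, so it would also mangle a formula that legitimately contains non-syntactic variable names.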