Hello R Folks...

I've been looking around the 'net and I see many complex solutions in various languages to this question, but I have a pretty simple need (and I'm not much good at regex). I want to use a chemical formula as a function argument. The formula would be in "Hill order", which lists C, then H, then all other elements in alphabetical order. My examples will have only a limited number of elements, few enough that one can search directly for each element. So some examples would be C5H12, or C5H12O, or C5H11BrO (note that for oxygen and bromine, O or Br, there is no following number, meaning a 1 is implied).

Let's say

> form <- "C5H11BrO"

I'd like to get the count of each element, so in this case I need to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular weight by multiplying). Sounds pretty simple, but my experiments with grep and strsplit don't immediately clue me in to an obvious solution. As I said, I don't need a general solution to the problem of calculating molecular weight from an arbitrary formula, which seems quite challenging, just a way to convert "form" into a list or data frame which I can then do the math on.

Here's hoping this is a simple issue for more experienced R users! TIA, Bryan
***********
Bryan Hanson
Professor of Chemistry & Biochemistry

try this:

> f.extract <- function(formula)
+ {
+     # pattern to match the initial chemical
+     # assumes chemical starts with an upper case and optional lower case followed
+     # by zero or more digits.
+     first <- "^([[:upper:]][[:lower:]]?)([0-9]*).*"
+     # inverse of above to remove the initial chemical
+     last <- "^[[:upper:]][[:lower:]]?[0-9]*(.*)"
+     result <- list()
+     extract <- formula
+     # repeat as long as there is data
+     while ((start <- nchar(extract)) > 0){
+         chem <- sub(first, '\\1 \\2', extract)
+         extract <- sub(last, '\\1', extract)
+         # if the number of characters is the same, then there was an error
+         if (nchar(extract) == start){
+             warning("Invalid formula:", formula)
+             return(NULL)
+         }
+         # append to the list
+         result[[length(result) + 1L]] <- strsplit(chem, ' ')[[1]]
+     }
+     result
+ }
> f.extract("C5H11BrO")
[[1]]
[1] "C" "5"

[[2]]
[1] "H"  "11"

[[3]]
[1] "Br"

[[4]]
[1] "O"

> f.extract("H2O")
[[1]]
[1] "H" "2"

[[2]]
[1] "O"

> f.extract("CCC")
[[1]]
[1] "C"

[[2]]
[1] "C"

[[3]]
[1] "C"

> f.extract("Crr")  # bad
NULL
Warning message:
In f.extract("Crr") : Invalid formula:Crr
>

On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
> Hello R Folks...
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

There might be something simpler, but this is what I came up with:

form = "C5H11BrO"
# start position of each upper-case letter, plus one position past the end of the string
ups = c(gregexpr("[[:upper:]]", form)[[1]], nchar(form) + 1)
# cut the string into one piece per element, e.g. "C5", "H11", "Br", "O"
separated = sapply(1:(length(ups)-1), function(x) substr(form, ups[x], ups[x+1] - 1))
elements = gsub("[[:digit:]]", "", separated)
nums = gsub("[[:alpha:]]", "", separated)
# a missing count means 1
ans = data.frame(element = as.character(elements),
                 num = as.numeric(ifelse(nums == "", 1, nums)),
                 stringsAsFactors = FALSE)

--
View this message in context: http://r.789695.n4.nabble.com/Parsing-a-Simple-Chemical-Formula-tp3164562p3164581.html
Sent from the R help mailing list archive at Nabble.com.
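
Since the stated goal is the molecular weight, here is a minimal sketch of the last step, assuming a data frame shaped like "ans" above. The atomic_weight lookup vector and the name mol_weight are illustrative and not part of the original reply; the weights are abridged standard values and cover only the four elements in the example.

```r
# Counts for "C5H11BrO", in the same shape as the data frame built above.
ans <- data.frame(element = c("C", "H", "Br", "O"),
                  num = c(5, 11, 1, 1),
                  stringsAsFactors = FALSE)

# Hypothetical lookup table (abridged standard atomic weights);
# extend it for any other elements you expect to see.
atomic_weight <- c(C = 12.011, H = 1.008, O = 15.999, Br = 79.904)

# Fail early if the formula contains an element the table does not know.
unknown <- setdiff(ans$element, names(atomic_weight))
if (length(unknown) > 0) {
  stop("No atomic weight for: ", paste(unknown, collapse = ", "))
}

# Multiply each count by its atomic weight and add up.
mol_weight <- sum(ans$num * atomic_weight[ans$element])
mol_weight  # about 167.05
```

Indexing the named vector by ans$element keeps the weights aligned with the rows, so the order of elements in the formula does not matter.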

On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
> [...]

This can be done with strapply in the gsubfn package. It matches the regular expression against the target string and passes the back references (the parenthesized portions of the regular expression) to a specified function as successive arguments. The first arg is form, your input string. The second arg is the regular expression, which matches an upper case letter optionally followed by lower case letters, all of which is optionally followed by digits. The third arg is a function given in formula notation; strapply passes the two back references to it as its two arguments, and it returns the element together with its count, substituting 1 when the digits are missing. The simplify arg is another function in formula notation, which turns the result into a matrix and then a data frame. Finally we make the second column of the data frame numeric.

library(gsubfn)
DF <- strapply(form,
               "([A-Z][a-z]*)(\\d*)",
               ~ c(..1, if (nchar(..2)) ..2 else 1),
               simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
DF[[2]] <- as.numeric(DF[[2]])

DF looks like this:

> DF
  V1 V2
1  C  5
2  H 11
3 Br  1
4  O  1

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
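
For readers who would rather not add a package, the same idea can be sketched in base R with regmatches() and gregexpr(). This is a swapped-in technique, not the strapply call above, and it needs a version of R recent enough to have regmatches(). The column names element and count are my own choice.

```r
form <- "C5H11BrO"

# Pull out every "element symbol plus optional digits" token.
tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]

# Strip trailing digits to get the symbol, strip leading letters to get the count.
element <- sub("[0-9]+$", "", tokens)
count <- sub("^[A-Za-z]+", "", tokens)

# An empty count string means an implied 1.
count <- as.numeric(ifelse(count == "", "1", count))

DF <- data.frame(element = element, count = count, stringsAsFactors = FALSE)
DF
```

Unlike strapply, this makes two passes over the tokens instead of using the back references directly, which is a little more verbose but keeps everything in base R.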

On Dec 26, 2010, at 6:29 PM, Bryan Hanson wrote:
> [...]
> Let's say
>
> > form <- "C5H11BrO"

Well, here's how I see it. The "form" can be split with a regular expression: a capital letter followed by zero or one lower-case letters, followed by a variable number of digits.

greg <- gregexpr("[A-Z]{1}[a-z]?[0-9]*", form)

Append a number equal to one more than the length of the string, for reasons that will become clear:

ugreg <- c(unlist(greg), nchar(form)+1)

Then use the substr function to serially pick from a split point to one minus the next split point (or, in the case of the last element, to the end of the string, which is one minus the appended value):

> sapply(1:(length(ugreg)-1), function(z) substr(form, ugreg[z], ugreg[z+1]-1) )
[1] "C5"  "H11" "Br"  "O"

Then you can split these "triples" (cap, lower, n) and, if n is absent, assume 1.

> sub("(\\d*)$", "",                          # blank out the digits
+     sapply(1:(length(ugreg)-1),
+            function(z) substr(form, ugreg[z], ugreg[z+1]-1) ) )
[1] "C"  "H"  "Br" "O"

> sub("^$", "1", sub("([A-Za-z]*)", "",       # subst "1" for empty strings
+     sapply(1:(length(ugreg)-1),
+            function(z) substr(form, ugreg[z], ugreg[z+1]-1) ) ) )
[1] "5"  "11" "1"  "1"

If you limited the number of elements searched for, it might improve the error trapping, I suppose.

--
David.

David Winsemius, MD
West Hartford, CT
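
The closing remark about limiting the elements searched for can be sketched roughly as follows. The function name parse_formula and its allowed argument are hypothetical, and the default element set is just the one from the example in this thread. The function builds the pattern from the permitted symbols and rejects any string that is not made up entirely of them.

```r
parse_formula <- function(form, allowed = c("C", "H", "Br", "O")) {
  # List longer symbols first, a defensive choice so that a two-letter
  # symbol such as "Br" is preferred over any one-letter prefix.
  allowed <- allowed[order(nchar(allowed), decreasing = TRUE)]
  token <- paste0("(", paste(allowed, collapse = "|"), ")[0-9]*")

  # Reject anything that is not a run of allowed symbols with optional counts.
  if (!grepl(paste0("^(", token, ")+$"), form)) {
    stop("Invalid formula: ", form)
  }

  pieces <- regmatches(form, gregexpr(token, form))[[1]]
  counts <- sub("^[A-Za-z]+", "", pieces)
  counts <- as.numeric(ifelse(counts == "", "1", counts))  # implied 1
  names(counts) <- sub("[0-9]+$", "", pieces)
  counts
}

parse_formula("C5H11BrO")
```

With the default element set, a string such as "Crr" stops with an error instead of being split into pieces, which is the kind of error trapping the remark above has in mind.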