thr3ads.net - R help - [R] Parsing a Simple Chemical Formula [Dec 2010]

Bryan Hanson

2010-Dec-26 23:29 UTC

[R] Parsing a Simple Chemical Formula

Hello R Folks...

I've been looking around the 'net and I see many complex solutions in  
various languages to this question, but I have a pretty simple need  
(and I'm not much good at regex).  I want to use a chemical formula as  
a function argument.  The formula would be in "Hill order" which is to
list C, then H, then all other elements in alphabetical order.  My  
example will have only a limited number of elements, few enough that  
one can search directly for each element.  So some examples would be  
C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or  
Br, there is no following number meaning a 1 is implied).

Let's say

 > form <- "C5H11BrO"

I'd like to get the count of each element, so in this case I need to  
extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the  
molecular weight by multiplying).  Sounds pretty simple, but my  
experiments with grep and strsplit don't immediately clue me into an  
obvious solution.  As I said, I don't need a general solution to the  
problem of calculating molecular weight from an arbitrary formula,  
that seems quite challenging, just a way to convert "form" into a list
or data frame which I can then do the math on.

Here's hoping this is a simple issue for more experienced R users!   
TIA,  Bryan
***********
Bryan Hanson
Professor of Chemistry & Biochemistry

jim holtman

2010-Dec-27 00:19 UTC

[R] Parsing a Simple Chemical Formula

try this:

> f.extract <- function(formula)
+ {
+     # pattern to match the initial chemical
+     # assumes chemical starts with an upper case and optional lower case
+     # followed by zero or more digits.
+     first <- "^([[:upper:]][[:lower:]]?)([0-9]*).*"
+     # inverse of above to remove the initial chemical
+     last <- "^[[:upper:]][[:lower:]]?[0-9]*(.*)"
+     result <- list()
+     extract <- formula
+     # repeat as long as there is data
+     while ((start <- nchar(extract)) > 0){
+         chem <- sub(first, '\\1 \\2', extract)
+         extract <- sub(last, '\\1', extract)
+         # if the number of characters is the same, then there was an error
+         if (nchar(extract) == start){
+             warning("Invalid formula:", formula)
+             return(NULL)
+         }
+         # append to the list
+         result[[length(result) + 1L]] <- strsplit(chem, ' ')[[1]]
+     }
+     result
+ }
> f.extract("C5H11BrO")
[[1]]
[1] "C" "5"

[[2]]
[1] "H"  "11"

[[3]]
[1] "Br"

[[4]]
[1] "O"

> f.extract("H2O")
[[1]]
[1] "H" "2"

[[2]]
[1] "O"

> f.extract("CCC")
[[1]]
[1] "C"

[[2]]
[1] "C"

[[3]]
[1] "C"

> f.extract("Crr")  # bad
NULL
Warning message:
In f.extract("Crr") : Invalid formula:Crr
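
The list returned by f.extract omits the count for entries where a 1 is implied. A minimal sketch of one way to fill those in and collect the result into a data frame follows. The f.extract output for "C5H11BrO" is hard-coded so the snippet runs on its own, and the names parsed, element, count and counts_df are illustrative assumptions, not part of the reply above.

```r
# Output of f.extract("C5H11BrO"), hard-coded here so the sketch is
# self-contained (assumption: same shape as the transcript above).
parsed <- list(c("C", "5"), c("H", "11"), "Br", "O")

# The first entry of each vector is the element symbol.
element <- vapply(parsed, function(p) p[1], character(1))

# The second entry, when present, is the count; a missing count means 1.
count <- vapply(parsed,
                function(p) if (length(p) > 1) as.numeric(p[2]) else 1,
                numeric(1))

counts_df <- data.frame(element = element, count = count,
                        stringsAsFactors = FALSE)
print(counts_df)
```

Using vapply rather than sapply keeps the result types fixed even if the list is empty.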

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

David A. Johnston

2010-Dec-27 00:21 UTC

[R] Parsing a Simple Chemical Formula

There might be something simpler, but this is what I came up with:

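form = "C5H11BrO"
# start position of each upper-case letter, plus one past the end of the string
ups = c(gregexpr("[[:upper:]]", form)[[1]], nchar(form) + 1)
# cut the string into one piece per element, e.g. "C5", "H11", "Br", "O"
separated = sapply(1:(length(ups) - 1),
  function(x) substr(form, ups[x], ups[x + 1] - 1))
elements = gsub("[[:digit:]]", "", separated)
nums = gsub("[[:alpha:]]", "", separated)
# a missing number means a count of 1
ans = data.frame(element = as.character(elements),
  num = as.numeric(ifelse(nums == "", 1, nums)), stringsAsFactors = FALSE)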
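
Once the counts are in a data frame, the molecular weight asked about in the original post is a lookup and a sum. A minimal sketch, assuming a small hand-entered table of standard atomic weights (the atomic_weight vector and the mol_weight name are illustrative assumptions; the data frame is rebuilt by hand so the snippet runs on its own):

```r
# Same content as the data frame built above for "C5H11BrO".
ans <- data.frame(element = c("C", "H", "Br", "O"),
                  num = c(5, 11, 1, 1),
                  stringsAsFactors = FALSE)

# Assumed lookup table of standard atomic weights (g/mol), limited to the
# few elements expected in the formulas.
atomic_weight <- c(C = 12.011, H = 1.008, Br = 79.904, O = 15.999)

# Stop early if the formula contains an element missing from the table.
unknown <- setdiff(ans$element, names(atomic_weight))
if (length(unknown) > 0) {
  stop("No atomic weight for: ", paste(unknown, collapse = ", "))
}

# Multiply each count by its atomic weight and add up the products.
mol_weight <- sum(ans$num * atomic_weight[ans$element])
print(mol_weight)
```

With these weights the result for C5H11BrO is about 167.05 g/mol.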

Gabor Grothendieck

2010-Dec-27 00:26 UTC

[R] Parsing a Simple Chemical Formula

This can be done with strapply in the gsubfn package.  It matches the
regular expression against the target string and passes the back
references (the parenthesized portions of the regular expression) to a
specified function as successive arguments.

The first arg is form, your input string.  The second arg is the
regular expression, which matches an upper case letter optionally
followed by lower case letters, all of that optionally followed by
digits.  The third arg is a function written in formula notation;
strapply passes it the two back references as its arguments, and it
substitutes 1 when the digits are missing.  The simplify arg is another
function in formula notation, which turns the result into a matrix and
then a data frame.  Finally, we make the second column of the data
frame numeric.

library(gsubfn)

DF <- strapply(form,
   "([A-Z][a-z]*)(\\d*)",
   ~ c(..1, if (nchar(..2)) ..2 else 1),
   simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
DF[[2]] <- as.numeric(DF[[2]])

DF looks like this:

> DF
  V1 V2
1  C  5
2  H 11
3 Br  1
4  O  1
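
For readers without gsubfn installed, roughly the same table can be built in base R. This is a different technique from strapply: a minimal sketch using regmatches and gregexpr, where the names tokens, element, digits and DF2 are illustrative assumptions.

```r
form <- "C5H11BrO"

# Pull out every "symbol plus optional digits" token, e.g. "C5", "H11", "Br".
tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]

# Strip trailing digits to get the symbol; strip leading letters to get
# the count as text.
element <- sub("[0-9]+$", "", tokens)
digits <- sub("^[A-Za-z]+", "", tokens)

# An empty count means an implied 1.
digits[digits == ""] <- "1"

DF2 <- data.frame(V1 = element, V2 = as.numeric(digits),
                  stringsAsFactors = FALSE)
print(DF2)
```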



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

David Winsemius

2010-Dec-27 00:41 UTC

[R] Parsing a Simple Chemical Formula

Well here's how I see it:

The "form" can be split with a regular expression:
a capital letter followed by zero or one lower-case letter, followed by
any number of digits

greg <- gregexpr("[A-Z]{1}[a-z]?[0-9]*", form)

Append a number equal to one more than the length, for reasons that will  
become clear

ugreg <- c(unlist(greg), nchar(form)+1)

Then use the substr function to serially pick from a split point to one  
less than the next split point (or, in the case of the last element, to  
the end of the string, since the appended value is one more than the  
length):

 > sapply(1:(length(ugreg)-1), function(z) substr(form, ugreg[z], ugreg[z+1]-1) )
[1] "C5"  "H11" "Br"  "O"

Then you can split these "triples" (cap,lower,n) and if n is absent  
assume 1.

 > sub("(\\d*)$", "", sapply(1:(length(ugreg)-1),   # blank out the digits
                 function(z) substr(form, ugreg[z], ugreg[z+1]-1) ) )
[1] "C"  "H"  "Br" "O"

 > sub("^$", "1", sub("([A-Za-z]*)", "",   # subst "1" for empty strings
                     sapply(1:(length(ugreg)-1),
                           function(z) substr(form, ugreg[z], ugreg[z+1]-1) ) ) )
[1] "5"  "11" "1"  "1"

If you limited the number of elements searched for, it might improve  
the error trapping, I suppose.
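
A minimal sketch of that idea, assuming the formulas only ever contain a short known list of elements: build the pattern from the allowed symbols and reject any input that the matched pieces do not fully account for. The allowed vector and the check_formula name are hypothetical, chosen only for illustration.

```r
# Hypothetical list of permitted element symbols.
allowed <- c("Br", "C", "H", "O")

# Put longer symbols first so a two-letter symbol is tried before a
# one-letter symbol that shares its first letter.
allowed <- allowed[order(nchar(allowed), decreasing = TRUE)]
pattern <- paste0("(", paste(allowed, collapse = "|"), ")[0-9]*")

check_formula <- function(form) {
  tokens <- regmatches(form, gregexpr(pattern, form))[[1]]
  # If gluing the tokens back together does not reproduce the input,
  # something in the formula was not a permitted element.
  if (!identical(paste(tokens, collapse = ""), form)) {
    stop("Invalid formula: ", form)
  }
  tokens
}

print(check_formula("C5H11BrO"))
```

Calling check_formula("Crr") stops with an error, because the trailing "rr" is not matched by any allowed symbol.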

-- 
David.

David Winsemius, MD
West Hartford, CT
