You might be better off using a web service like ChemSpider to do the matching
for you <http://www.chemspider.com/AboutServices.aspx?>. Identifying synonyms
by comparing the names themselves is probably optimistic unless they are exact
matches.
Here's some Python code that seems to make it pretty easy:
https://github.com/mcs07/ChemSpiPy. Search for each name, extract the InChI of
the best match, and then match the two databases in R via the InChI. The result
might require some fixing by hand afterwards.
HTH,
Jason Law
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Zsurzsa Laszlo
Sent: Wednesday, July 03, 2013 7:28 AM
To: r-help at r-project.org
Subject: [R] String based chemical name identification
The problem is the following:
I have two big databases. The first one looks like this:
2-Methyl-4-trimethylsilyloxyoct-5-yne
Benzoic acid, methyl ester
Benzoic acid, 2-methyl-, methyl ester
Acetic acid, phenylmethyl ester
2,7-Dimethyl-4-trimethylsilyloxyoct-7-en-5-yne
etc.
The second one looks like this:
Name: D-Tagatose 1,6-bisphosphate
Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol
Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione
Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine
Name: H+;: Hydron
Name: 3-Iodo-L-tyrosine
etc.
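
Before any matching, the second database has to be broken into one synonym list per compound. A minimal Python sketch of that step follows. It assumes that every record starts with the literal marker "Name:", that synonyms are separated by ";:", and that any line wraps fall on spaces. The function name `parse_synonym_records` is made up for illustration.

```python
import re
from typing import List


def parse_synonym_records(text: str) -> List[List[str]]:
    """Split a 'Name: a;: b Name: c' dump into one synonym list per record."""
    records: List[List[str]] = []
    # Every record is assumed to begin with the literal marker "Name:".
    for chunk in re.split(r"\bName:", text):
        # Collapse line wraps and repeated spaces inside the record.
        chunk = " ".join(chunk.split())
        if not chunk:
            continue
        # Synonyms within a record are assumed to be separated by ";:".
        synonyms = [s.strip() for s in chunk.split(";:") if s.strip()]
        records.append(synonyms)
    return records
```

Each inner list then holds all the names of one compound, so a hit on any synonym identifies the whole record.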
Both of them have more than 3000 lines. Matching the names by hand is not an
option because I don't know chemistry.
*Possible solution I came up with*:
Go through all the names in the first database and try to match each one
against the other database. I'm using the *regexec* and *strsplit* functions
for the matching. Basically, I split each name into small chunks and try to get
some hits in the other database.
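
The chunk-based idea can be sketched as below, in Python rather than R. It is a rough illustration, not the code used here: names are split into lowercase alphanumeric chunks and candidates are scored by Jaccard overlap. The helper names and the 0.6 threshold are assumptions.

```python
import re
from typing import Iterable, Optional, Set, Tuple


def chunks(name: str) -> Set[str]:
    """Lowercase a name and split it into alphanumeric chunks."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of distinct chunks that the two sets have in common."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(
    name: str, candidates: Iterable[str], threshold: float = 0.6
) -> Optional[Tuple[str, float]]:
    """Return the best-scoring candidate, or None if it is below threshold."""
    query = chunks(name)
    best, best_score = None, 0.0
    for cand in candidates:
        score = jaccard(query, chunks(cand))
        if score > best_score:
            best, best_score = cand, score
    if best is not None and best_score >= threshold:
        return best, best_score
    return None
```

Scoring like this only rewards shared spelling. The two synonyms "Androst-4-ene-3,17-dione" and "4-Androstene-3,17-dione" share just four of their seven distinct chunks, so genuine synonyms can easily fall below any reasonable threshold.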
I can supply code if needed, but I did not want to spam in the first mail.
Any solution is welcome! It can be pseudocode or any kind of logical
reasoning. It does not matter.
Laszlo-Andras Zsurzsa
Msc. Informatics, Technical University Munchen