Dear Rui,

Thank you for your reply.

I actually do have access to the chemical symbols: I have started to
refactor and enhance the Rpdb package, see Rpdb::elements:
https://github.com/discoleo/Rpdb

However, the regex that you have constructed is quite heavy, as it needs
to iterate through all chemical symbols (in decreasing nchar). Elements
like C, and especially O, P or S, appear late in the regex, yet they
are quite common in chemistry.

The alternative regex is (in this respect) simpler. It actually works
(once you know about the workaround).

Q: My question was whether there is anything like is.numeric that
tests each element of a vector.

Sincerely,

Leonard

On 10/18/2023 6:53 PM, Rui Barradas wrote:
> At 15:59 on 18/10/2023, Leonard Mada via R-help wrote:
>> Dear List members,
>>
>> What is the best way to test for numeric digits?
>>
>> suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
>> # [1] NA NA NA  2 NA NA  3
>> The above requires the use of the suppressWarnings function. Are there
>> any better ways?
>>
>> I was working to extract chemical elements from a formula, something
>> like this:
>> split.symbol.character = function(x, rm.digits = TRUE) {
>>     # Perl is partly broken in R 4.3, but this works:
>>     regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>>     # stringi::stri_split(x, regex = regex);
>>     s = strsplit(x, regex, perl = TRUE);
>>     if(rm.digits) {
>>         s = lapply(s, function(s) {
>>             isNotD = is.na(suppressWarnings(as.numeric(s)));
>>             s = s[isNotD];
>>         });
>>     }
>>     return(s);
>> }
>>
>> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
>>
>> Sincerely,
>>
>> Leonard
>>
>> Note:
>> # works:
>> regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>>
>> # broken in R 4.3.1
>> # only slightly "erroneous" with stringi::stri_split
>> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>>
> Hello,
>
> If you want to extract chemical element symbols, the following might work.
> It uses the periodic table in the GitHub package chemr and a function from
> the stringr package.
>
> devtools::install_github("paleolimbot/chemr")
>
> split_chem_elements <- function(x) {
>   data(pt, package = "chemr", envir = environment())
>   el <- pt$symbol[order(nchar(pt$symbol), decreasing = TRUE)]
>   pat <- paste(el, collapse = "|")
>   stringr::str_extract_all(x, pat)
> }
>
> mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
> split_chem_elements(mol)
> #> [[1]]
> #> [1] "C"  "Cl" "F"
> #>
> #> [[2]]
> #> [1] "Li" "Al" "H"
> #>
> #> [[3]]
> #>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"
>
> It is also possible to rewrite the function without calls to non-base
> packages, but that will take some more work.
>
> Hope this helps,
>
> Rui Barradas
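One way to approach the question above without suppressWarnings is a vectorized regular-expression test. The following is only a sketch: the helper name is.digits is hypothetical (it is not part of base R or Rpdb), and it assumes that "numeric digits" means strings made up solely of the characters 0-9.

```r
# Hypothetical helper: TRUE for each element that consists only of digits.
is.digits = function(x) {
    grepl("^[0-9]+$", x)
}

x = c("Li", "Na", "K", "2", "Rb", "Ca", "3")
is.digits(x)
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

# Keep only the non-digit tokens, with no warnings involved.
x[!is.digits(x)]
# [1] "Li" "Na" "K"  "Rb" "Ca"
```

Unlike as.numeric, this test deliberately rejects strings such as "1e3" or "-2", which may or may not be what a given formula parser needs.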
@vi@e@gross m@iii@g oii gm@ii@com
2023-Oct-18 17:19 UTC
[R] Best way to test for numeric digits?
Rui,
The problem with searching for elements, as with many kinds of text, is that the
optimal search order depends on how likely each item is to occur. More
elements, such as Unobtainium, may be added in the future, with whatever
abbreviations, and that may change the algorithm you have chosen. But then
again, who actually looks for elements with a negligible half-life?
If you had an application focused on Organic Chemistry, relatively few of the
elements would normally be present, while for something like electronics
components, a different, overlapping palette with its own probabilities would
be found.
Just how important is efficiency for you? If this were in a language like
Python, I would consider using a dictionary or set, and I think there are
packages in R that support a version of this. In your case, one solution would
be to pre-create a dictionary of all the elements, or just a set, then take
your tokens and check whether each one is in the dictionary/set. Any that
aren't can then be examined further as needed, and if your data is laid out in
a specific way, they may all just turn out to be numeric. The cost is the
hashing and, of course, the memory used. Your corpus of elements is small
enough that this may not help as much as it would when parsing text that can
contain many thousands of words.
Even in plain R, you can probably also use something like:
elements = c("H", "He", "Li", ...)
if (text %in% elements) ...
Something like the above may not be faster but can be quite a bit more readable
than the regular expressions.
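The %in% idea could look roughly like the sketch below. The elements vector here is a deliberately tiny, hypothetical subset of the periodic table chosen for illustration, and the tokens are assumed to have already been split out of a formula (for example by the strsplit regex earlier in the thread).

```r
# Hypothetical, incomplete set of element symbols (illustration only).
elements = c("H", "C", "O", "F", "Al", "Si", "P", "Cl", "Li")

# Tokens as they might come out of splitting "Li4Al4H16".
tokens = c("Li", "4", "Al", "4", "H", "16")

# %in% is vectorized: it yields one TRUE/FALSE per token.
isElement = tokens %in% elements
tokens[isElement]
# [1] "Li" "Al" "H"

# Whatever is left can be examined further (here: the counts).
tokens[!isElement]
# [1] "4"  "4"  "16"
```

Because %in% returns a logical vector, it is used here for subsetting rather than inside a bare if(), which expects a single TRUE or FALSE.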
But plenty of the solutions others offered may well be great for your current
need.
Some may even work with Handwavium.
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Leonard Mada
via R-help
Sent: Wednesday, October 18, 2023 12:24 PM
To: Rui Barradas <ruipbarradas at sapo.pt>; R-help Mailing List <r-help
at r-project.org>
Subject: Re: [R] Best way to test for numeric digits?
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
At 17:24 on 18/10/2023, Leonard Mada wrote:
> However, the regex that you have constructed is quite heavy, as it needs
> to iterate through all chemical symbols (in decreasing nchar). Elements
> like C, and especially O, P or S, appear late in the regex, yet they
> are quite common in chemistry.
>
> The alternative regex is (in this respect) simpler. It actually works
> (once you know about the workaround).
>
> Q: My question was whether there is anything like is.numeric that
> tests each element of a vector.

Hello,

You and Avi are right, my function's performance is terrible. The
following is much faster.

As for how to test for digits without throwing warnings, the lapply in the
version of your function below solves it by calling grep with
invert = TRUE. This returns all strings in which no digits occur.

split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    # Mark the split points with "#", then split on "#" or on any digit.
    # Uses the argument x (the posted code read the global `mol` here).
    stringr::str_replace_all(x, regex, "#") |>
      strsplit("#|[[:digit:]]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

split.symbol.character = function(x, rm.digits = TRUE) {
  # Perl is partly broken in R 4.3, but this works:
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if(rm.digits) {
    # Keep only the tokens that contain no digits.
    s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
  }
  s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

split.symbol.character(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

mol10000 <- rep(mol, 10000)

# Timings as originally posted. The posted split_chem_elements() read the
# global `mol` instead of its argument `x`, so its timing reflects 3
# strings, not 30000.
system.time(
  split_chem_elements(mol10000)
)
#>    user  system elapsed
#>    0.01    0.00    0.02
system.time(
  split.symbol.character(mol10000)
)
#>    user  system elapsed
#>    0.35    0.07    0.47

Hope this helps,

Rui Barradas