thr3ads.net - R help - [R] Best way to test for numeric digits? [Oct 2023]


Rui Barradas

2023-Oct-18 17:45 UTC

[R] Best way to test for numeric digits?

At 17:24 on 18/10/2023, Leonard Mada wrote:
> Dear Rui,
> 
> Thank you for your reply.
> 
> I do have actually access to the chemical symbols: I have started to 
> refactor and enhance the Rpdb package, see Rpdb::elements:
> https://github.com/discoleo/Rpdb
> 
> However, the regex that you have constructed is quite heavy, as it needs 
> to iterate through all chemical symbols (in decreasing nchar). Elements 
> like C, and especially O, P or S, appear late in the regex expression - 
> but are quite common in chemistry.
> 
> The alternative regex is (in this respect) simpler. It actually works 
> (once you know about the workaround).
> 
> Q: My question was whether there is anything like is.numeric that 
> tests each element of a character vector.
> 
> Sincerely,
> 
> 
> Leonard
> 
> 
> On 10/18/2023 6:53 PM, Rui Barradas wrote:
>> At 15:59 on 18/10/2023, Leonard Mada via R-help wrote:
>>> Dear List members,
>>>
>>> What is the best way to test for numeric digits?
>>>
>>> suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
>>> # [1] NA NA NA  2 NA NA  3
>>> The above requires the use of the suppressWarnings function. Are there
>>> any better ways?
>>>
>>> I was working to extract chemical elements from a formula, something
>>> like this:
>>> split.symbol.character = function(x, rm.digits = TRUE) {
>>>     # Perl is partly broken in R 4.3, but this works:
>>>     regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>>>     # stringi::stri_split(x, regex = regex);
>>>     s = strsplit(x, regex, perl = TRUE);
>>>     if(rm.digits) {
>>>         s = lapply(s, function(s) {
>>>             isNotD = is.na(suppressWarnings(as.numeric(s)));
>>>             s = s[isNotD];
>>>         });
>>>     }
>>>     return(s);
>>> }
>>>
>>> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
>>>
>>>
>>> Sincerely,
>>>
>>>
>>> Leonard
>>>
>>>
>>> Note:
>>> # works:
>>> regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>>> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>>>
>>>
>>> # broken in R 4.3.1
>>> # only slightly "erroneous" with stringi::stri_split
>>> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>>> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>>>
>> Hello,
>>
>> If you want to extract chemical elements symbols, the following might 
>> work.
>> It uses the periodic table from the GitHub package chemr and a function
>> from the stringr package.
>>
>>
>> devtools::install_github("paleolimbot/chemr")
>>
>>
>>
>> split_chem_elements <- function(x) {
>>   data(pt, package = "chemr", envir = environment())
>>   el <- pt$symbol[order(nchar(pt$symbol), decreasing = TRUE)]
>>   pat <- paste(el, collapse = "|")
>>   stringr::str_extract_all(x, pat)
>> }
>>
>> mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
>> split_chem_elements(mol)
>> #> [[1]]
>> #> [1] "C"  "Cl" "F"
>> #>
>> #> [[2]]
>> #> [1] "Li" "Al" "H"
>> #>
>> #> [[3]]
>> #>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"
>>
>>
>> It is also possible to rewrite the function without calls to non base
>> packages but that will take some more work.
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>>

Hello,

You and Avi are right, my function's performance is terrible. The 
following is much faster.

As for how to not have digits throw warnings, the lapply in the version 
of your function below solves it by setting grep argument invert = TRUE. 
This will get all strings where digits do not occur.
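
To illustrate the point about warnings: a pattern test works element by element on a character vector and never coerces anything, so there is nothing to suppress. A minimal base R sketch follows; the helper name is_digits is made up for this example, and it only recognises unsigned runs of digits, not signs or decimals.

```r
# Hypothetical helper: TRUE where an element consists only of digits 0-9.
is_digits <- function(x) grepl("^[0-9]+$", x)

s <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")

is_digits(s)                         # logical vector, one value per element
s[!is_digits(s)]                     # keep only the non-digit strings
s[grep("[0-9]", s, invert = TRUE)]   # same selection via invert = TRUE
```

The last two lines select the same elements here because none of the strings mixes letters and digits.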



split_chem_elements <- function(x, rm.digits = TRUE) {
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   if(rm.digits) {
     stringr::str_replace_all(mol, regex, "#") |>
       strsplit("#|[[:digit:]]") |>
       lapply(\(x) x[nchar(x) > 0L])
   } else {
     strsplit(x, regex, perl = TRUE)
   }
}

split.symbol.character = function(x, rm.digits = TRUE) {
   # Perl is partly broken in R 4.3, but this works:
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   s <- strsplit(x, regex, perl = TRUE)
   if(rm.digits) {
     s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
   }
   s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"
split.symbol.character(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

mol10000 <- rep(mol, 10000)

system.time(
   split_chem_elements(mol10000)
)
#>    user  system elapsed
#>    0.01    0.00    0.02
system.time(
   split.symbol.character(mol10000)
)
#>    user  system elapsed
#>    0.35    0.07    0.47



Hope this helps,

Rui Barradas


Leonard Mada

2023-Oct-18 18:35 UTC


[R] Best way to test for numeric digits?

Dear Rui,

On 10/18/2023 8:45 PM, Rui Barradas wrote:
> split_chem_elements <- function(x, rm.digits = TRUE) {
>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>   if(rm.digits) {
>     stringr::str_replace_all(mol, regex, "#") |>
>       strsplit("#|[[:digit:]]") |>
>       lapply(\(x) x[nchar(x) > 0L])
>   } else {
>     strsplit(x, regex, perl = TRUE)
>   }
> }
>
> split.symbol.character = function(x, rm.digits = TRUE) {
>   # Perl is partly broken in R 4.3, but this works:
>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>   s <- strsplit(x, regex, perl = TRUE)
>   if(rm.digits) {
>     s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
>   }
>   s
> }
You have a glitch (mol is hardcoded) in the code of the first function. 
The times are similar, after correcting for that glitch.
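
A small sketch of why the glitch distorts the timing: a function that reads the global mol instead of its argument processes only 3 strings, however long the input is. The two function names below are hypothetical stand-ins, and the split pattern is simplified; only the argument handling matters.

```r
mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
mol10000 <- rep(mol, 10000)

# Hypothetical stand-ins for the two variants of the function.
split_glitch <- function(x) strsplit(mol, "[0-9]+")  # ignores x, reads the global mol
split_fixed  <- function(x) strsplit(x, "[0-9]+")    # uses its argument

length(split_glitch(mol10000))  # 3
length(split_fixed(mol10000))   # 30000
```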

Note:
- grep("[[:digit:]]", ...) is almost twice as slow as grep("[0-9]", ...)!
- corrected results below.
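
One way to check the first note on your own machine is sketched below. The test vector is made up for this example, and timings depend on the machine and R version, so none are shown. For plain ASCII input the two patterns select the same elements.

```r
# Made-up test data: a mix of symbols and digit strings, ASCII only.
x <- rep(c("Li", "4", "Al", "16", "H"), 20000)

# Time both patterns; results will differ between machines.
system.time(i1 <- grep("[[:digit:]]", x, invert = TRUE))
system.time(i2 <- grep("[0-9]", x, invert = TRUE))

# Both calls should pick the same positions for this input.
identical(i1, i2)
```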

Sincerely,

Leonard
#######

split_chem_elements <- function(x, rm.digits = TRUE) {
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   if(rm.digits) {
     stringr::str_replace_all(x, regex, "#") |>
       strsplit("#|[[:digit:]]") |>
       lapply(\(x) x[nchar(x) > 0L])
   } else {
     strsplit(x, regex, perl = TRUE)
   }
}

split.symbol.character = function(x, rm.digits = TRUE) {
   # Perl is partly broken in R 4.3, but this works:
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   s <- strsplit(x, regex, perl = TRUE)
   if(rm.digits) {
     s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
   }
   s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
mol10000 <- rep(mol, 10000)

system.time(
   split_chem_elements(mol10000)
)
#   user  system elapsed
#   0.58    0.00    0.58

system.time(
   split.symbol.character(mol10000)
)
#   user  system elapsed
#   0.67    0.00    0.67
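
For comparison, a base R variant that extracts the symbols directly instead of splitting and then dropping the digits might look like the sketch below. This is not from the thread: the function name is hypothetical, and it assumes every symbol is one uppercase letter followed by zero or more lowercase letters, without checking against the periodic table.

```r
# Hypothetical base R alternative: match the symbols, ignore everything else.
split_symbols_base <- function(x) {
  regmatches(x, gregexpr("[A-Z][a-z]*", x))
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_symbols_base(mol)
```

Because digits never match the pattern, no separate removal step is needed.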
