thr3ads.net - R help - [R] Best way to test for numeric digits? [Oct 2023]


Leonard Mada

2023-Oct-18 18:35 UTC

[R] Best way to test for numeric digits?

Dear Rui,

On 10/18/2023 8:45 PM, Rui Barradas wrote:
> split_chem_elements <- function(x, rm.digits = TRUE) {
>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>   if(rm.digits) {
>     stringr::str_replace_all(mol, regex, "#") |>
>       strsplit("#|[[:digit:]]") |>
>       lapply(\(x) x[nchar(x) > 0L])
>   } else {
>     strsplit(x, regex, perl = TRUE)
>   }
> }
>
> split.symbol.character = function(x, rm.digits = TRUE) {
>   # Perl is partly broken in R 4.3, but this works:
>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>   s <- strsplit(x, regex, perl = TRUE)
>   if(rm.digits) {
>     s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
>   }
>   s
> }
The first function has a glitch: mol is hardcoded in the call to
str_replace_all, so the argument x is ignored. After correcting that,
the timings of the two functions are similar.
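To illustrate the glitch: a function body that reads a global variable instead of its argument silently ignores whatever is passed in. The sketch below uses hypothetical helper names (count_chars_buggy, count_chars_fixed) and assumes the recommended codetools package is installed; its findGlobals() function lists the free variables a function depends on, which makes this kind of slip easier to spot.

```r
mol <- c("CCl3F", "Li4Al4H16")

# Buggy: the body refers to the global `mol`, so the argument `x` is never used.
count_chars_buggy <- function(x) nchar(mol)

# Fixed: the body uses the argument.
count_chars_fixed <- function(x) nchar(x)

res_buggy <- count_chars_buggy("H2O")  # 5 9 (lengths of the global mol, not of "H2O")
res_fixed <- count_chars_fixed("H2O")  # 3

# Free (global) variables each function depends on
globals_buggy <- codetools::findGlobals(count_chars_buggy, merge = FALSE)$variables
globals_fixed <- codetools::findGlobals(count_chars_fixed, merge = FALSE)$variables
print(globals_buggy)  # "mol"
```

Calling a function with an input that differs from every object in the workspace is a cheap way to catch the same problem without any extra package.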

Note:
- grep("[[:digit:]]", ...) is almost twice as slow as grep("[0-9]", ...)!
- corrected results below.
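The speed difference noted above can be checked with a small timing sketch. The token vector below is made up for illustration, and the timings depend on the machine, the R version and the regex engine, so no numbers are shown. For plain ASCII input such as chemical formulas the two patterns should select the same elements.

```r
# Made-up tokens resembling the pieces of a split chemical formula
tokens <- rep(c("C", "Cl", "3", "F", "Li", "4", "Al", "H", "16"), 1e5)

# POSIX character class
time_posix <- system.time(
  idx_posix <- grep("[[:digit:]]", tokens, invert = TRUE)
)

# Explicit ASCII range
time_range <- system.time(
  idx_range <- grep("[0-9]", tokens, invert = TRUE)
)

# Both patterns should keep exactly the same (non-digit) tokens
stopifnot(identical(idx_posix, idx_range))

print(rbind(posix_class = time_posix[1:3], ascii_range = time_range[1:3]))
```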

Sincerely,

Leonard
#######

split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    stringr::str_replace_all(x, regex, "#") |>
      strsplit("#|[[:digit:]]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

split.symbol.character = function(x, rm.digits = TRUE) {
  # Perl is partly broken in R 4.3, but this works:
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if(rm.digits) {
    s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
  }
  s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
mol10000 <- rep(mol, 10000)

system.time(
  split_chem_elements(mol10000)
)
#   user  system elapsed
#   0.58    0.00    0.58

system.time(
  split.symbol.character(mol10000)
)
#   user  system elapsed
#   0.67    0.00    0.67
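For readers unfamiliar with the regex, here is a minimal sketch of what each of its three alternatives does and what the base R splitter returns for one formula. The regex is the one from the functions above, only assembled with paste() so that each part can carry a comment; the variable names (pieces, symbols) are illustrative.

```r
regex <- paste(
  "(?<=[A-Z])(?![a-z]|$)",  # after an uppercase letter not followed by a lowercase letter or the end
  "(?<=.)(?=[A-Z])",        # before any uppercase letter that is not the first character
  "(?<=[a-z])(?=[^a-z])",   # after a lowercase letter followed by a non-lowercase character
  sep = "|"
)

# Split at the zero-width boundaries found by the lookarounds
pieces <- strsplit("CCl3F", regex, perl = TRUE)[[1]]
print(pieces)   # "C" "Cl" "3" "F"

# Drop the pieces that contain digits, as rm.digits = TRUE does
symbols <- pieces[grep("[0-9]", pieces, invert = TRUE)]
print(symbols)  # "C" "Cl" "F"
```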

Rui Barradas

2023-Oct-18 18:54 UTC


Hello,

You are right, sorry for the blunder :(.
In the code below I have replaced stringr::str_replace_all with the
stringi function stri_replace_all_regex, and the improvement is
significant.


split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    # stri_replace_all_regex() takes (str, pattern, replacement), in that order
    stringi::stri_replace_all_regex(x, regex, "#") |>
      strsplit("#|[0-9]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

# Timings as originally posted. They were taken with the pattern and
# replacement arguments of stri_replace_all_regex() swapped (x, "#", regex),
# so they may not carry over to the corrected call above.
# system.time(
#   split_chem_elements(mol10000)
# )
#  user  system elapsed
#  0.06    0.00    0.09
# system.time(
#   split.symbol.character(mol10000)
# )
#  user  system elapsed
#  0.25    0.00    0.28
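A timing comparison only means something if both implementations return the same result, so it may be worth checking that first. The sketch below is an assumption-laden stand-in rather than the code from this thread: base R gsub(perl = TRUE) replaces the stringr/stringi call so that the example runs without extra packages, and the function names split_by_marker and split_direct are hypothetical.

```r
regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"

# Marker approach: insert "#" at every boundary, then split on "#" or a digit.
# gsub(perl = TRUE) stands in here for the stringr/stringi replacement call.
split_by_marker <- function(x) {
  gsub(regex, "#", x, perl = TRUE) |>
    strsplit("#|[0-9]") |>
    lapply(\(s) s[nzchar(s)])
}

# Direct approach: split at the boundaries, then drop pieces containing digits.
split_direct <- function(x) {
  strsplit(x, regex, perl = TRUE) |>
    lapply(\(s) s[grep("[0-9]", s, invert = TRUE)])
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
res_marker <- split_by_marker(mol)
res_direct <- split_direct(mol)

# Fail loudly if the two approaches disagree
stopifnot(identical(res_marker, res_direct))
print(res_direct[[2]])  # "Li" "Al" "H"
```

The same identical() check can be applied to split_chem_elements() and split.symbol.character() before their system.time() results are compared.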



Hope this helps,

Rui Barradas



