At 19:35 on 18/10/2023, Leonard Mada wrote:
> Dear Rui,
>
> On 10/18/2023 8:45 PM, Rui Barradas wrote:
>> split_chem_elements <- function(x, rm.digits = TRUE) {
>>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>>   if(rm.digits) {
>>     stringr::str_replace_all(mol, regex, "#") |>
>>       strsplit("#|[[:digit:]]") |>
>>       lapply(\(x) x[nchar(x) > 0L])
>>   } else {
>>     strsplit(x, regex, perl = TRUE)
>>   }
>> }
>>
>> split.symbol.character = function(x, rm.digits = TRUE) {
>>   # Perl is partly broken in R 4.3, but this works:
>>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>>   s <- strsplit(x, regex, perl = TRUE)
>>   if(rm.digits) {
>>     s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
>>   }
>>   s
>> }
>
> You have a glitch (mol is hardcoded) in the code of the first function.
> The times are similar, after correcting for that glitch.
>
> Note:
> - grep("[[:digit:]]", ...) is almost twice as slow as grep("[0-9]", ...)!
> - corrected results below;
>
> Sincerely,
>
> Leonard
> #######
>
> split_chem_elements <- function(x, rm.digits = TRUE) {
>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>   if(rm.digits) {
>     stringr::str_replace_all(x, regex, "#") |>
>       strsplit("#|[[:digit:]]") |>
>       lapply(\(x) x[nchar(x) > 0L])
>   } else {
>     strsplit(x, regex, perl = TRUE)
>   }
> }
>
> split.symbol.character = function(x, rm.digits = TRUE) {
>   # Perl is partly broken in R 4.3, but this works:
>   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
>   s <- strsplit(x, regex, perl = TRUE)
>   if(rm.digits) {
>     s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
>   }
>   s
> }
>
> mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
> mol10000 <- rep(mol, 10000)
>
> system.time(
>   split_chem_elements(mol10000)
> )
> #   user  system elapsed
> #   0.58    0.00    0.58
>
> system.time(
>   split.symbol.character(mol10000)
> )
> #   user  system elapsed
> #   0.67    0.00    0.67
>
Hello,
You are right, sorry for the blunder :(.
In the code below I have replaced stringr::str_replace_all with the
stringi function stri_replace_all_regex, and the improvement is
significant.
split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    # argument order is (str, pattern, replacement)
    stringi::stri_replace_all_regex(x, regex, "#") |>
      strsplit("#|[0-9]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}
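
A quick sanity check of what the splitters are meant to return may be useful. The sketch below uses only base R (the split.symbol.character version posted earlier in this thread), so it runs without stringr or stringi; the short `mol` vector is just two of the formulas from the thread.

```r
# Base-R splitter, as posted earlier in this thread.
split.symbol.character <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if (rm.digits) {
    # drop the pieces that contain digits (the element counts)
    s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
  }
  s
}

mol <- c("CCl3F", "Li4Al4H16")

res <- split.symbol.character(mol)                            # symbols only
res_counts <- split.symbol.character(mol, rm.digits = FALSE)  # keep counts

print(res)         # res[[1]] is c("C", "Cl", "F")
print(res_counts)  # res_counts[[1]] is c("C", "Cl", "3", "F")
```

With rm.digits = TRUE, split_chem_elements is intended to give the same list, so identical() on the two results is a cheap way to catch a slip such as a hardcoded variable.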
# system.time(
#   split_chem_elements(mol10000)
# )
#    user  system elapsed
#    0.06    0.00    0.09

# system.time(
#   split.symbol.character(mol10000)
# )
#    user  system elapsed
#    0.25    0.00    0.28
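
On the grep() note quoted above, here is a rough sketch for comparing the two character classes on your own machine. The vector `pieces` is made up for illustration, and the timings will depend on the R version, platform and locale, so no numbers are claimed here; the check only confirms that both patterns select the same elements for ASCII input.

```r
# Hypothetical test data: element symbols mixed with counts.
pieces <- rep(c("C", "Cl", "3", "F", "Li", "4", "H", "16"), 50000)

# Time each pattern; invert = TRUE keeps the elements WITHOUT digits.
t_posix <- system.time(i_posix <- grep("[[:digit:]]", pieces, invert = TRUE))
t_range <- system.time(i_range <- grep("[0-9]", pieces, invert = TRUE))

# Both patterns should pick the same elements for ASCII input.
same <- identical(i_posix, i_range)

print(t_posix)
print(t_range)
print(same)
```

A single system.time() call is noisy, so repeating the measurement a few times before drawing conclusions is probably wise.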
Hope this helps,
Rui Barradas