thr3ads.net - R help - [R] Best way to test for numeric digits? [Oct 2023]

Leonard Mada

2023-Oct-18 14:59 UTC

[R] Best way to test for numeric digits?

Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3
The above requires the use of the suppressWarnings function. Are there 
any better ways?

I was working to extract chemical elements from a formula, something 
like this:
split.symbol.character = function(x, rm.digits = TRUE) {
    # Perl is partly broken in R 4.3, but this works:
    regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
    # stringi::stri_split(x, regex = regex);
    s = strsplit(x, regex, perl = TRUE);
    if(rm.digits) {
        s = lapply(s, function(s) {
            isNotD = is.na(suppressWarnings(as.numeric(s)));
            s = s[isNotD];
        });
    }
    return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))


Sincerely,


Leonard


Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)

Ben Bolker

2023-Oct-18 15:08 UTC

There are some answers on Stack Overflow:

https://stackoverflow.com/questions/14984989/how-to-avoid-warning-when-introducing-nas-by-coercion



Jeff Newmiller

2023-Oct-18 15:12 UTC

Use any occurrence of one or more digits as a separator?

s <- c( "CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl" )
strsplit( s, "\\d+" )
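
A quick check of what this returns on the sample formulas may help. This is only a sketch: the second step, which pulls out individual symbols with the pattern `[A-Z][a-z]*`, is an assumed follow-up and not part of the suggestion above. Splitting on digits alone leaves elements that are not separated by a count joined together:

```r
s <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")

# Splitting on runs of digits removes the counts, but elements that are
# not separated by a digit stay glued together.
by_digits <- strsplit(s, "\\d+")
by_digits[[1]]
# [1] "CCl" "F"

# Assumed follow-up step: extract each symbol as one uppercase letter
# followed by optional lowercase letters.
symbols <- regmatches(s, gregexpr("[A-Z][a-z]*", s))
symbols[[1]]
# [1] "C"  "Cl" "F"
```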


Ivan Krylov

2023-Oct-18 15:26 UTC

On Wed, 18 Oct 2023 17:59:01 +0300, Leonard Mada via R-help
<r-help at r-project.org> wrote:
> What is the best way to test for numeric digits?
> 
> suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
> # [1] NA NA NA  2 NA NA  3
> The above requires the use of the suppressWarnings function. Are
> there any better ways?

This test also has the downside of accepting things like "1.2" and
"+1e-100". Since you need digits only, why not use a regular expression
to test for '^[0-9]+$'?
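
A minimal sketch of that test with `grepl()`, using the sample vector from the question plus the two strings mentioned above (the variable names are made up for illustration):

```r
x <- c("Li", "Na", "K", "2", "Rb", "Ca", "3", "1.2", "+1e-100")

# TRUE only for strings made up entirely of the digits 0-9.
is_digits <- grepl("^[0-9]+$", x)

# No coercion happens, so no warning is raised and suppressWarnings()
# is not needed.
x[is_digits]
# [1] "2" "3"
```

Unlike the `as.numeric()` test, "1.2" and "+1e-100" are not treated as digits here.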

> I was working to extract chemical elements from a formula, something 
> like this:
> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))

Perhaps the following function could be made to work in your cases?

function(x) regmatches(x, gregexec('([A-Z][a-z]*)([0-9]*)', x))

For each matrix in the returned list (call it retval), retval[2,] is
the element and retval[3,] is the coefficient. Do you
need brackets? Charges? Non-stoichiometric compounds? (SMILES?)
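
A short usage sketch of that function, assuming R >= 4.1 (where `gregexec()` was added). The name `parse_formula` and the conversion of empty coefficients to 1 are illustrative additions, not part of the suggestion:

```r
# Hypothetical name for the anonymous function above.
parse_formula <- function(x) {
    regmatches(x, gregexec("([A-Z][a-z]*)([0-9]*)", x))
}

# One matrix per input string: row 1 is the full match, row 2 the
# element symbol, row 3 the coefficient (as a string).
retval <- parse_formula("CCl3F")[[1]]
elements <- as.vector(retval[2, ])   # "C" "Cl" "F"
coef_chr <- as.vector(retval[3, ])   # ""  "3"  ""

# Assumption for this sketch: a missing coefficient means 1.
counts <- as.integer(coef_chr)
counts[is.na(counts)] <- 1L
counts
# [1] 1 3 1
```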

> # broken in R 4.3.1
> # only slightly "erroneous" with stringi::stri_split
> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)

strsplit() has special historical behaviour about empty matches:
https://bugs.r-project.org/show_bug.cgi?id=16745

It's unfortunate that it doesn't split on empty matches the way you
would intuitively expect it to, but changing the behaviour at this
point is hard. Even adding a flag may be complicated to implement. Do
you want such a flag?
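
One way to sidestep the empty-match question altogether is to extract the tokens instead of splitting between them. The following is a sketch of that alternative, not a fix for strsplit(); the token pattern is an assumption that covers only simple formulas without brackets or charges:

```r
x <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")

# A token is either an element symbol or a run of digits, so no
# zero-width (empty) matches are involved.
tokens <- regmatches(x, gregexpr("[A-Z][a-z]*|[0-9]+", x))
tokens[[1]]
# [1] "C"  "Cl" "3"  "F"

# Dropping the digit tokens leaves the symbols only.
symbols <- lapply(tokens, function(tok) tok[!grepl("^[0-9]+$", tok)])
symbols[[2]]
# [1] "Li" "Al" "H"
```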

-- 
Best regards,
Ivan

Rui Barradas

2023-Oct-18 15:53 UTC

Hello,

If you want to extract chemical element symbols, the following might work.
It uses the periodic table from the GitHub package chemr and a function from
the package stringr.


devtools::install_github("paleolimbot/chemr")



split_chem_elements <- function(x) {
   data(pt, package = "chemr", envir = environment())
   el <- pt$symbol[order(nchar(pt$symbol), decreasing = TRUE)]
   pat <- paste(el, collapse = "|")
   stringr::str_extract_all(x, pat)
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"


It is also possible to rewrite the function without calls to non-base
packages, but that will take some more work.
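
A rough idea of what a base-only version could look like, as a sketch under assumptions: the `symbols` vector below is a small hand-typed subset standing in for the full periodic table from chemr, and the function name is made up. `regmatches()` with `gregexpr()` replaces `stringr::str_extract_all()`:

```r
# Hypothetical subset of element symbols; a real version would need
# the complete list.
symbols <- c("H", "Li", "C", "O", "F", "Al", "Si", "P", "Cl")

split_chem_elements_base <- function(x, symbols) {
    # Longest symbols first, so that "Cl" is tried before "C".
    el <- symbols[order(nchar(symbols), decreasing = TRUE)]
    pat <- paste(el, collapse = "|")
    regmatches(x, gregexpr(pat, x, perl = TRUE))
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
res <- split_chem_elements_base(mol, symbols)
res[[1]]
# [1] "C"  "Cl" "F"
```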

Hope this helps,

Rui Barradas


Richard O'Keefe

2023-Oct-19 00:45 UTC

This seems unnecessarily complex.  Or rather,
it pushes the complexity into an arcane notation.
What we really want is something that says "here is a string,
here is a pattern, give me all the substrings that match."
What we're given is a function that tells us where those
substrings are.

# greg.matches(pattern, text)
# accepts a POSIX regular expression, pattern
# and a text to search in.  Both arguments must be character strings
# (length(...) = 1) not longer vectors of strings.
# It returns a character vector of all the (non-overlapping)
# substrings of text as determined by gregexpr.

greg.matches <- function (pattern, text) {
    if (length(pattern) > 1) stop("pattern has too many elements")
    if (length(text)    > 1) stop(   "text has too many elements")
    match.info <- gregexpr(pattern, text)
    starts <- match.info[[1]]
    # gregexpr reports "no match" as a single start position of -1
    if (isTRUE(starts[1] == -1)) return(character(0))
    stops <- attr(starts, "match.length") - 1 + starts
    sapply(seq(along=starts), function (i) {
       substr(text, starts[i], stops[i])
    })
}
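
For comparison, base R's `regmatches()` performs the same extraction directly from the `gregexpr()` result. A short sketch of the equivalent call follows; the wrapper name `greg.matches2` is made up for this illustration:

```r
greg.matches2 <- function(pattern, text) {
    stopifnot(length(pattern) == 1, length(text) == 1)
    # regmatches() returns a list with one element per input string;
    # with a single string we take the first element.
    regmatches(text, gregexpr(pattern, text))[[1]]
}

m <- greg.matches2("[A-Z][a-z]*[0-9]*", "CCl3F")
m
# [1] "C"   "Cl3" "F"
```

When nothing matches, `regmatches()` returns a zero-length character vector.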

Given greg.matches, we can do the rest with very simple
and easily comprehended regular expressions.

# parse.chemical(formula)
# takes a simple chemical formula "<element><count>..." and
# returns a list with components
# $elements -- character -- the atom symbols
# $counts   -- number    -- the counts (missing counts taken as 1).
# BEWARE.  This does not handle formulas like "CH(OH)3".

parse.chemical <- function (formula) {
    parts <- greg.matches("[A-Z][a-z]*[0-9]*", formula)
    elements <- gsub("[0-9]+", "", parts)
    counts <- as.numeric(gsub("[^0-9]+", "", parts))
    counts <- ifelse(is.na(counts), 1, counts)
    list(elements=elements, counts=counts)
}

> parse.chemical("CCl3F")
$elements
[1] "C"  "Cl" "F"

$counts
[1] 1 3 1

> parse.chemical("Li4Al4H16")
$elements
[1] "Li" "Al" "H"

$counts
[1]  4  4 16

> parse.chemical("CCl2CO2AlPO4SiO4Cl")
$elements
 [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

$counts
 [1] 1 2 1 2 1 1 4 1 4 1
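
Since an element can occur more than once in a formula, the parsed result can be collapsed to one total per element. This is a self-contained sketch that repeats the parsing steps with `regmatches()` in place of greg.matches; summing with `tapply()` is an illustrative addition:

```r
formula <- "CCl2CO2AlPO4SiO4Cl"

# Same steps as parse.chemical: element-plus-count chunks, then the
# symbol and the count are separated.
parts <- regmatches(formula, gregexpr("[A-Z][a-z]*[0-9]*", formula))[[1]]
elements <- gsub("[0-9]+", "", parts)
counts <- as.numeric(gsub("[^0-9]+", "", parts))
counts[is.na(counts)] <- 1   # missing counts taken as 1

# Sum the counts of repeated elements.
totals <- tapply(counts, elements, sum)
totals[["C"]]    # 2
totals[["Cl"]]   # 3
totals[["O"]]    # 10
```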


@vi@e@gross m@iii@g oii gm@ii@com

2023-Oct-20 17:27 UTC

Leonard,

Since it now seems a main consideration you have is speed/efficiency, maybe a
step back might help.

Are there simplifying assumptions that are valid or can you make it simpler,
such as converting everything to the same case?

Your sample data was this and I assume your actual data is similar and far
longer.

c("Li", "Na", "K",  "2", "Rb", "Ca", "3")

So rather than use complex and costly regular expressions, or other full
searches, can you just assume all entries start with either an uppercase letter
or a numeral and test for those using something simple like

> substr(c("Li", "Na", "K",  "2", "Rb", "Ca", "3"), 1, 1)
[1] "L" "N" "K" "2" "R" "C" "3"

If you save that in a variable you can check if that is greater than or equal to
"A" or perhaps "0" and also perhaps if it is less than or
equal to "Z" or perhaps "9" and see if such a test is
faster.

orig <- c("Li", "Na", "K",  "2", "Rb", "Ca", "3")
initial <- substr(orig, 1, 1)
elements_bool <- initial >= "A" & initial <= "Z"

The latter contains a Boolean vector you can use to index your original and toss
away the ones with digits, or any lower case letter versions or any other
UNICODE symbols.
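
One caveat: comparing strings with `>=` and `<=` in R follows the collation order of the current locale, so in some locales a lowercase initial can also fall between "A" and "Z". A sketch of a variant that does not depend on the locale, using the built-in `LETTERS` vector (this swaps the range comparison for exact matching and is not the code from the message):

```r
orig <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")
initial <- substr(orig, 1, 1)

# LETTERS holds "A" to "Z"; %in% does exact matching, so the result
# does not depend on the locale's sort order.
elements_bool <- initial %in% LETTERS

orig[elements_bool]
# [1] "Li" "Na" "K"  "Rb" "Ca"
orig[!elements_bool]
# [1] "2" "3"
```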

orig_elements <- orig[elements_bool]

> orig
[1] "Li" "Na" "K"  "2"  "Rb" "Ca" "3"
> orig_elements
[1] "Li" "Na" "K"  "Rb" "Ca"
> orig[!elements_bool]
[1] "2" "3"

Another approach, depending on your needs, is to keep your data as a column
in a data.frame, tibble or similar structure and add further columns as you go,
so that your information stays consolidated. That can be efficient, especially
if you shift some of your logic to faster compiled functionality, perhaps via
packages that fit your needs better, such as data.table, dplyr and others in the
tidyverse. And note that if you use pipelines, the new built-in pipe may be
faster for many purposes.

