Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3

The above requires the use of the suppressWarnings function. Are there any better ways?

I was working to extract chemical elements from a formula, something like this:

split.symbol.character = function(x, rm.digits = TRUE) {
    # Perl is partly broken in R 4.3, but this works:
    regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
    # stringi::stri_split(x, regex = regex);
    s = strsplit(x, regex, perl = TRUE);
    if(rm.digits) {
        s = lapply(s, function(s) {
            isNotD = is.na(suppressWarnings(as.numeric(s)));
            s = s[isNotD];
        });
    }
    return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))

Sincerely,

Leonard

Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)

# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
There are some answers on Stack Overflow:

https://stackoverflow.com/questions/14984989/how-to-avoid-warning-when-introducing-nas-by-coercion

On 2023-10-18 10:59 a.m., Leonard Mada via R-help wrote:
> Dear List members,
>
> What is the best way to test for numeric digits?
>
> suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
> # [1] NA NA NA  2 NA NA  3
> The above requires the use of the suppressWarnings function. Are there
> any better ways?
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
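One way to avoid the warning altogether, sketched here under the assumption that only plain runs of digits should count as numbers, is to test each string first and convert only those that pass. The helper name digits_to_number is hypothetical, and this is not necessarily the approach taken in the linked answers.

```r
# Hypothetical helper: convert digit-only strings, leave everything else as NA.
# Assumption: "numeric" here means one or more ASCII digits and nothing else.
digits_to_number <- function(x) {
  is_digits <- grepl("^[0-9]+$", x)            # TRUE only for pure digit strings
  out <- rep(NA_real_, length(x))              # start with all NA
  out[is_digits] <- as.numeric(x[is_digits])   # safe: no coercion warning
  out
}

x <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")
digits_to_number(x)
```

Because as.numeric() only ever sees strings that are known to be digits, no warning is raised and suppressWarnings is not needed.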
Use any occurrence of one or more digits as a separator?

s <- c( "CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl" )
strsplit( s, "\\d+" )

On October 18, 2023 7:59:01 AM PDT, Leonard Mada via R-help <r-help at r-project.org> wrote:
>Dear List members,
>
>What is the best way to test for numeric digits?
>[...]
>I was working to extract chemical elements from a formula, something like this:
>[...]
>split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))

-- 
Sent from my phone. Please excuse my brevity.
On Wed, 18 Oct 2023 17:59:01 +0300
Leonard Mada via R-help <r-help at r-project.org> wrote:

> What is the best way to test for numeric digits?
>
> suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
> # [1] NA NA NA  2 NA NA  3
> The above requires the use of the suppressWarnings function. Are
> there any better ways?

This test also has the downside of accepting things like "1.2" and "+1e-100". Since you need digits only, why not use a regular expression to test for '^[0-9]+$'?

> I was working to extract chemical elements from a formula, something
> like this:

> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))

Perhaps the following function could be made to work in your cases?

function(x) regmatches(x, gregexec('([A-Z][a-z]*)([0-9]*)', x))

For each input string, the returned matrix retval has the element symbols in retval[2,] and the coefficients in retval[3,].

Do you need brackets? Charges? Non-stoichiometric compounds? (SMILES?)

> # broken in R 4.3.1
> # only slightly "erroneous" with stringi::stri_split
> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)

strsplit() has special historical behaviour about empty matches:
https://bugs.r-project.org/show_bug.cgi?id=16745

It's unfortunate that it doesn't split on empty matches the way you would intuitively expect it to, but changing the behaviour at this point is hard. Even adding a flag may be complicated to implement. Do you want such a flag?

-- 
Best regards,
Ivan
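A minimal sketch of how the gregexec() suggestion might be used, assuming R 4.1.0 or later (where gregexec() is available) and treating a missing coefficient as 1. The names parse_formula, elements and counts are hypothetical and chosen only for illustration.

```r
# Digits-only test versus numeric coercion: as.numeric() accepts all three
# strings below, while the regular expression accepts only the first.
grepl("^[0-9]+$", c("2", "1.2", "+1e-100"))

# Hypothetical wrapper around the gregexec() call from the message above.
parse_formula <- function(x) {
  m <- regmatches(x, gregexec("([A-Z][a-z]*)([0-9]*)", x))
  lapply(m, function(retval) {
    cnt <- retval[3, ]
    cnt[cnt == ""] <- "1"                  # assumption: no coefficient means 1
    list(elements = as.vector(retval[2, ]),
         counts   = as.integer(cnt))
  })
}

res <- parse_formula(c("CCl3F", "Li4Al4H16"))
res[[1]]$elements
res[[1]]$counts
```

Each matrix returned by regmatches() has one column per match: row 1 is the full match, row 2 the first capture group and row 3 the second. Brackets, charges and non-stoichiometric formulas are not handled by this pattern.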
At 15:59 on 18/10/2023, Leonard Mada via R-help wrote:
> Dear List members,
>
> What is the best way to test for numeric digits?
> [...]
> I was working to extract chemical elements from a formula, something
> like this:
> [...]
> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))

Hello,

If you want to extract chemical element symbols, the following might work. It uses the periodic table from the GitHub package chemr and a function from the stringr package.

devtools::install_github("paleolimbot/chemr")

split_chem_elements <- function(x) {
  data(pt, package = "chemr", envir = environment())
  el <- pt$symbol[order(nchar(pt$symbol), decreasing = TRUE)]
  pat <- paste(el, collapse = "|")
  stringr::str_extract_all(x, pat)
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#> [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

It is also possible to rewrite the function without calls to non-base packages, but that will take some more work.

Hope this helps,

Rui Barradas
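A rough sketch of what a base-only variant could look like, using regmatches() and gregexpr() in place of stringr. The symbol vector below is a small, deliberately incomplete list made up for this illustration; a real version would need the full periodic table, for example the one from chemr. The function name split_chem_elements_base is hypothetical.

```r
# Assumption: an incomplete, hand-written list of element symbols,
# just enough to cover the example formulas.
symbols <- c("H", "Li", "C", "O", "F", "Na", "Al", "Si", "P", "Cl",
             "K", "Ca", "Rb")

split_chem_elements_base <- function(x, symbols) {
  # Longest symbols first, so that "Cl" is tried before "C"
  el  <- symbols[order(nchar(symbols), decreasing = TRUE)]
  pat <- paste(el, collapse = "|")
  regmatches(x, gregexpr(pat, x))
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
res <- split_chem_elements_base(mol, symbols)
res
```

As with the stringr version, any symbol missing from the list is silently skipped, so the quality of the result depends on the completeness of the symbol table.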
This seems unnecessarily complex. Or rather, it pushes the complexity into an arcane notation. What we really want is something that says "here is a string, here is a pattern, give me all the substrings that match." What we're given is a function that tells us where those substrings are.

# greg.matches(pattern, text)
# accepts a POSIX regular expression, pattern
# and a text to search in. Both arguments must be character strings
# (length(...) = 1) not longer vectors of strings.
# It returns a character vector of all the (non-overlapping)
# substrings of text as determined by gregexpr.

greg.matches <- function (pattern, text) {
    if (length(pattern) > 1) stop("pattern has too many elements")
    if (length(text) > 1) stop( "text has too many elements")
    match.info <- gregexpr(pattern, text)
    starts <- match.info[[1]]
    stops <- attr(starts, "match.length") - 1 + starts
    sapply(seq(along=starts), function (i) {
        substr(text, starts[i], stops[i])
    })
}

Given greg.matches, we can do the rest with very simple and easily comprehended regular expressions.

# parse.chemical(formula)
# takes a simple chemical formula "<element><count>..." and
# returns a list with components
# $elements -- character -- the atom symbols
# $counts -- number -- the counts (missing counts taken as 1).
# BEWARE. This does not handle formulas like "CH(OH)3".

parse.chemical <- function (formula) {
    parts <- greg.matches("[A-Z][a-z]*[0-9]*", formula)
    elements <- gsub("[0-9]+", "", parts)
    counts <- as.numeric(gsub("[^0-9]+", "", parts))
    counts <- ifelse(is.na(counts), 1, counts)
    list(elements=elements, counts=counts)
}

> parse.chemical("CCl3F")
$elements
[1] "C"  "Cl" "F"

$counts
[1] 1 3 1

> parse.chemical("Li4Al4H16")
$elements
[1] "Li" "Al" "H"

$counts
[1]  4  4 16

> parse.chemical("CCl2CO2AlPO4SiO4Cl")
$elements
 [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

$counts
 [1] 1 2 1 2 1 1 4 1 4 1

On Thu, 19 Oct 2023 at 03:59, Leonard Mada via R-help <r-help at r-project.org> wrote:
> Dear List members,
>
> What is the best way to test for numeric digits?
> [...]
> I was working to extract chemical elements from a formula, something
> like this:
> [...]
> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
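For comparison, base R's regmatches() already returns the matched substrings when given the result of gregexpr(), so a helper with the same contract as greg.matches can be sketched in a couple of lines. The name greg.matches2 is hypothetical; this is only an alternative formulation, not a replacement the original author proposed.

```r
# Sketch: one pattern, one text, all non-overlapping matches as a
# character vector, built on base regmatches().
greg.matches2 <- function(pattern, text) {
  stopifnot(length(pattern) == 1, length(text) == 1)
  regmatches(text, gregexpr(pattern, text))[[1]]
}

parts <- greg.matches2("[A-Z][a-z]*[0-9]*", "Li4Al4H16")
parts

# With no match at all, this version returns character(0).
none <- greg.matches2("[0-9]+", "abc")
length(none)
```

One difference worth checking before swapping it in: when nothing matches, gregexpr() reports a start of -1, and as far as I can tell the substr()-based version then yields a single empty string rather than a zero-length vector.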
@vi@e@gross m@iii@g oii gm@ii@com
2023-Oct-20 17:27 UTC
[R] Best way to test for numeric digits?
Leonard,

Since it now seems a main consideration you have is speed/efficiency, maybe a step back might help. Are there simplifying assumptions that are valid or can you make it simpler, such as converting everything to the same case?

Your sample data was this and I assume your actual data is similar and far longer.

c("Li", "Na", "K", "2", "Rb", "Ca", "3")

So rather than use complex and costly regular expressions, or other full searches, can you just assume all entries start with either an uppercase letter or a numeral and test for those using something simple like

> substr(c("Li", "Na", "K", "2", "Rb", "Ca", "3"), 1, 1)
[1] "L" "N" "K" "2" "R" "C" "3"

If you save that in a variable you can check if that is greater than or equal to "A" or perhaps "0" and also perhaps if it is less than or equal to "Z" or perhaps "9" and see if such a test is faster.

orig <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")
initial <- substr(orig, 1, 1)
elements_bool <- initial >= "A" & initial <= "Z"

The latter contains a Boolean vector you can use to index your original and toss away the ones with digits, or any lower case letter versions or any other UNICODE symbols.

orig_elements <- orig[elements_bool]

> orig
[1] "Li" "Na" "K"  "2"  "Rb" "Ca" "3"
> orig_elements
[1] "Li" "Na" "K"  "Rb" "Ca"
> orig[!elements_bool]
[1] "2" "3"

Another approach you might consider, depending on your needs, is to encapsulate your data as a column in a data.frame or tibble or other such construct and generate additional columns along the way that keep your information consolidated. That could be efficient, especially if you shift some of your logic to faster compiled functionality, perhaps using packages that fit your needs better such as data.table or dplyr and other things in the tidyverse. And note if using pipelines, for many purposes, the new built-in pipelines may be faster.
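One caveat with range comparisons on strings: R compares character values using the collation order of the current locale, so in some locales a lower-case letter can sort between "A" and "Z". A sketch of the same first-character test using set membership, which does not depend on the locale, follows. The variable names is_element and is_count are hypothetical, and the test assumes plain ASCII symbols.

```r
orig <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")
initial <- substr(orig, 1, 1)

# LETTERS is the built-in vector "A".."Z"; %in% is an exact match,
# so lower-case letters and other symbols are never accepted.
is_element <- initial %in% LETTERS
is_count   <- initial %in% as.character(0:9)

orig[is_element]
orig[is_count]
```

Whether this is faster than the comparison operators or a regular expression on long vectors would need to be measured on the real data.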
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Leonard Mada via R-help
Sent: Wednesday, October 18, 2023 10:59 AM
To: R-help Mailing List <r-help at r-project.org>
Subject: [R] Best way to test for numeric digits?

Dear List members,

What is the best way to test for numeric digits?

[...]