thr3ads.net - R help - [R] Split String in regex while Keeping Delimiter [Apr 2023]

Leonard Mada

2023-Apr-13 18:40 UTC

[R] Split String in regex while Keeping Delimiter

Dear Emily,

Using a look-behind solves the split problem in this case. (Note: Using 
Regex is in most/many cases the simplest solution.)

str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
"leucocyten – grampositieve coccen +")

tokens = strsplit(str, "(?<=[-+])\\s++", perl=TRUE)
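
To illustrate the look-behind split, here is a minimal sketch on a short ASCII-only input (x_demo and tokens_demo are made-up names, not part of the original data). The pattern matches whitespace only when the preceding character is "-" or "+"; since the look-behind itself consumes nothing, the sign stays attached to the token on its left.

```r
# Hypothetical ASCII-only input, for illustration only
x_demo = "a + b c +++ d -"

# Split on whitespace that is preceded by "+" or "-";
# whitespace after a letter does not match and is kept inside the token
tokens_demo = strsplit(x_demo, "(?<=[-+])\\s++", perl = TRUE)[[1]]
print(tokens_demo)
# "a +"     "b c +++" "d -"
```

Note that "b c" stays in one token, because the space between the letters is not preceded by a sign.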

PROBLEM
The current expression does NOT work for a different reason: the "-" is
coded using a NON-ASCII character.

I have written a small utility function to approximately extract 
"non-standard" characters:
### Identify non-ASCII Characters
# beware: the filtering and the sorting may break the codes;
extract.nonLetters = function(x, rm.space = TRUE, sort=FALSE) {
    # unique byte values over all strings
    code = as.numeric(unique(unlist(lapply(x, charToRaw))));
    # ASCII letters: a-z (97-122) and A-Z (65-90)
    isLetter =
        (code >= 97 & code <= 122) |
        (code >= 65 & code <= 90);
    code = code[ ! isLetter];
    if(rm.space) {
        # removes only simple space!
        code = code[code != 32];
    }
    if(sort) code = sort(code);
    return(code);
}
extract.nonLetters(str, sort = FALSE)
# 43 226 128 147

Note:
- the code for "+" is 43, and for simple "-" is 45:
as.numeric(charToRaw("+-"));
- "226 128 147" codes something else, but it is not trivial to get the
Unicode code point;
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=dec
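
As a possible shortcut to the table lookup, the three bytes can be decoded in base R, assuming they form one valid UTF-8 sequence. This is only a sketch; the variable names bytes, ch and cp are made up for the example.

```r
# Bytes reported above for the unknown character
bytes = as.raw(c(226, 128, 147))

# Reassemble the bytes into a string and mark it as UTF-8
ch = rawToChar(bytes)
Encoding(ch) = "UTF-8"

# utf8ToInt() returns the Unicode code point as an integer
cp = utf8ToInt(ch)
sprintf("U+%04X", cp)
# "U+2013", the EN DASH
```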

The following is a more comprehensive Regex expression, which accepts 
many variants of "-":
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
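
The difference between the two patterns can be checked on a short string that contains an en dash (U+2013) in place of the ASCII "-". The input x_dash and the other names below are hypothetical and serve only as an illustration.

```r
# Hypothetical input with an en dash (U+2013) instead of ASCII "-"
x_dash = "aa \u2013 bb +"

pat_ascii = "(?<=[-+])\\s++"               # ASCII "-" and "+" only
pat_wide  = "(?<=[-+\u2010-\u2014])\\s++"  # also U+2010 to U+2014

tok_ascii = strsplit(x_dash, pat_ascii, perl = TRUE)[[1]]
tok_wide  = strsplit(x_dash, pat_wide, perl = TRUE)[[1]]

length(tok_ascii)  # 1: the en dash is not recognised, no split
length(tok_wide)   # 2: split after the en dash
```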

Sincerely,

Leonard

Leonard Mada

2023-Apr-13 19:15 UTC

[R] Split String in regex while Keeping Delimiter

Dear Emily,

I have written a more robust version of the function:
extract.nonLetters = function(x, rm.space = TRUE, normalize=TRUE,
        sort=TRUE) {
    # requires package stringi;
    # works on the argument x (not on the global variable str)
    if(normalize) x = stringi::stri_trans_nfc(x);
    # split into individual characters
    ch = strsplit(x, "", fixed = TRUE);
    ch = unique(unlist(ch));
    if(sort) ch = sort(ch);
    pat = if(rm.space) "^[a-zA-Z ]" else "^[a-zA-Z]";
    isLetter = grepl(pat, ch);
    ch = ch[ ! isLetter];
    return(stringi::stri_escape_unicode(ch));
}
extract.nonLetters(str)
# "\\u2013" "+"

This code ("\u2013") is included in the expanded Regex expression:
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)


Sincerely,

Leonard


