Dear Emily,
I have written a more robust version of the function:
extract.nonLetters = function(x, rm.space = TRUE, normalize = TRUE,
        sort = TRUE) {
    # normalize to NFC, so that composed characters are handled consistently;
    if(normalize) x = stringi::stri_trans_nfc(x);
    ch = strsplit(x, "", fixed = TRUE);
    ch = unique(unlist(ch));
    if(sort) ch = sort(ch);
    # each element is a single character: flag letters (and optionally space);
    pat = if(rm.space) "^[a-zA-Z ]" else "^[a-zA-Z]";
    isLetter = grepl(pat, ch);
    ch = ch[ ! isLetter];
    return(stringi::stri_escape_unicode(ch));
}
extract.nonLetters(str)
# "\\u2013" "+"
This code point ("\u2013", the en dash) is already covered by the expanded Regex expression:
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
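
A minimal sketch of this split on the two sample strings. It assumes that the dash in the second string is the en dash U+2013; the dash is written as an escape here, so that the example does not depend on the encoding of the mail:

```r
# Sample strings from the earlier message; the dash in the second string
# is assumed to be the en dash U+2013 (written as an escape).
str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
        "leucocyten \u2013 grampositieve coccen +")

# Split on whitespace that directly follows "-", "+" or one of the
# dash variants U+2010 .. U+2014 (look-behind, so the sign is kept).
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl = TRUE)
print(tokens)
```

The look-behind keeps the sign attached to the preceding token, so the first string should yield 3 tokens and the second string 2 tokens.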
Sincerely,
Leonard
On 4/13/2023 9:40 PM, Leonard Mada wrote:
> Dear Emily,
>
> Using a look-behind solves the split problem in this case. (Note:
> Using Regex is in most/many cases the simplest solution.)
>
> str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
>         "leucocyten – grampositieve coccen +")
>
> tokens = strsplit(str, "(?<=[-+])\\s++", perl=TRUE)
>
> PROBLEM
> The current expression does NOT work for a different reason: the "-"
> is coded using a NON-ASCII character.
>
> I have written a small utility function to approximately extract
> "non-standard" characters:
> ### Identify non-ASCII Characters
> # beware: the filtering and the sorting may break the codes;
> extract.nonLetters = function(x, rm.space = TRUE, sort=FALSE) {
>     code = as.numeric(unique(unlist(lapply(x, charToRaw))));
>     isLetter =
>         (code >= 97 & code <= 122) |
>         (code >= 65 & code <= 90);
>     code = code[ ! isLetter];
>     if(rm.space) {
>         # removes only simple space!
>         code = code[code != 32];
>     }
>     if(sort) code = sort(code);
>     return(code);
> }
> extract.nonLetters(str, sort = FALSE)
> # 43 226 128 147
>
> Note:
> - the code for "+" is 43, and for simple "-" is 45:
>   as.numeric(charToRaw("+-"));
> - "226 128 147" encodes a different character, but it is not trivial
>   to get the Unicode code point;
>   https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=dec
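
For reference, base R can also recover the code point from these bytes. A minimal sketch, assuming that the three bytes 226 128 147 are the UTF-8 encoding of a single character:

```r
# The three bytes reported by the byte-level function.
bytes = as.raw(c(226, 128, 147))

# Reassemble the bytes into a string and declare the encoding as UTF-8.
ch = rawToChar(bytes)
Encoding(ch) = "UTF-8"

# utf8ToInt() returns the Unicode code point as an integer.
cp = utf8ToInt(ch)
sprintf("U+%04X", cp)
# "U+2013" (en dash)
```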
>
>
> The following is a more comprehensive Regex expression, which accepts
> many variants of "-":
> tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
>
> Sincerely,
>
> Leonard
>
>