thr3ads.net - R help - [R] regular expression, stringr::str


Sigbert Klinke

2020-Apr-28 09:29 UTC

[R] regular expression, stringr::str_view, grep

Hi,

we gave students the task to construct a regular expression selecting
some texts. One sent us back a program which gives different results
with stringr::str_view and grep.

The problem is "[^[A-Z]]" / "[^[A-Z]" at the end of the regular
expression. I would have expected that all four calls would give the
same result, interpreting [ and ] within [...] as the characters `[` and
`]`. Obviously this is not the case, and moreover stringr::str_view and
grep interpret the regular expressions differently.

Any ideas?

Thanks Sigbert

---

aff <- c("affgfking", "fgok", "rafgkahe", "a fgk", "bafghk", "affgm",
         "baffgkit", "afffhk", "affgfking", "fgok", "rafgkahe", "afg.K",
         "bafghk", "aff gm", "baffg kit", "afffhgk")

correct_brackets <- "af+g[^m$][^[A-Z]]"
missing_brackets <- "af+g[^m$][^[A-Z]"

library("stringr")
grep(correct_brackets, aff, value = TRUE) ### result: character(0)
grep(missing_brackets, aff, value = TRUE) ### correct result
str_view(aff, correct_brackets) ### correct result
str_view(aff, missing_brackets) ### error: missing closing bracket

-- 
https://hu.berlin/sk
https://hu.berlin/mmstat3

David Winsemius

2020-Apr-28 16:41 UTC


[R] regular expression, stringr::str_view, grep

On 4/28/20 2:29 AM, Sigbert Klinke wrote:
> Hi,
>
> we gave students the task to construct a regular expression selecting
> some texts. One sent us back a program which gives different results
> with stringr::str_view and grep.
>
> The problem is "[^[A-Z]]" / "[^[A-Z]" at the end of the regular
> expression. I would have expected that all four calls would give the
> same result, interpreting [ and ] within [...] as the characters `[`
> and `]`. Obviously this is not the case, and moreover stringr::str_view
> and grep interpret the regular expressions differently.
>
> Any ideas?
>
> Thanks Sigbert
>
> ---
>
> aff <- c("affgfking", "fgok", "rafgkahe", "a fgk", "bafghk", "affgm",
>          "baffgkit", "afffhk", "affgfking", "fgok", "rafgkahe", "afg.K",
>          "bafghk", "aff gm", "baffg kit", "afffhgk")

TL;DR: different versions of regex character class syntax:

> correct_brackets <- "af+g[^m$][^[A-Z]]"

To me that looks "incorrect" because of an unnecessary square bracket.

> missing_brackets <- "af+g[^m$][^[A-Z]"

And that one looks complete. To my mind it looks like the negation of a
character class with "[" and the range A-Z.

> library("stringr")

I think this is the root of your problem. If you execute ?regex you 
should be given a choice of two different help pages and if you go to 
the one from pkg stringr it says in the Usage section:

regex
The default. Uses ICU regular expressions.

So that's probably different than the base regex convention which uses 
TRE regular expressions.
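
A minimal sketch of that engine difference, using the patterns from the
original post. It assumes stringi (the engine behind stringr) may or may
not be installed, so the ICU part is guarded; the names icu_hits and
icu_error are illustrative only.

```r
aff <- c("affgfking", "fgok", "rafgkahe", "a fgk", "bafghk", "affgm",
         "baffgkit", "afffhk", "affgfking", "fgok", "rafgkahe", "afg.K",
         "bafghk", "aff gm", "baffg kit", "afffhgk")

correct_brackets <- "af+g[^m$][^[A-Z]]"
missing_brackets <- "af+g[^m$][^[A-Z]"

# Base R default engine (TRE): "[" inside a bracket expression is a
# literal character, so "[^[A-Z]" is already complete and the final "]"
# of correct_brackets has to match a literal "]" in the input.
tre_correct <- grep(correct_brackets, aff, value = TRUE)
tre_missing <- grep(missing_brackets, aff, value = TRUE)
length(tre_correct)   # no element of aff contains "]"
tre_missing

# ICU engine via stringi: an inner "[...]" opens a nested set, so
# missing_brackets is rejected as unbalanced.
if (requireNamespace("stringi", quietly = TRUE)) {
  icu_hits  <- aff[stringi::stri_detect_regex(aff, correct_brackets)]
  icu_error <- tryCatch(
    { stringi::stri_detect_regex(aff, missing_brackets); NULL },
    error = function(e) conditionMessage(e)
  )
  print(icu_hits)
  print(icu_error)
}
```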

You should carefully review:

help("stringi-search-charclass", package = "stringi")

I think you should also find that adding square brackets around ranges
is not needed in either type of regex syntax, but that stringi's regex
(unlike base R's TRE regex) does allow several bracketed ranges to be
nested inside the outer square brackets of a character class. I've never
seen that in base R regex. So I think that this base regex pattern,
grepl("([a-c]|[r-t])", letters), is the same as this stringi pattern:
str_view(letters, "[[a-c][r-t]]").
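
A small sketch of that equivalence, using the range a-c in both
patterns. The base R part runs as is; the ICU check is guarded because
stringi is assumed to be optional here, and the variable names are
illustrative only.

```r
# Base R (TRE): an alternation of two bracket expressions ...
alt <- grepl("([a-c]|[r-t])", letters)

# ... selects the same letters as one bracket expression listing both ranges.
merged <- grepl("[a-cr-t]", letters)
stopifnot(identical(alt, merged))
letters[alt]

# ICU (stringi/stringr): sets nested inside the outer brackets form a union.
if (requireNamespace("stringi", quietly = TRUE)) {
  icu <- stringi::stri_detect_regex(letters, "[[a-c][r-t]]")
  stopifnot(identical(icu, alt))
}
```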

-- 

David.

> grep(correct_brackets, aff, value = TRUE) ### result: character(0)
> grep(missing_brackets, aff, value = TRUE) ### correct result
> str_view(aff, correct_brackets) ### correct result
> str_view(aff, missing_brackets) ### error: missing closing bracket
>

Andy Spada

2020-Apr-28 17:22 UTC


[R] regular expression, stringr::str_view, grep

This highlights the literal meaning of the last ] in your correct_brackets:

aff <- c("affgfk]ing", "fgok", "rafgkah]e", "a fgk", "bafghk]")

To me, too, the missing_brackets looks more like what was desired, and
returns correct results for a PCRE. Perhaps the regular expression
should have been rewritten:

desired_brackets <- "af+g[^m$][^A-Z]"
grep(desired_brackets, aff, value = TRUE) ### correct result
str_view(aff, desired_brackets) ### correct result
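
Both points can be checked in base R alone. The sketch below is only an
illustration: aff_br and aff_orig are hypothetical names for the two
test vectors in this thread, and the stringi comparison is guarded
because the package is assumed to be optional.

```r
# Vector with literal "]" characters, as above.
aff_br <- c("affgfk]ing", "fgok", "rafgkah]e", "a fgk", "bafghk]")
correct_brackets <- "af+g[^m$][^[A-Z]]"

# TRE reads the final "]" as a literal, so only strings with a "]" in
# exactly that position are returned.
literal_hits <- grep(correct_brackets, aff_br, value = TRUE)
literal_hits

# Vector from the original post.
aff_orig <- c("affgfking", "fgok", "rafgkahe", "a fgk", "bafghk", "affgm",
              "baffgkit", "afffhk", "affgfking", "fgok", "rafgkahe", "afg.K",
              "bafghk", "aff gm", "baffg kit", "afffhgk")
missing_brackets <- "af+g[^m$][^[A-Z]"
desired_brackets <- "af+g[^m$][^A-Z]"

# The two patterns differ only in whether "[" is excluded, which does not
# matter for this data, so base R returns the same strings for both.
base_missing <- grep(missing_brackets, aff_orig, value = TRUE)
base_desired <- grep(desired_brackets, aff_orig, value = TRUE)
stopifnot(identical(base_missing, base_desired))

# desired_brackets has no nested brackets, so ICU accepts it as well.
if (requireNamespace("stringi", quietly = TRUE)) {
  icu_desired <- aff_orig[stringi::stri_detect_regex(aff_orig, desired_brackets)]
  print(icu_desired)
}
```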

Regards,
Andy


