thr3ads.net - R help - [R] Do grep() and strsplit() use different regex engines? [Jul 2015]


Bert Gunter

2015-Jul-11 14:47 UTC

[R] Do grep() and strsplit() use different regex engines?

I noticed the following:
> strsplit("red green","\\b")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"

> strsplit("red green","\\W")
[[1]]
[1] "red"   "green"

I would have thought that "\\b" should give what "\\W" did.
Note that:
> grep("\\bred\\b","red green")
[1] 1
## as expected

Does strsplit use a different regex engine than grep()? Or more
likely, what am I misunderstanding?

Thanks.

Bert


Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll

Jeff Newmiller

2015-Jul-11 15:52 UTC

"\\b" is a zero length match. strsplit seems to chop at least one
character off the beginning of the string if it sees a match, and then it looks
at the shortened string that remains and repeats.
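
The loop described above can be sketched in plain R. This is only an illustration of the apparent behaviour, not the actual C implementation; split_sketch() is a hypothetical helper, and the rule "a zero-length match at the start consumes one character" is an assumption taken from the description above.

```r
# Hypothetical helper (not part of base R): imitates the loop that
# strsplit() appears to run for one string and one regex split pattern.
split_sketch <- function(x, pattern) {
  pieces <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x)            # first match in what is left
    if (m == -1) {                      # no match: keep the rest and stop
      pieces <- c(pieces, x)
      break
    }
    len <- attr(m, "match.length")
    if (m == 1 && len == 0) {
      # Zero-length match at the very start: emit one character as a piece
      pieces <- c(pieces, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    } else {
      # Otherwise: keep the text left of the match and drop the match itself
      pieces <- c(pieces, substr(x, 1, m - 1))
      x <- substr(x, m + len, nchar(x))
    }
  }
  pieces
}

split_sketch("red green", "\\b")   # every remainder starts at a word boundary
split_sketch("red green", "\\W")   # the space is a real, one-character match
```

Under this sketch, each shortened remainder ("ed green", "d green", ...) begins with a word character, so its start counts as a word boundary again and only one character is emitted per pass. The remainder " green" is the exception: its first boundary lies after the space, so " " comes out as its own piece.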
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.


Bert Gunter

2015-Jul-11 16:19 UTC

Thanks Jeff. That doesn't explain it for me. Could you go through the
algorithm one step at a time to show why it splits at the individual
characters rather than at the words, perhaps privately? Feel free to
refuse, as I'm sure you have better things to do.

-- Bert


Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll



David Winsemius

2015-Jul-11 18:05 UTC

On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
> I noticed the following:
> 
>> strsplit("red green","\\b")
> [[1]]
> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"

After reading the ?regex help page, I didn't understand why `\b` would split
within sequences of "word" characters, either. I expected this to be
the result:

[[1]]
[1] "red"  " "  "green"

There is a warning in that paragraph: "(The interpretation of 'word'
depends on the locale and implementation.)"

I got the expected result with only one of "\\>" and "\\<":

> strsplit("red green","\\<")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"

> strsplit("red green","\\>")
[[1]]
[1] "red"    " green"

The result with "\\<" seems decidedly unexpected.
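
For the result expected earlier in this message ("red", " ", "green"), one way to sidestep zero-length splits is to extract the pieces rather than split on boundaries. A minimal sketch, assuming the default regex engine accepts "\\w" and "\\W" here as it does "\\W" above; the variable names are illustrative only:

```r
x <- "red green"

# Pull out alternating runs of word and non-word characters instead of
# asking strsplit() to cut at zero-length word boundaries.
tokens <- regmatches(x, gregexpr("\\w+|\\W+", x))[[1]]
tokens
```

Every match here has nonzero width, so the special handling of empty matches never comes into play.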

I wondered whether the "original" regex documentation uses the same
language as the R help page, so I went to the cited website and found:

======
An assertion-character can be any of the following:

	< - Beginning of word
	> - End of word
	b - Word boundary
	B - Non-word boundary
	d - Digit character (equivalent to [[:digit:]])
	D - Non-digit character (equivalent to [^[:digit:]])
	s - Space character (equivalent to [[:space:]])
	S - Non-space character (equivalent to [^[:space:]])
	w - Word character (equivalent to [[:alnum:]_])
	W - Non-word character (equivalent to [^[:alnum:]_])
======

The word "word" appears nowhere else on that page.

>> strsplit("red green","\\W")
> [[1]]
> [1] "red"   "green"

`\W` matches an actual non-word character, one character wide, rather than
a zero-width position. So the " " character is consumed by the match and
discarded.

> 
> I would have thought that "\\b" should give what "\\W" did. Note that:
> 
>> grep("\\bred\\b","red green")
> [1] 1
> ## as expected
> 
> Does strsplit use a different regex engine than grep()? Or more
> likely, what am I misunderstanding?
> 
> Thanks.
> 
> Bert
> 
> 

David Winsemius
Alameda, CA, USA

David Winsemius

2015-Jul-11 18:14 UTC

This page:

http://www.regular-expressions.info/wordboundaries.html

implies that naked boundaries were not expected to be used, and that
"\B" and "\b" were expected to be "flanking" patterns, with the real
"meat" either sandwiched between them or perhaps at either end.

> strsplit( "     red green   blue", split="\\b   \\b")
[[1]]
[1] "     red green" "blue"


So perhaps there is an implicit "any-word" that follows the "\\b"
assertion?

> strsplit( "redgreen", split="\\bgreen")
[[1]]
[1] "redgreen"

> strsplit( "redgreen", split="green\\b")
[[1]]
[1] "red"
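
A related sketch, going beyond what was tried above: with perl = TRUE, a boundary can be spelled out with lookarounds, so that a character is required on each side of the split point. This assumes PCRE lookbehind/lookahead support in strsplit(); the pattern and variable names are illustrative, not from the help page.

```r
x <- "red green"

# Split where a word character meets a non-word character, or the reverse.
# The lookbehind needs a character to its left, so it cannot match at the
# very start of the string (or of a shortened remainder).
boundary <- "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)"
parts <- strsplit(x, boundary, perl = TRUE)[[1]]
parts
```

Because the lookbehind can never succeed at position one, the "zero-length match at the start" case that chops off single characters should not arise with this pattern.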


-- 
David.
David Winsemius
Alameda, CA, USA

Bert Gunter

2015-Jul-11 22:07 UTC

David/Jeff:

Thank you both.

You seem to confirm that my observation of an "infelicity" in
strsplit() is real. That is most helpful.

I found nothing surprising in the code in David's second message. That
is, the splits shown there conform to what I would expect from "\\b",
unlike what I originally showed and David enlarged upon in his first
message. I still don't really get why a split should occur at every
letter.

Jeff may very well have found the explanation, but I have not gone
through his code.

If the infelicities noted (are there more?) by David and me are not
really bugs -- and I would be frankly surprised if they were -- I
would suggest that perhaps they deserve mention in the strsplit() man
page. Something to the effect that "\b and \< should not be used as
split characters...".

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


