thr3ads.net - R help - [R] Do grep() and strsplit() use different regex engines? [Jul 2015]

Bert Gunter

2015-Jul-11 22:07 UTC

[R] Do grep() and strsplit() use different regex engines?

David/Jeff:

Thank you both.

You seem to confirm that my observation of an "infelicity" in
strsplit() is real. That is most helpful.

I found nothing in David's message 2 code that was surprising. That
is, the splits shown conform to what I would expect from "\\b" . But
not to what I originally showed and David enlarged upon in his first
message. I still don't really get why a split should occur at every
letter.

Jeff may very well have found the explanation, but I have not gone
through his code.

If the infelicities noted (are there more?) by David and me are not
really bugs -- and I would be frankly surprised if they were -- I
would suggest that perhaps they deserve mention in the strsplit() man
page. Something to the effect that "\b and \< should not be used as
split characters..." .

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius
<dwinsemius at comcast.net> wrote:
>
> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
>
>> I noticed the following:
>>
>>> strsplit("red green","\\b")
>> [[1]]
>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>
> After reading the ?regex help page, I didn't understand why `\b` would
> split within sequences of "word"-characters, either. I expected this
> to be the result:
>
> [[1]]
> [1] "red"  " "  "green"
>
> There is a warning in that paragraph: "(The interpretation of 'word'
> depends on the locale and implementation.)"
>
> I got the expected result with only one of "\\>" and "\\<":
>
>> strsplit("red green","\\<")
> [[1]]
> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>
>> strsplit("red green","\\>")
> [[1]]
> [1] "red"    " green"
>
> The result with "\\<" seems decidedly unexpected.
>
> I wondered whether the "original" regex documentation uses the
> same language as the R help page. So I went to the cited website and found:
> ======
> An assertion-character can be any of the following:
>
>         <   Beginning of word
>         >   End of word
>         b   Word boundary
>         B   Non-word boundary
>         d   Digit character (equivalent to [[:digit:]])
>         D   Non-digit character (equivalent to [^[:digit:]])
>         s   Space character (equivalent to [[:space:]])
>         S   Non-space character (equivalent to [^[:space:]])
>         w   Word character (equivalent to [[:alnum:]_])
>         W   Non-word character (equivalent to [^[:alnum:]_])
> =======
>
> The word "word" appears nowhere else on that page.
>
>
>>> strsplit("red green","\\W")
>> [[1]]
>> [1] "red"   "green"
>
> `\W` matches the byte-width non-word characters. So the " "-character
> would be discarded.
>
>>
>> I would have thought that "\\b" should give what "\\W" did. Note that:
>>
>>> grep("\\bred\\b","red green")
>> [1] 1
>> ## as expected
>>
>> Does strsplit use a different regex engine than grep()? Or more
>> likely, what am I misunderstanding?
>>
>> Thanks.
>>
>> Bert
>>
>>
>
>
> David Winsemius
> Alameda, CA, USA
>
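
One way to see the difference between "\\b" and "\\W" is to ask where each pattern first matches and how many characters that match consumes. The lines below are only a minimal sketch, assuming the default engine behaves as in the outputs quoted above; the variable names are illustrative.

```r
x <- "red green"

# "\\W" matches an actual character (the space), so the match has width 1.
m_W <- regexpr("\\W", x)
as.integer(m_W)             # 4: the space is the 4th character
attr(m_W, "match.length")   # 1

# "\\b" is an assertion: it matches a position, not a character (width 0).
m_b <- regexpr("\\b", x)
as.integer(m_b)             # 1: the boundary before "r"
attr(m_b, "match.length")   # 0

# grep()/grepl() only ask whether a match exists, so the zero width is harmless.
grepl("\\bred\\b", x)       # TRUE
```

Seen this way, grep() and strsplit() need not be using different engines; the difference would lie in what each function does with a match of width 0.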

David Winsemius

2015-Jul-11 22:31 UTC

On Jul 11, 2015, at 3:07 PM, Bert Gunter wrote:
> David/Jeff:
> 
> Thank you both.
> 
> You seem to confirm that my observation of an "infelicity" in
> strsplit() is real. That is most helpful.
> 
> I found nothing in David's message 2 code that was surprising. That
> is, the splits shown conform to what I would expect from "\\b" . But
> not to what I originally showed and David enlarged upon in his first
> message. I still don't really get why a split should occur at every
> letter.
> 
> Jeff may very well have found the explanation, but I have not gone
> through his code.
> 
> If the infelicities noted (are there more?) by David and me are not
> really bugs -- and I would be frankly surprised if they were -- I
> would suggest that perhaps they deserve mention in the strsplit() man
> page. Something to the effect that "\b and \< should not be used as
> split characters..." .
It's more of a regex infelicity or what appears (to us both at a minimum) 
as a violation of a 'least surprise principle':
> gsub("\\b", " ", "  This is a test case")
[1] "     T h i s   i s   a   t e s t   c a s e "


-- 
David.
David Winsemius
Alameda, CA, USA

Bert Gunter

2015-Jul-11 23:12 UTC

omigosh -- you're right.

-- Bert
Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll



Charles C. Berry

2015-Jul-11 23:26 UTC

On Sat, 11 Jul 2015, Bert Gunter wrote:
> David/Jeff:
>
> Thank you both.
>
> You seem to confirm that my observation of an "infelicity" in
> strsplit() is real. That is most helpful.
>
> I found nothing in David's message 2 code that was surprising. That
> is, the splits shown conform to what I would expect from "\\b" . But
> not to what I originally showed and David enlarged upon in his first
> message. I still don't really get why a split should occur at every
> letter.
>
> Jeff may very well have found the explanation, but I have not gone
> through his code.
>
> If the infelicities noted (are there more?) by David and me are not
> really bugs -- and I would be frankly surprised if they were -- I
> would suggest that perhaps they deserve mention in the strsplit() man
> page. Something to the effect that "\b and \< should not be used as
> split characters..." .
Bert et al,

?strsplit already says:

"If empty matches occur, in particular if split has length 0, x is split 
into single characters."

And there are various ways that empty matches can happen besides using
"\\b" as the split arg. But there would be no harm in adding your
cases to 'in particular ...'

The comment in the code (src/main/grep.c: line 493) suggests this was a 
deliberate decision. However, similar functions in other languages do not 
do this.

For example, emacs `(split-string "red green" "\\b")' gives

 	("" "red" " " "green" "")

as the result.

Chuck
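
The rule quoted from ?strsplit can be made concrete with a rough re-creation in R. This is only a sketch, not the C source: it assumes strsplit() repeatedly searches the remaining string and, when the match ends at offset 0 (an empty match at the very start), emits a single character and moves on. `split_sketch` is a hypothetical helper, not part of R.

```r
# Hypothetical helper: mimics the assumed strsplit() loop using regexpr().
split_sketch <- function(x, pattern, perl = FALSE) {
  tokens <- character(0)
  rest <- x
  while (nzchar(rest)) {
    m <- regexpr(pattern, rest, perl = perl)
    pos <- as.integer(m)                 # 1-based start of the match, -1 if none
    if (pos == -1L) {                    # no further match: remainder is the last token
      tokens <- c(tokens, rest)
      break
    }
    end <- pos + attr(m, "match.length") - 1L   # characters consumed up to match end
    if (end > 0L) {
      # Match ends past the start: emit the text before it, skip to its end.
      tokens <- c(tokens, substr(rest, 1L, pos - 1L))
      rest <- substr(rest, end + 1L, nchar(rest))
    } else {
      # Empty match at the very start: emit one character and advance by one.
      tokens <- c(tokens, substr(rest, 1L, 1L))
      rest <- substr(rest, 2L, nchar(rest))
    }
  }
  tokens
}

split_sketch("red green", "\\b")   # "r" "e" "d" " " "g" "r" "e" "e" "n"
split_sketch("red green", "\\>")   # "red" " green"
split_sketch("red green", "\\W")   # "red" "green"
```

Under that assumption, each restart begins at a fresh "start of string", so "\\b" and "\\<" match again immediately whenever the remainder begins with a word character, which would account for the letter-by-letter splits reported earlier in the thread.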

Bert Gunter

2015-Jul-12 02:09 UTC

Thanks, Chuck (he says, red-faced).

Maybe I should read the man page more carefully ...!

And as for grep(), similar issues: (from ?grep)

"POSIX 1003.2 mode of gsub and gregexpr does not work correctly with
repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for
such matches (but that may not work as expected with non-ASCII inputs,
as the meaning of 'word' is system-dependent)."

And no, I don't think anything needs to be added to ?strsplit. The man
page writers spelled it out clearly. They're not responsible for my
dummheit.

My apologies to all for wasted bandwidth...


Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll
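
For anyone who wanted the split originally expected in this thread (words and separators as separate pieces), two base-R options are sketched below. They assume perl = TRUE is acceptable, as the ?grep note quoted above suggests; the variable names are illustrative only.

```r
x <- "red green"

# Option 1: extract the tokens instead of splitting. Every match is
# non-empty, so the empty-match rule in ?strsplit never comes into play.
tokens <- regmatches(x, gregexpr("\\w+|\\W+", x, perl = TRUE))[[1]]
tokens   # "red" " " "green"

# Option 2: following the ?grep note, use perl = TRUE when substituting
# at word boundaries.
marked <- gsub("\\b", "|", x, perl = TRUE)
marked   # "|red| |green|"
```

As the ?grep note warns, the meaning of 'word' is system-dependent, so non-ASCII input may behave differently.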


