thr3ads.net - R devel - [Rd] strsplit and the empty string [Jun 2008]


Wacek Kusnierczyk

2008-Jun-18 12:45 UTC

[Rd] strsplit and the empty string

Hello,

I am wondering about the behaviour of strsplit.  When the pattern
matches the beginning of the search string, the empty string is added to
the result, but that's not the case when the pattern matches the end of
the search string:

strsplit(" hello dolly ")
[1] "" "hello" "dolly"

The man for strsplit explains the algorithm:

"
 The algorithm applied to each input string is


         repeat {
             if the string is empty
                 break.
             if there is a match
                 add the string to the left of the match to the output.
                 remove the match and all to the left of it.
             else
                 add the string to the output.
                 break.
         }

     Note that this means that if there is a match at the beginning of
     a (non-empty) string, the first element of the output is '""', but
     if there is a match at the end of the string, the output is the
     same as with the match removed.
"

I do not see how this algorithm specifies that there should be no empty
string at the end of the output if the pattern matches the end of the
input string.
If the pattern matches, (second if above), the match is added to the
output, and removed from the input -- which after this step is the empty
string; in the next step, there is no match (else above), so the rest of
the input string (= the empty string) *should* be added, but it is not
what happens. 

I think that the implementation of the algorithm (and the explanation
that "if there is a match at the end of the string, the output is the
same as with the match removed") is both unintuitive (i see no good
reason for including the empty string at the beginning but not at the
end of the output; no other language i know would do that this way) and
actually wrong wrt. the algorithm.

Any opinion?  What was the ground for this design?

vQ




-- 
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD

Email: waku at idi.ntnu.no
Phone: +47 73591875, +47 72574609

Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303

Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060

Christian Brechbühler

2008-Jun-18 14:59 UTC

[Rd] strsplit and the empty string

On Wed, Jun 18, 2008 at 8:45 AM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> asked for opinions:
>
> When the pattern
> matches the beginning of the search string, the empty string is added to
> the result, but that's not the case when the pattern matches the end of
> the search string:
>
> strsplit(" hello dolly ")
> [1] "" "hello" "dolly"
With R version 2.6.1 Patched (2007-11-26 r43541), I get
    Error in strsplit(" hello dolly ") :
      argument "split" is missing, with no default

But strsplit(" hello dolly ", " ") reproduces your results.
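
For reference, a minimal reproduction with the split argument supplied explicitly. Note that strsplit returns a list with one element per input string, so the full printed form differs slightly from the abbreviated output quoted above. The fixed = TRUE argument is an addition here, used only so that the separator is treated as a literal string rather than a regular expression.

```r
x <- " hello dolly "

# strsplit returns a list with one character vector per input string
parts <- strsplit(x, " ", fixed = TRUE)

# Extract the vector for the single input string
pieces <- parts[[1]]

# Leading empty string is kept, trailing one is not: three elements
length(pieces)
pieces[1] == ""
```
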
> The man for strsplit explains the algorithm:
>
> "
>  The algorithm applied to each input string is
>
>
>         repeat {
>             if the string is empty
>                 break.
>             if there is a match
>                 add the string to the left of the match to the output.
>                 remove the match and all to the left of it.
>             else
>                 add the string to the output.
>                 break.
>         }
>
>     Note that this means that if there is a match at the beginning of
>     a (non-empty) string, the first element of the output is '""', but
>     if there is a match at the end of the string, the output is the
>     same as with the match removed.
> "
The algorithm, the comment after it, and your results are consistent.
Whether it is intuitive is a matter of taste.  I agree it's not as
symmetric as one might like.
> If the pattern matches, (second if above), the match is added to the
> output, and removed from the input -- which after this step is the empty
> string;
Close.  The string to the left of the match, "dolly", is added to the
output.
I agree, the input is now the empty string.
> in the next step, there is no match (else above), so the rest of
> the input string (= the empty string) *should* be added, but it is not
> what happens.
No, in the next step, the string is empty (first 'if' above), and we
break.
The else branch never applies in your example.
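
The trace above can be checked with a small sketch that follows the documented pseudocode step by step. This is only an illustration, not how strsplit is implemented internally: split_sketch is a hypothetical helper, and it assumes a single input string and a literal, non-empty separator.

```r
# Hypothetical helper (not part of R): mirrors the pseudocode from the
# strsplit help page for one string and a literal, non-empty separator.
split_sketch <- function(s, sep) {
  out <- character(0)
  repeat {
    # "if the string is empty: break"
    if (nchar(s) == 0) break

    # Position of the first match, or -1 if there is none
    pos <- as.integer(regexpr(sep, s, fixed = TRUE))

    if (pos > 0) {
      # "add the string to the left of the match to the output"
      out <- c(out, substr(s, 1, pos - 1))
      # "remove the match and all to the left of it"
      s <- substr(s, pos + nchar(sep), nchar(s))
    } else {
      # "add the string to the output" and stop
      out <- c(out, s)
      break
    }
  }
  out
}

result <- split_sketch(" hello dolly ", " ")
length(result)   # 3
```

After "dolly" is added, the remaining input is the empty string, so the loop exits at the first test and the else branch is never reached. That is why no trailing "" appears, while the match at the start still contributes a leading "".
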
> (i see no good
> reason for including the empty string at the beginning but not at the
> end of the output; no other language i know would do that this way)
I checked Perl, and it does exactly the same:
  print join "==", split / /, " hello dolly "
==hello==dolly
(that's 3 elements: "", "hello", and "dolly").

Cheers,
/Christian
