Hello,

I am wondering about the behaviour of strsplit. When the pattern matches the beginning of the search string, the empty string is added to the result, but that's not the case when the pattern matches the end of the search string:

    strsplit(" hello dolly ")
    [1] "" "hello" "dolly"

The man page for strsplit explains the algorithm:

"
The algorithm applied to each input string is

    repeat {
        if the string is empty
            break.
        if there is a match
            add the string to the left of the match to the output.
            remove the match and all to the left of it.
        else
            add the string to the output.
            break.
    }

Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is '""', but if there is a match at the end of the string, the output is the same as with the match removed.
"

I do not see how this algorithm specifies that there should be no empty string at the end of the output if the pattern matches the end of the input string. If the pattern matches (second if above), the match is added to the output, and removed from the input -- which after this step is the empty string; in the next step, there is no match (else above), so the rest of the input string (= the empty string) *should* be added, but that is not what happens.

I think that the implementation of the algorithm (and the explanation that "if there is a match at the end of the string, the output is the same as with the match removed") is both unintuitive (I see no good reason for including the empty string at the beginning but not at the end of the output; no other language I know would do it this way) and actually wrong wrt. the algorithm.

Any opinions? What were the grounds for this design?
vQ

On Wed, Jun 18, 2008 at 8:45 AM, Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> asked for opinions:

> When the pattern
> matches the beginning of the search string, the empty string is added to
> the result, but that's not the case when the pattern matches the end of
> the search string:
>
> strsplit(" hello dolly ")
> [1] "" "hello" "dolly"

With R version 2.6.1 Patched (2007-11-26 r43541), I get

    Error in strsplit(" hello dolly ") :
      argument "split" is missing, with no default

But strsplit(" hello dolly ", " ") reproduces your results.

> The man for strsplit explains the algorithm:
>
> "
> The algorithm applied to each input string is
>
> repeat {
>     if the string is empty
>         break.
>     if there is a match
>         add the string to the left of the match to the output.
>         remove the match and all to the left of it.
>     else
>         add the string to the output.
>         break.
> }
>
> Note that this means that if there is a match at the beginning of
> a (non-empty) string, the first element of the output is '""', but
> if there is a match at the end of the string, the output is the
> same as with the match removed.
> "

The algorithm, the comment after it, and your results are consistent. Whether it is intuitive is a matter of taste. I agree it's not as symmetric as one might like.

> If the pattern matches, (second if above), the match is added to the
> output, and removed from the input -- which after this step is the empty
> string;

Close. The string to the left of the match, "dolly", is added to the output. I agree, the input is now the empty string.

> in the next step, there is no match (else above), so the rest of
> the input string (= the empty string) *should* be added, but it is not
> what happens.

No, in the next step, the string is empty (first 'if' above), and we break. The else branch never applies in your example.

> (I see no good
> reason for including the empty string at the beginning but not at the
> end of the output; no other language I know would do that this way)

I checked Perl, and it does exactly the same:

    print join "==", split / /, " hello dolly "
    ==hello==dolly

(that's 3 elements: "", "hello", and "dolly").

Cheers,
/Christian
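
To make the walk-through above concrete, here is a minimal Python sketch of the repeat loop quoted from the man page. It is only a transliteration of that pseudocode for a fixed (non-regex) separator, not R's actual implementation. The function name `split_like_manual` and the `trace` list are hypothetical, added purely for illustration.

```python
def split_like_manual(s, sep):
    """Follow the repeat loop quoted from the strsplit man page.

    Assumption: `sep` is a fixed, non-empty string (no regex matching).
    Returns the output pieces plus a trace of which branch ran on each pass.
    """
    if sep == "":
        raise ValueError("sep must be non-empty in this sketch")
    out, trace = [], []
    while True:
        if s == "":                    # "if the string is empty: break"
            trace.append("empty")
            break
        i = s.find(sep)
        if i != -1:                    # "if there is a match"
            out.append(s[:i])          # add the string left of the match
            s = s[i + len(sep):]       # remove the match and all to its left
            trace.append("match")
        else:                          # "else: add the string, break"
            out.append(s)
            trace.append("no match")
            break
    return out, trace


pieces, trace = split_like_manual(" hello dolly ", " ")
print(pieces)  # ['', 'hello', 'dolly']
print(trace)   # ['match', 'match', 'match', 'empty']
```

In this sketch the loop for " hello dolly " ends in the "string is empty" check, so the else branch is never reached and no trailing "" is added. The else branch only runs when some text remains after the last match, as in "hello dolly".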