thr3ads.net - R help - [R] strsplit("dia ma", "\\b") splits characterwise [Jul 2010]


Suharto Anggono Suharto Anggono

2010-Jul-08 08:15 UTC

[R] strsplit("dia ma", "\\b") splits characterwise

\b is a word boundary.
But, unexpectedly, strsplit("dia ma", "\\b") splits character by character.
> strsplit("dia ma", "\\b")[[1]]
[1] "d" "i" "a" " " "m" "a"
> strsplit("dia ma", "\\b", perl=TRUE)[[1]]
[1] "d" "i" "a" " " "m" "a"


How can that be?

This is the output of 'gregexpr'.
> gregexpr("\\b", "dia ma")[[1]]
[1] 1 2 3 4 5 6
attr(,"match.length")
[1] 0 0 0 0 0 0
> gregexpr("\\b", "dia ma", perl=TRUE)[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0


The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect.
I expect 'strsplit' to split at those points.

This is in Windows. R was installed from binary.
> sessionInfo()
R version 2.11.1 (2010-05-31)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base



R 2.8.1 shows the same 'strsplit' behavior, but the behavior of default
'gregexpr' (i.e. perl=FALSE) is different.
> strsplit("dia ma", "\\b")[[1]]
[1] "d" "i" "a" " " "m" "a"
> strsplit("dia ma", "\\b", perl=TRUE)[[1]]
[1] "d" "i" "a" " " "m" "a"
> gregexpr("\\b", "dia ma")[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0
> gregexpr("\\b", "dia ma", perl=TRUE)[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0
> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252


attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Joris Meys

2010-Jul-08 13:07 UTC


I guess this is expected behaviour, although counterintuitive. \b
represents an empty string indicating a word boundary, but is coerced
to character and thus simply the empty string. This means the output
you get is the same as
> strsplit("dia ma", "", perl=TRUE)[[1]]
[1] "d" "i" "a" " " "m" "a"

I'd use the separating character as the split in strsplit, e.g.
> strsplit("dia ma", "\\s")[[1]]
[1] "dia" "ma"
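
One possible reading of the character-by-character result, sketched here as an
assumption rather than taken from the R sources: if strsplit searches for the
pattern again in whatever remains of the string after each piece is cut off,
then \b matches with zero length at the start of every remainder that begins
with a word character, and a single character is taken each time. The helper
emulate_strsplit below is hypothetical; it only imitates that loop with regexpr.

```r
# Hypothetical emulation of a "search, cut, search again" splitting loop.
# Assumption: an empty match at the very start of the remainder consumes
# exactly one character; any other match cuts off the text before it.
emulate_strsplit <- function(x, split) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(split, x, perl = TRUE)
    pos <- as.integer(m)
    len <- attr(m, "match.length")
    if (pos == -1) {
      # No match left: the remainder is the last piece.
      out <- c(out, x)
      break
    }
    if (pos == 1 && len == 0) {
      # Empty match at the start: take one character and move on.
      out <- c(out, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    } else {
      # Keep the text left of the match, drop the match itself.
      out <- c(out, substr(x, 1, pos - 1))
      x <- substr(x, pos + len, nchar(x))
    }
  }
  out
}

emulate_strsplit("dia ma", "\\b")
emulate_strsplit("dia ma", "\\s")
```

Under that assumption the first call gives one piece per character, and the
second gives "dia" and "ma", which is the pattern seen in the outputs above.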

If you need the space in the list as well, you'll have to work around it, I
guess.
> test <- as.vector(gregexpr("\\b", "dia ma", perl=TRUE)[[1]])
> test
[1] 1 4 5 7
> apply(embed(test,2),1,function(x) substr("dia ma",x[2],x[1]-1))
[1] "dia" " "   "ma"
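
The same idea can be wrapped in a small function. This is only a sketch: the
name split_at_boundaries is made up for illustration, it assumes a single
non-empty string, and it adds the first position and nchar(x) + 1 to the cut
points itself in case gregexpr does not report them.

```r
# Hypothetical helper: cut one non-empty string at the zero-length
# positions reported by gregexpr("\\b", ..., perl = TRUE).
split_at_boundaries <- function(x) {
  pos <- as.vector(gregexpr("\\b", x, perl = TRUE)[[1]])
  pos <- pos[pos > 0]  # gregexpr returns -1 when nothing matches
  # Always include the start of the string and one past its end.
  cuts <- sort(unique(c(1, pos, nchar(x) + 1)))
  # Each piece runs from one cut point to just before the next.
  substring(x, head(cuts, -1), tail(cuts, -1) - 1)
}

split_at_boundaries("dia ma")
```

With the positions shown above (1 4 5 7) the pieces are "dia", " " and "ma".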

It would be nice if special characters like \b were recognized by
strsplit as well, though.

Cheers
Joris



-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

Gabor Grothendieck

2010-Jul-08 13:33 UTC


On Thu, Jul 8, 2010 at 4:15 AM, Suharto Anggono Suharto Anggono
<suharto_anggono at yahoo.com> wrote:
> \b is word boundary.
> But, unexpectedly, strsplit("dia ma", "\\b") splits character by character.
> The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect.
> I expect 'strsplit' to split at those points.

You can use strapply in the gsubfn package to match all words and non-words:

library(gsubfn)
strapply("dia ma", "\\w+|\\W+", c)     # c("dia", " ", "ma")

or all spaces and non-spaces:

strapply("dia ma", "\\s+|\\S+", c)     # c("dia", " ", "ma")

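For readers without gsubfn, the same "match the pieces instead of splitting"
idea can be sketched in base R with regmatches and gregexpr. This is an
addition, not part of the reply above, and it assumes R 2.14.0 or later, where
regmatches is available (newer than the versions used in this thread).

```r
x <- "dia ma"

# Runs of word characters or runs of non-word characters.
words_and_gaps <- regmatches(x, gregexpr("\\w+|\\W+", x, perl = TRUE))[[1]]

# Runs of whitespace or runs of non-whitespace.
spaces_and_rest <- regmatches(x, gregexpr("\\s+|\\S+", x, perl = TRUE))[[1]]

words_and_gaps
spaces_and_rest
```

For "dia ma" both patterns give the three pieces "dia", " " and "ma".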