Suharto Anggono Suharto Anggono
2010-Jul-08 08:15 UTC
[R] strsplit("dia ma", "\\b") splits characterwise
\b is word boundary.
But, unexpectedly, strsplit("dia ma", "\\b") splits character by character.

> strsplit("dia ma", "\\b")
[[1]]
[1] "d" "i" "a" " " "m" "a"

> strsplit("dia ma", "\\b", perl=TRUE)
[[1]]
[1] "d" "i" "a" " " "m" "a"

How can that be?

This is the output of 'gregexpr'.

> gregexpr("\\b", "dia ma")
[[1]]
[1] 1 2 3 4 5 6
attr(,"match.length")
[1] 0 0 0 0 0 0

> gregexpr("\\b", "dia ma", perl=TRUE)
[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0

The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect. I expect 'strsplit' to split at those points.

This is in Windows. R was installed from binary.

> sessionInfo()
R version 2.11.1 (2010-05-31)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

R 2.8.1 shows the same 'strsplit' behavior, but the behavior of default 'gregexpr' (i.e. perl=FALSE) is different.

> strsplit("dia ma", "\\b")
[[1]]
[1] "d" "i" "a" " " "m" "a"

> strsplit("dia ma", "\\b", perl=TRUE)
[[1]]
[1] "d" "i" "a" " " "m" "a"

> gregexpr("\\b", "dia ma")
[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0

> gregexpr("\\b", "dia ma", perl=TRUE)
[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
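One way to picture how a zero-length pattern could end up splitting character by character is a toy model in which the search is re-run on whatever is left of the string after each piece is cut off. The function below is only a sketch under that assumption; it is not taken from R's source, and the name split_by_repeated_search is made up for illustration. In this model, every remaining string that starts with a word character has a word boundary at its very first position, so each search finds an empty match there and only one character is consumed.

```r
# Toy model of splitting by repeated search on the remaining string.
# ASSUMPTION: this is an illustrative sketch, not R's actual implementation.
split_by_repeated_search <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start == -1L) {
      # No match left: the rest of the string is the last piece.
      out <- c(out, x)
      break
    }
    if (start + len - 1L > 0L) {
      # The match ends past the start of the string: emit what is to the
      # left of the match, then drop the match and everything before it.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, start + len, nchar(x))
    } else {
      # Empty match at the very first position: emit a single character.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

by_boundary <- split_by_repeated_search("dia ma", "\\b")
by_space <- split_by_repeated_search("dia ma", "\\s")
print(by_boundary)
print(by_space)
```

Under this model, "\\b" yields single characters while "\\s" yields whole words, which is consistent with the strsplit output reported above.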
I guess this is expected behaviour, although counterintuitive. \b matches an empty string at a word boundary, and strsplit appears to treat that zero-length match much like the empty string as split. This means the output you get is the same as

> strsplit("dia ma", "", perl=TRUE)
[[1]]
[1] "d" "i" "a" " " "m" "a"

I'd use the separating character as split in strsplit, e.g.

> strsplit("dia ma", "\\s")
[[1]]
[1] "dia" "ma"

If you need the space in the list as well, you'll have to work around it, I guess.

> test <- as.vector(gregexpr("\\b", "dia ma", perl=TRUE)[[1]])
> test
[1] 1 4 5 7
> apply(embed(test,2),1,function(x) substr("dia ma",x[2],x[1]-1))
[1] "dia" " " "ma"

It would be nice if special characters like \b were recognized by strsplit as well, though.

Cheers
Joris

--
Joris Meys
Statistical consultant
Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
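The gregexpr workaround above can be wrapped in a small reusable function. This is a minimal sketch, assuming that gregexpr with perl=TRUE reports the boundary positions as shown in this thread (1 4 5 7 for "dia ma"); the name split_at_word_boundaries is hypothetical. It also adds the start and end of the string as cut points, so that input beginning or ending with a non-word character is not truncated.

```r
# Split one string at the word-boundary positions reported by
# gregexpr(..., perl = TRUE). Hypothetical helper name; a sketch only.
split_at_word_boundaries <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  n <- nchar(x)
  if (n == 0L) return(character(0))
  pos <- as.vector(gregexpr("\\b", x, perl = TRUE)[[1]])
  pos <- pos[pos > 0L]                      # -1 means "no match"
  # Always cut at the start and just past the end of the string.
  cuts <- sort(unique(c(1L, pos, n + 1L)))
  starts <- cuts[-length(cuts)]
  ends <- cuts[-1L] - 1L
  substring(x, starts, ends)
}

pieces <- split_at_word_boundaries("dia ma")
print(pieces)
```

Because the cut points always cover the whole string, pasting the pieces back together gives the original input.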
Gabor Grothendieck
2010-Jul-08 13:33 UTC
[R] strsplit("dia ma", "\\b") splits characterwise
On Thu, Jul 8, 2010 at 4:15 AM, Suharto Anggono Suharto Anggono <suharto_anggono at yahoo.com> wrote:
> The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect. I expect 'strsplit' to split at those points.

You can use strapply in the gsubfn package to match all words and non-words:

library(gsubfn)
strapply("dia ma", "\\w+|\\W+", c)
# c("dia", " ", "ma")

or all spaces and non-spaces:

strapply("dia ma", "\\s+|\\S+", c)
# c("dia", " ", "ma")
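The same match-everything idea can be sketched in base R without gsubfn, assuming a version of R that provides regmatches (R 2.14.0 or later, so newer than the versions discussed in this thread). The variable names below are only illustrative.

```r
# Extract alternating runs instead of splitting at zero-length matches.
# ASSUMPTION: regmatches() is available (R >= 2.14.0).
x <- "dia ma"

# Runs of word characters or runs of non-word characters.
word_tokens <- regmatches(x, gregexpr("\\w+|\\W+", x, perl = TRUE))[[1]]

# Runs of whitespace or runs of non-whitespace.
space_tokens <- regmatches(x, gregexpr("\\s+|\\S+", x, perl = TRUE))[[1]]

print(word_tokens)
print(space_tokens)
```

Since every match has positive length, this avoids the zero-length match issue altogether.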