Dear R-Users,

I tried the following 3 Regex expressions in R 4.3:

strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
# "a" "bc" "," "def" "," "" "adef" "," "," "gh"

strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
# "a" "bc" "," "def" "," "" "adef" "," "," "gh"

strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
# "a" "bc" "," "def" "," "" "adef" "," "," "gh"

Is this correct?

I feel that:
- none should return (after "def"): ",", "";
- the first one could also return "", "," (but probably not; not fully sure about this);

Sincerely,

Leonard

On Thu, 4 May 2023 23:59:33 +0300
Leonard Mada via R-help <r-help at r-project.org> wrote:

> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
>
> Is this correct?

Perl seems to return the results you expect:

$ perl -E '
 say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
 for (
  qr[ |(?=,)|(?<=,)(?![ ])],
  qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
  qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
)'
(?^u: |(?=,)|(?<=,)(?![ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"

The same thing happens when I ask R to replace the separators instead of splitting by them:

sapply(setNames(nm = c(
 " |(?=,)|(?<=,)(?![ ])",
 " |(?<! )(?=,)|(?<=,)(?![ ])",
 " |(?<! )(?=,)|(?<=,)(?=[^ ])")
), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
#               |(?=,)|(?<=,)(?![ ])         |(?<! )(?=,)|(?<=,)(?![ ])
# "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh"
#        |(?<! )(?=,)|(?<=,)(?=[^ ])
# "a[]bc[],[]def[],[]adef[],[],[]gh"

I think that something strange happens when the delimiter pattern matches more than once in the same place:

gsub(
 '(?=<--)|(?<=-->)', '[]',
 'split here --><-- split here',
 perl = TRUE
)
# [1] "split here -->[]<-- split here"

(Both Perl's split() and s///g agree with R's gsub() here, although I would have accepted "split here -->[][]<-- split here" too.)

On the other hand, the following doesn't look right:

strsplit(
 'split here --><-- split here',
 '(?=<--)|(?<=-->)',
 perl = TRUE
)
# [[1]]
# [1] "split here -->" "<"              "-- split here"

The "<" is definitely not followed by "<--", and the rightmost "--" is definitely not preceded by "-->". Perhaps strsplit() incorrectly advances the match position after one match?

--
Best regards,
Ivan

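Ivan's guess about the match position can be made concrete with a small emulation. This is only a sketch and not R's actual source. It assumes that strsplit() searches just the not-yet-consumed remainder of the string after each match (so a lookbehind cannot see text that was already consumed), and that an empty match at the very start of that remainder consumes exactly one character as a token. The helper name split_emulated() is made up for illustration. The last lines show one possible workaround that relies on gsub(), which this thread reports as agreeing with Perl: mark the separators first, then split on the marker as a fixed string. The marker "\x1f" is an arbitrary choice and assumes that this character does not occur in the input.

```r
# Hypothetical helper, not part of R: emulates the ASSUMED search loop.
split_emulated <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      # No further match: the remainder is the last token.
      out <- c(out, x)
      break
    }
    # Index of the last matched character; start - 1 for an empty match.
    end <- start + attr(m, "match.length") - 1L
    if (end > 0L) {
      # The match ends beyond the start of the remainder: emit the text
      # before it and continue with the text after it. Consumed text is
      # invisible to lookbehinds in later searches.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at the very start: emit one character and move on.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

s <- 'split here --><-- split here'
split_emulated(s, '(?=<--)|(?<=-->)')
# "split here -->" "<" "-- split here"

# Possible workaround: mark the separators with gsub(), then split on the
# marker as a fixed string (assumption: "\x1f" is absent from the input).
marked <- gsub('(?=<--)|(?<=-->)', '\x1f', s, perl = TRUE)
strsplit(marked, '\x1f', fixed = TRUE)[[1]]
# "split here -->" "<-- split here"
```

Under the same assumptions, split_emulated("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])") also yields the empty string after the second "," that Leonard observed, because the space following that comma is matched at the very start of the remainder.
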
Leonard,

It can be helpful to spell out your intent in English; otherwise some of us have to go back to the documentation to remember what some of the operators do.

Your text being searched seems to be an example of items between commas, with an optional space after some commas and, in one case, nothing between commas.

So what is your goal for the example, and in general? At the end you mention, a bit unclearly, some of what you expect, and I think it would be clearer if you also showed exactly the output you want.

I saw some other replies that addressed what you wanted and am going to reply in another direction. Why do things the hard way using things like lookahead or lookbehind? Would several steps get you the result more clearly?

For the sake of argument, you either want what reading in a CSV file would supply, or something else. Since you are not simply splitting on commas, it sounds like something else. But what else, exactly?

Something as simple as this, splitting on just a comma, produces results that include empty strings and embedded leading or trailing spaces:

strsplit("a bc,def, adef ,,gh", ",")
[[1]]
[1] "a bc"   "def"    " adef " ""       "gh"

That can of course be handled by, for example, trimming the result after unlisting the odd way strsplit returns results:

library("stringr")
str_squish(unlist(strsplit("a bc,def, adef ,,gh", ",")))
[1] "a bc" "def"  "adef" ""     "gh"

Now do you want the empty string to be something else, such as an NA? That can be done too with another step.

And a completely different variant can be used to read in your one-line CSV as text using standard overkill tools:

> read.table(text="a bc,def, adef ,,gh", sep=",")
    V1  V2     V3 V4 V5
1 a bc def  adef  NA gh

Note that read.table() returns a one-row data frame here rather than a character vector.

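The extra step for turning empty fields into NA, mentioned above, could look like the following in base R. This is only a sketch: it uses trimws() instead of stringr::str_squish(), so whitespace inside a field is left alone, and it passes fixed = TRUE so that the comma is treated as a literal string rather than as a regular expression.

```r
x <- "a bc,def, adef ,,gh"

# Split on a literal comma; fixed = TRUE bypasses the regex engine.
fields <- strsplit(x, ",", fixed = TRUE)[[1]]

# Remove leading and trailing whitespace from each field.
fields <- trimws(fields)

# Turn empty fields into missing values.
fields[fields == ""] <- NA_character_

fields
# "a bc" "def" "adef" NA "gh"
```
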
But if you simply want to reassemble your initial string cleaned up a bit, you can use paste to put back the commas, as in a variation of the earlier example:

> paste(str_squish(unlist(strsplit("a bc,def, adef ,,gh", ","))), collapse=",")
[1] "a bc,def,adef,,gh"

So my question is whether using advanced methods is really necessary for your case, or even particularly efficient. If efficiency matters, it is often better to use tools that avoid regular expressions, such as paste0(), when they meet your needs.

Of course, unless I know what you are actually trying to do, my remarks may not be useful.

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Leonard Mada via R-help
Sent: Thursday, May 4, 2023 5:00 PM
To: R-help Mailing List <r-help at r-project.org>
Subject: [R] Regex Split?