thr3ads.net - R help - [R] Regex Split? [May 2023]

Leonard Mada

2023-May-05 21:53 UTC

[R] Regex Split?

Dear Avi,

Punctuation marks are used in various NLP language models. Preserving
the "," is therefore useful in such scenarios, and regular expressions
are useful for accomplishing this (especially if you have sufficient
experience with such expressions).

I only observed an odd behaviour when using strsplit. The example string
is constructed, but it is always wise to test a regular expression
against various scenarios: it is usually hard to predict which special
cases will occur in a specific corpus.

strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
# "a"  "bc"  ","  "def"  ","  ""  "adef"  ","  ","  "gh"

stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?=,)|(?<=,)(?![ ])")
# "a"  "bc"  ","  "def"  ","  "adef"  ""  ","  ","  "gh"

stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?<! )(?=,)|(?<=,)(?![ ])")
# "a"  "bc"  ","  "def"  ","  "adef"  ","  ","  "gh"

# Expected:
# "a"  "bc"  ","  "def"  ","  "adef"  ","  ","  "gh"
# see 2nd instance of stringi::stri_split
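
One possible way to sidestep the question of how strsplit treats zero-length matches is to match the tokens themselves instead of the gaps between them. The lines below are only a sketch in base R, under the assumption that a token is either a single comma or a run of characters that are neither spaces nor commas; the variable names are chosen for illustration.

```r
x <- "a bc,def, adef ,,gh"

# Match tokens rather than split points:
#   [^ ,]+  a run of characters that are neither space nor comma
#   ,       a single comma, kept as its own token
tokens <- regmatches(x, gregexpr("[^ ,]+|,", x))[[1]]

print(tokens)
# "a"  "bc"  ","  "def"  ","  "adef"  ","  ","  "gh"
```

No lookahead or lookbehind is involved here, so the result should not depend on how a particular split function advances past an empty match.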


Sincerely,


Leonard


On 5/5/2023 11:20 PM, avi.e.gross at gmail.com wrote:
> Leonard,
>
> It can be helpful to spell out your intent in English, or some of us have to
> go back to the documentation to remember what some of the operators do.
>
> Your text being searched seems to be an example of items between commas with
> an optional space after some commas and, in one case, nothing between commas.
>
> So what is your goal for the example, and in general? You mention a bit
> unclearly at the end some of what you expect and I think it would be clearer
> if you also showed exactly the output you would want.
>
> I saw some other replies that addressed what you wanted and am going to
> reply in another direction.
>
> Why do things the hard way using things like lookahead or look behind?
> Would several steps get you the result way more clearly?
>
> For the sake of argument, you either want what reading in a CSV file would
> supply, or something else. Since you are not simply splitting on commas, it
> sounds like something else. But what exactly else? Something as simple as
> this on just a comma produces results including empty strings and embedded
> leading or trailing spaces:
>
> strsplit("a bc,def, adef ,,gh", ",")
> [[1]]
> [1] "a bc"   "def"    " adef " ""       "gh"
>
> That can of course be handled by, for example, trimming the result after
> unlisting the odd way strsplit returns results:
>
> library("stringr")
> str_squish(unlist(strsplit("a bc,def, adef ,,gh", ",")))
>
> [1] "a bc" "def"  "adef" ""     "gh"
>
> Now do you want the empty string to be something else, such as an NA? That
> can be done too with another step.
>
> And a completely different variant can be used to read in your one-line CSV
> as text using standard overkill tools:
>
>> read.table(text="a bc,def, adef ,,gh", sep=",")
>      V1  V2     V3 V4 V5
> 1 a bc def  adef  NA gh
>
> The above is a vector of texts. But if you simply want to reassemble your
> initial string cleaned up a bit, you can use paste to put back commas, as in
> a variation of the earlier example:
>
>> paste(str_squish(unlist(strsplit("a bc,def, adef ,,gh", ","))), collapse=",")
> [1] "a bc,def,adef,,gh"
>
> So my question is whether using advanced methods is really necessary for
> your case, or even particularly efficient. If efficiency matters, it is
> often better to use tools that avoid regular expressions, such as paste0(),
> when they meet your needs.
>
> Of course, unless I know what you are actually trying to do, my remarks may
> not be useful.
>
>
>
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Leonard Mada via R-help
> Sent: Thursday, May 4, 2023 5:00 PM
> To: R-help Mailing List <r-help at r-project.org>
> Subject: [R] Regex Split?
>
> Dear R-Users,
>
> I tried the following 3 Regex expressions in R 4.3:
> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    ","    "gh"
>
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    ","    "gh"
>
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    ","    "gh"
>
>
> Is this correct?
>
>
> I feel that:
> - none should return (after "def"): ",", "";
> - the first one could also return "", "," (but probably not; not fully
> sure about this);
>
>
> Sincerely,
>
>
> Leonard
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://eu01.z.antigena.com/l/boS91wizs77ZHrpn6fDgE-TZu7JxUnjyNg_9mZDUsLWLylcL-dhQytfeUHheLHZnKJw-VwwfCd_W4XdAukyKenqYPFzSJmP5FrWmF_wepejCrBByUVa66jUF7wKGiA8LnqB49ZUVq-urjKs272Rl-mj-SE1q7--Xj1UXRol3
> PLEASE do read the posting guide
> https://eu01.z.antigena.com/l/rUS82cEKjOa3tTqQ7yTAXLpuOWG1NttoMdEKDQkk3EZhrLW63rsvJ77vuFxoc44Nwo7BGuQyBzF3bNlYLccamhXBk0shpe_1ZhOeonqIbTm59I58PKOPwwqUt6gLF2fLg3OmstDk7ueraKARO4qpUToOguMdYKyE2_LZnBk7QR
> and provide commented, minimal, self-contained, reproducible code.
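
The multi-step approach described in the quoted message, including the extra step of turning empty fields into NA, might look as follows in base R. This is a sketch only: trimws() stands in for stringr::str_squish() (it strips leading and trailing whitespace but, unlike str_squish(), leaves repeated inner spaces alone), and the variable names are illustrative.

```r
x <- "a bc,def, adef ,,gh"

# Split on literal commas; fixed = TRUE bypasses the regex engine.
fields <- strsplit(x, ",", fixed = TRUE)[[1]]

# Drop the leading/trailing spaces left around each field.
fields <- trimws(fields)

# The extra step: represent empty fields as missing values.
fields[fields == ""] <- NA

print(fields)
# "a bc" "def"  "adef" NA     "gh"
```

Note that this keeps "a bc" as one field and discards the commas, so it answers a different question from the tokenization asked about in the original post.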

Bert Gunter

2023-May-05 22:35 UTC

[R] Regex Split?

Primarily for my own amusement, here is a way to do what I think you wanted
without look-aheads/behinds

strsplit(gsub("([[:punct:]])"," \\1 ","a bc,def, adef,x; ,,gh"), " +")
[[1]]
 [1] "a"    "bc"   ","    "def"  ","    "adef" ","    "x"    ";"
[10] ","    ","    "gh"

I certainly would *not* claim that it is in any way superior to anything
that has already been suggested -- indeed, probably the contrary. But it's
simple (as am I).
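
The pad-then-split idea above could be wrapped in a small helper along these lines. This is a sketch, not part of the original suggestion: the name tokenize_punct is hypothetical, it handles a single string only, and the trimws() call is an addition so that a string starting with punctuation does not produce a leading empty token.

```r
# Hypothetical helper: surround each punctuation mark with spaces,
# then split on runs of spaces.
tokenize_punct <- function(x) {
  padded <- gsub("([[:punct:]])", " \\1 ", x)
  # Without trimws(), a leading punctuation mark would give "" as first token.
  strsplit(trimws(padded), " +")[[1]]
}

res1 <- tokenize_punct("a bc,def, adef,x; ,,gh")
print(res1)
# "a" "bc" "," "def" "," "adef" "," "x" ";" "," "," "gh"

res2 <- tokenize_punct(",a b;")
print(res2)
# "," "a" "b" ";"
```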

Cheers,
Bert

