thr3ads.net - R help - [R] Minimal match to regexp? [Jan 2023]

Duncan Murdoch

2023-Jan-26 00:57 UTC

[R] Minimal match to regexp?

Thanks for pointing out my mistake.  I oversimplified the real problem.

I'll try to post a version of it that comes closer:  Suppose I have a 
string like this:

x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

If I cat() it, I see that it is really markdown source:

   ```html
   blah blah
   ```

   ```r
   blah blah
   ```

I want to find the part that includes the html block, but not the r 
block.  So I want to match "```html", followed by a minimal number of 
characters, then "```".  Then this pattern works:

   pattern <- "\n```html\n.*?\n```\n"

and we get the right answer:

   cat(regmatches(x, regexpr(pattern, x)))

   ```html
   blah blah
   ```

Okay, but this flavour of markdown says there can be more backticks, not 
just 3.  So the block might look like

   ````html
   blah blah
   ````

I need to have the same number of backticks in the opening and closing 
marker.  So I make the pattern more complicated, and it doesn't work:

   pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"

This matches all of x:

   > pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
   > cat(regmatches(x, regexpr(pattern2, x)))

   ```html
   blah blah
   ```

   ```r
   blah blah
   ```


Is that a bug, or am I making a silly mistake again?

Duncan Murdoch
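
One possible workaround avoids the backreference entirely by matching in two steps: first read the opening marker to learn how many backticks it uses, then build a pattern with that fence written out literally. For a three-backtick fence this reduces to the first pattern above, which is reported to work. This is only a sketch; the names `opening`, `fence`, `literal_pattern` and `html_block` are illustrative, and it assumes the string contains an html block whose opening marker sits on its own line.

```r
x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

# Step 1: find the opening marker of the html block and keep only its backticks.
opening <- regmatches(x, regexpr("\n`{3,}html\n", x))
fence <- gsub("[^`]", "", opening)

# Step 2: build a pattern with the fence spelled out literally (no \\1 needed).
literal_pattern <- paste0("\n", fence, "html\n.*?\n", fence, "\n")

# Extract the first (minimal) match, as in the original example.
html_block <- regmatches(x, regexpr(literal_pattern, x))
cat(html_block)
```

Because the closing fence is built from the opening one, both markers are forced to have the same number of backticks without relying on `\\1`.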



On 25/01/2023 7:34 p.m., Andrew Simmons wrote:
> grep(value = TRUE) just returns the strings which match the pattern. You
> have to use regexpr() or gregexpr() if you want to know where the 
> matches are:
> 
> ```
> x <- "abaca"
> 
> # extract only the first match with regexpr()
> m <- regexpr("a.*?a", x)
> regmatches(x, m)
> 
> # or
> 
> # extract every match with gregexpr()
> m <- gregexpr("a.*?a", x)
> regmatches(x, m)
> ```
> 
> You could also use sub() to remove the rest of the string: 
> `sub("^.*(a.*?a).*$", "\\1", x)`
> keeping only the match within the parenthesis.
> 
> 
> On Wed, Jan 25, 2023, 19:19 Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> 
>     The docs for ?regexp say this:  "By default repetition is greedy, so
>     the maximal possible number of repeats is used. This can be changed to
>     'minimal' by appending ? to the quantifier. (There are further
>     quantifiers that allow approximate matching: see the TRE
>     documentation.)"
> 
>     I want the minimal match, but I don't seem to be getting it.  For
>     example,
> 
>     x <- "abaca"
>     grep("a.*?a", x, value = TRUE)
>     #> [1] "abaca"
> 
>     Shouldn't I have gotten "aba", which is the first match to "a.*a"?  If
>     not, what would be the regexp that would give me the first match to
>     "a.*a", without greedy expansion of the .*?
> 
>     Duncan Murdoch

Andrew Simmons

2023-Jan-26 01:38 UTC

[R] Minimal match to regexp?

It seems like a bug to me. Using perl = TRUE, I see the desired result:

```
x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"

cat(regmatches(x, regexpr(pattern2, x, perl = TRUE)))
```

If you change it to something like:

```
x <- c(
    "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n",
    "\n```html\nblah blah \n```\n"
)

pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"

print(regmatches(x, regexpr(pattern2, x)), width = 10)
```

you can see that it does find the match, so the combination of *? and
\\1 must be messing up regexpr(). They seem to work perfectly fine on
their own.
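
One caveat when switching to perl = TRUE: under PCRE the dot does not match a newline by default, so the pattern above succeeds there only because the block body is a single line. The sketch below handles multi-line bodies by adding the inline `(?s)` flag described in ?regex. The four-backtick input `x4` and the names `pattern3` and `block` are made up for illustration.

```r
# Hypothetical input: a four-backtick html block whose body contains a
# three-backtick line, followed by an ordinary r block.
x4 <- "\n````html\nline one\n```\nline two\n````\n\n```r\nblah blah\n```\n"

# (?s) lets "." match newlines under PCRE; \\1 requires the closing marker
# to repeat exactly the backticks captured by the opening marker.
pattern3 <- "(?s)\n(`{3,})html\n.*?\n\\1\n"

m <- regexpr(pattern3, x4, perl = TRUE)
block <- regmatches(x4, m)
cat(block)
```

The lazy `.*?` skips past the inner three-backtick line because it is not followed by a fourth backtick, and stops at the first line that repeats the full opening fence.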

