thr3ads.net - R help - [R] regexp mystery [Oct 2018]

PIKAL Petr

2018-Oct-16 08:36 UTC

[R] regexp mystery

Dear all

I have a text file with lines like this:
> dput(x[9])
"PYedehYev:  300 s                    Z?va~?: 2.160 kg"
> dput(x[11])
"et odYezko: 3                     \fas odYezku:   15 s"

I am able to extract some numbers, but others give me a headache.

gsub("^.*[^:] (\\d+.\\d+).*$", "\\1", x[9])
works for 300

gsub("^.*[:] (\\d+.\\d+).*$", "\\1", x[9])
works for 2.160

gsub("^.*: (\\d+).*$", "\\1", x[11])
works for 3

but only after many attempts I found that
gsub("^.*[^:] (\\d+).*$", "\\1", x[11])
works for 15

Can somebody explain to me why the second item in line 11 requires almost the
same regular expression as the first item in line 9, and why just

gsub("^.*[:] (\\d+).*$", "\\1", x[11])

does not produce 15 but 3???

Cheers
Petr

Personal data: Information about the processing and protection of the personal
data of PRECHEZA a.s. business partners is available at:
https://www.precheza.cz/zasady-ochrany-osobnich-udaju/ (Czech) |
https://www.precheza.cz/en/personal-data-protection-principles/ (English)
Confidentiality: This email and any documents attached to it may be
confidential and are subject to the legally binding disclaimer:
https://www.precheza.cz/01-dovetek/ (Czech) |
https://www.precheza.cz/en/01-disclaimer/ (English)

Ivan Krylov

2018-Oct-16 09:08 UTC


On Tue, 16 Oct 2018 08:36:27 +0000
PIKAL Petr <petr.pikal at precheza.cz> wrote:
> > dput(x[11])  
> "et odYezko: 3                     \fas odYezku:   15 s"
> gsub("^.*: (\\d+).*$", "\\1", x[11])
> works for 3
This regular expression only matches one space between the colon and
the number, but you have more than one of them before "15".
> gsub("^.*[^:] (\\d+).*$", "\\1", x[11])
> works for 15
Match succeeds because a space is not a colon:

 ^.* matches "et odYezko: 3                     \fas odYezku: "
 [^:] matches space " "
 space " " matches another space " "
 finally, (\\d+) matches "15"
 and .*$ matches " s"
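
The walk-through above can be checked directly in R. This is only a minimal sketch: it assumes the line shown by dput(x[11]) is stored in a variable called `s` (a name made up here), and the third pattern, which allows any number of spaces after the colon, is a suggested variation rather than something from the original posts.

```r
# The line shown by dput(x[11]); `s` is a name chosen for this sketch.
s <- "et odYezko: 3                     \fas odYezku:   15 s"

# Exactly one space must sit between the colon and the digits,
# so only ": 3" qualifies.
one_space <- sub("^.*: (\\d+).*$", "\\1", s)

# The greedy ^.* gives back only as much as needed; [^:] then takes
# one of the spaces in front of "15".
not_colon <- sub("^.*[^:] (\\d+).*$", "\\1", s)

# Variation: " +" accepts one or more spaces after the colon, and the
# greedy ^.* makes the last such number win.
many_spaces <- sub("^.*: +(\\d+).*$", "\\1", s)

print(c(one_space, not_colon, many_spaces))
```

Since each pattern is anchored with ^ and $ and can match only once per string, sub() behaves the same as gsub() here.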

If you need just the numbers, you might have more success by extracting
matches directly with gregexpr and regmatches:

(
	function(s) regmatches(
		s,
		gregexpr("\\d+(\\.\\d+)?", s)
	)
)("et odYezko: 3                     \fas odYezku:   15 s")

[[1]]
[1] "3"  "15"

(I'm creating an anonymous function and evaluating it immediately
because I need to pass the same string to both gregexpr and regmatches.)
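
Once the strings live in a variable, the anonymous function is not needed, and the same call handles a whole character vector at once. A minimal sketch, assuming the two lines from the original post are stored in `x` (here a two-element vector built for illustration, not the real file contents):

```r
# Stand-in for the lines read from the text file.
x <- c(
  "PYedehYev:  300 s                    Z?va~?: 2.160 kg",
  "et odYezko: 3                     \fas odYezku:   15 s"
)

# gregexpr() locates every match in each string;
# regmatches() cuts the matched substrings out.
found <- regmatches(x, gregexpr("\\d+(\\.\\d+)?", x))

# Convert each character vector of matches to numbers.
nums <- lapply(found, as.numeric)
print(nums)
```

The result is a list with one numeric vector per input line, so the position of a number within its line is preserved.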

If you need to capture numbers appearing in a specific context, a
better regular expression suiting your needs might be

":\\s*(\\d+(?:\\.\\d+)?)"

(A colon, followed by optional whitespace, followed by a number to
capture: one or more digits, optionally followed by a non-capturing
group consisting of a dot and more digits)

but I couldn't find a way to extract captures from repeated match by
using vanilla R pattern matching (it's either regexec which returns
captures for the first match or gregexpr which returns all matches but
without the captures). If you can load the stringr package, it's very
easy, though:

library(stringr)
str_match_all(
	c(
		"PYedehYev:  300 s              Z?va~?: 2.160 kg",
		"et odYezko: 3               \fas odYezku:   15 s"
	),
	":\\s*(\\d+(?:\\.\\d+)?)"
)
[[1]]
     [,1]      [,2]   
[1,] ":  300"  "300"  
[2,] ": 2.160" "2.160"

[[2]]
     [,1]     [,2]
[1,] ": 3"    "3" 
[2,] ":   15" "15"

Column 2 of each list item contains the requested captures.
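
For readers who cannot load stringr, the captures can also be recovered in base R from the attributes that gregexpr() attaches to its result when called with perl = TRUE. The following is a hedged sketch, not the method proposed in this thread: the variable names `x`, `m` and `captures` are made up for illustration, and it relies on the "capture.start" and "capture.length" attributes being present for the single capture group in the pattern.

```r
# The same two example strings as in the stringr call above.
x <- c(
  "PYedehYev:  300 s              Z?va~?: 2.160 kg",
  "et odYezko: 3               \fas odYezku:   15 s"
)

# With perl = TRUE, each gregexpr() result carries the start and length
# of every capture group as matrix attributes (one row per match).
m <- gregexpr(":\\s*(\\d+(?:\\.\\d+)?)", x, perl = TRUE)

# Cut the first (and only) capture group out of each string.
captures <- unname(Map(function(s, mi) {
  start <- attr(mi, "capture.start")[, 1]
  len <- attr(mi, "capture.length")[, 1]
  substring(s, start, start + len - 1L)
}, x, m))

print(captures)
```

Each list element corresponds to column 2 of the stringr result for the same line. Newer R releases also provide gregexec(), which appears to have been added in R 4.1.0 and returns all matches together with their captures.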

-- 
Best regards,
Ivan

PIKAL Petr

2018-Oct-16 09:23 UTC


Hi

Thanks a lot for your insightful answer. I will need to study it in detail;
gregexpr and regexpr seem to be quite handy for what I need.

Cheers
Petr
