Dear all,

I have a text file with lines like this:

> dput(x[9])
"PYedehYev: 300 s Z?va~?: 2.160 kg"
> dput(x[11])
"et odYezko: 3 \fas odYezku: 15 s"

I am able to extract some numbers, but others give me a headache.

gsub("^.*[^:] (\\d+.\\d+).*$", "\\1", x[9])
works for 300

gsub("^.*[:] (\\d+.\\d+).*$", "\\1", x[9])
works for 2.160

gsub("^.*: (\\d+).*$", "\\1", x[11])
works for 3

but only after many attempts did I find that

gsub("^.*[^:] (\\d+).*$", "\\1", x[11])
works for 15

Can somebody explain to me why the second item in line 11 requires almost the same regular expression as the first item in line 9, and why

gsub("^.*[:] (\\d+).*$", "\\1", x[11])

does not produce 15 but 3?

Cheers
Petr

Personal data: Information about processing and protection of business partners' personal data is available on the website: https://www.precheza.cz/en/personal-data-protection-principles/
Confidentiality: This email and any documents attached to it may be confidential and are subject to the legally binding disclaimer: https://www.precheza.cz/en/01-disclaimer/
On Tue, 16 Oct 2018 08:36:27 +0000
PIKAL Petr <petr.pikal at precheza.cz> wrote:

> > dput(x[11])
> "et odYezko: 3 \fas odYezku: 15 s"

> gsub("^.*: (\\d+).*$", "\\1", x[11])
> works for 3

This regular expression matches only a single space between the colon and the number, but you have more than one of them before "15".

> gsub("^.*[^:] (\\d+).*$", "\\1", x[11])
> works for 15

The match succeeds because a space is not a colon:

^.* matches "et odYezko: 3 \fas odYezku: "
[^:] matches space " "
space " " matches another space " "
finally, (\\d+) matches "15"
and .*$ matches " s"

If you need just the numbers, you might have more success by extracting matches directly with gregexpr and regmatches:

(
 function(s) regmatches(
  s,
  gregexpr("\\d+(\\.\\d+)?", s)
 )
)("et odYezko: 3 \fas odYezku: 15 s")

[[1]]
[1] "3"  "15"

(I'm creating an anonymous function and evaluating it immediately because I need to pass the same string to both gregexpr and regmatches.)

If you need to capture numbers appearing in a specific context, a regular expression better suited to your needs might be

":\\s*(\\d+(?:\\.\\d+)?)"

(A colon, followed by optional whitespace, followed by a number to capture, consisting of digits optionally followed by a non-captured group of a dot and more digits)

but I couldn't find a way to extract captures from repeated matches using vanilla R pattern matching (it's either regexec, which returns captures for the first match only, or gregexpr, which returns all matches but without the captures). If you can load the stringr package, it's very easy, though:

str_match_all(
 c(
  "PYedehYev: 300 s Z?va~?: 2.160 kg",
  "et odYezko: 3 \fas odYezku: 15 s"
 ),
 ":\\s*(\\d+(?:\\.\\d+)?)"
)
[[1]]
     [,1]      [,2]
[1,] ": 300"   "300"
[2,] ": 2.160" "2.160"

[[2]]
     [,1]   [,2]
[1,] ": 3"  "3"
[2,] ": 15" "15"

Column 2 of each list item contains the requested captures.

--
Best regards,
Ivan
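The step-by-step walk-through in the reply above can be checked with a few lines of R. This is only a sketch: the archived copy of the strings does not show how many spaces follow each colon, so the spacing used here (one space before "3", three before "15") is an assumption chosen to agree with the explanation, and the variable names are made up for illustration.

```r
# ASSUMPTION: one space after the first colon, three after the second.
# The real x[11] may differ; only "more than one space before 15" is stated.
s <- "et odYezko: 3 \fas odYezku:   15 s"

# Colon, exactly one space, digits: only ": 3" fits, so the greedy ^.*
# has to back off to the first colon.
first <- gsub("^.*: (\\d+).*$", "\\1", s)

# Non-colon character, one space, digits: the character before the last
# space is itself a space, so the run of spaces before "15" fits.
second <- gsub("^.*[^:] (\\d+).*$", "\\1", s)

# Allowing any amount of whitespace after the colon lets the greedy ^.*
# run up to the last colon in the string.
last <- gsub("^.*:\\s*(\\d+).*$", "\\1", s)

print(c(first, second, last))
```

Under the assumed spacing the three calls give "3", "15" and "15". The third pattern is not from the thread; it is included only to show that tolerating a variable number of spaces removes the asymmetry between the two numbers.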
Hi

Thanks a lot for your insightful answer. I will need to study it in detail; gregexpr and regexpr seem to be quite handy for what I need.

Cheers
Petr

> -----Original Message-----
> From: Ivan Krylov <krylov.r00t at gmail.com>
> Sent: Tuesday, October 16, 2018 11:08 AM
> To: PIKAL Petr <petr.pikal at precheza.cz>
> Cc: r-help at r-project.org
> Subject: Re: [R] regexp mystery
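For readers who cannot load stringr, the captures discussed in this thread can also be recovered in base R with a two-step workaround: find every full match with gregexpr and regmatches, then strip the colon and whitespace from each match with sub. This is a minimal sketch, not a method proposed in the thread; the helper name extract_after_colon is hypothetical, and the input strings are copied as they appear in the archived messages.

```r
# Hypothetical helper: return the numbers that follow a colon in each string.
extract_after_colon <- function(s) {
  # perl = TRUE is required for the non-capturing group (?:...)
  pattern <- ":\\s*(\\d+(?:\\.\\d+)?)"
  # One character vector of full matches (e.g. ": 300") per input string
  hits <- regmatches(s, gregexpr(pattern, s, perl = TRUE))
  # Drop the leading colon and whitespace, leaving only the number
  lapply(hits, function(m) sub("^:\\s*", "", m, perl = TRUE))
}

x <- c("PYedehYev: 300 s Z?va~?: 2.160 kg",
       "et odYezko: 3 \fas odYezku: 15 s")
nums <- extract_after_colon(x)
print(nums)
```

The result is a list with one character vector per input string, which can be passed through as.numeric if numeric values are needed. Newer R releases (4.1.0 and later, as far as I know) also provide gregexec, which returns the capture groups for all matches directly.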