thr3ads.net - R help - [R] regex [Sep 2019]

Ivan Calandra

2019-Sep-17 13:39 UTC

[R] regex

Thank you Ivan for your help!

Your solution for the first problem is so simple I didn't even think 
about it!
What I find weird is that "_w_|\\.csv$" works as expected ("OR"), but is
there no way to combine two patterns with an "AND"?

Your solution to the second problem is actually unfortunately even more 
complicated to me than the gsub() solution. But I'm glad I can learn 
about regmatches() and regexpr()!

Best,
Ivan

--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra

On 17/09/2019 09:14, Ivan Krylov wrote:
> On Tue, 17 Sep 2019 08:48:43 +0200
> Ivan Calandra <calandra at rgzm.de> wrote:
>
>> CSVs <- list.files(path=..., pattern="\\.csv$")
>> w.files <- CSVs[grep(pattern="_w_", CSVs)]
>>
>> Of course, what I would like to do is list only the interesting files
>> from the beginning, rather than subsetting the whole list of files.
>
> One way to express that would be "_w_.*\\.csv$", meaning that the
> filename has to have "_w_" in it, followed by anything (any character
> repeated any number of times, including 0), followed by ".csv" at the
> end of the line.
>
>> 2) The units of the variables are given in the original headers. I
>> would like to extract the units. This is what I did:
>>
>> headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
>>              "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")
>> units.var <- gsub(pattern="^.*\\[|\\]$", "", headers)
>>
>> It seems to be overly complicated using gsub(). Isn't there a way
>> to extract what is interesting rather than deleting what is not?
>
> Pure-R way: use regmatches() + regexpr(). Both regmatches and regexpr
> take the character vector as an argument, so duplication is hard to
> avoid:
>
> units <- regmatches(headers, regexpr('\\[.*\\]', headers))
>
> The stringr package has an str_match() function with a nicer interface:
> str_match(headers, '\\[.*\\]') -> units.
>
> Such "greedy" patterns containing ".*" present a few pitfalls, e.g.
> looking for text in parentheses using the pattern "\\(.*\\)" in
> "...(abc)...(def)..." will match the whole "(abc)...(def)" instead of
> single groups "(abc)" and "(def)", but with your examples the pattern
> should work as presented. One other option would be to ask for "[",
> followed by zero or more characters that are not "]", followed by "]":
> '\\[[^]]*\\]'.
>
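The two suggestions quoted above can be tried out on made-up data. This is only a sketch: the file names in `files` and the string `x` are hypothetical and stand in for the output of list.files() and for a header with more than one bracketed part. In real use the same pattern would go straight into the `pattern` argument of list.files().

```r
# Hypothetical file names, standing in for the output of list.files()
files <- c("scan_w_01.csv", "scan_d_01.csv", "notes_w_.txt", "a_w_b.csv.bak")

# "_w_" somewhere, then anything, then ".csv" at the very end of the name
w.files <- grep("_w_.*\\.csv$", files, value = TRUE)
print(w.files)

# Greedy ".*" versus the negated character class "[^]]*"
x <- "...[abc]...[def]..."

# regexpr() finds the first match only; ".*" stretches to the last "]"
greedy <- regmatches(x, regexpr("\\[.*\\]", x))

# gregexpr() finds all matches; "[^]]*" cannot run past a "]"
tight <- regmatches(x, gregexpr("\\[[^]]*\\]", x))[[1]]

print(greedy)
print(tight)
```

Note that both forms keep the surrounding brackets in the result, so an extra step is still needed if only the text inside them is wanted.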

Jeff Newmiller

2019-Sep-17 14:38 UTC

[R] regex

https://stackoverflow.com/questions/3041320/regex-and-operator/37692545
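For readers who want to try an "AND" in R, here is a minimal sketch of two common approaches. It is not taken from the linked page; the file names are hypothetical. The first option avoids the question entirely by combining two logical vectors, the second uses lookaheads, which in base R require perl = TRUE.

```r
# Hypothetical file names
files <- c("scan_w_01.csv", "scan_d_01.csv", "notes_w_.txt")

# Option 1: test each pattern separately and combine the results with &
both1 <- files[grepl("_w_", files) & grepl("\\.csv$", files)]

# Option 2: two lookaheads anchored at the start of the string;
# each one must succeed, in any order, for the string to match
both2 <- grep("^(?=.*_w_)(?=.*\\.csv$)", files, value = TRUE, perl = TRUE)

print(both1)
print(both2)
```

As far as I know, list.files() has no perl argument, so the lookahead form would have to be applied to its result rather than passed as its pattern. When the order of the two parts is known, as it is here, "_w_.*\\.csv$" remains the simpler choice.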

-- 
Sent from my phone. Please excuse my brevity.

Bert Gunter

2019-Sep-17 14:42 UTC

[R] regex

(For the units)

Why not simply:

sub(".*\\[(.+)\\]","\\1", headers)

Cheers,
Bert
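
Applied to the headers from the original question, the sub() call above can be checked against the earlier gsub() version. A small sketch; the string "sample id" is a made-up header without a unit, added only to show what happens when nothing matches.

```r
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
             "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")

# Match the whole string, capture the text inside the brackets,
# and replace the whole match with the captured text
units.sub <- sub(".*\\[(.+)\\]", "\\1", headers)

# The original approach: delete everything that is not the unit
units.gsub <- gsub("^.*\\[|\\]$", "", headers)

print(units.sub)

# Caveat: when the pattern does not match, sub() returns the input unchanged
no.unit <- sub(".*\\[(.+)\\]", "\\1", "sample id")
print(no.unit)
```

Both calls give the same units for these headers. The caveat matters if some headers carry no unit, since the full header text would then be returned instead of an empty string or NA.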



Ivan Calandra

2019-Sep-17 14:46 UTC

[R] regex

Thanks Jeff!
It does indeed make sense that there is no "AND" corresponding to the "|".

Ivan


Ivan Calandra

2019-Sep-17 14:52 UTC

[R] regex

Thank you Bert.
That's more like what I was looking for.

Could you please tell me where I can find information on the "\\1"? This
is the part I still don't get.

Ivan
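
On the "\\1": it is a backreference. In an R string "\\1" stands for the two characters \1, and in the replacement argument of sub() and gsub() it is replaced by whatever the first parenthesized group of the pattern matched ("\\2" for the second group, and so on up to "\\9"). The help pages ?regex and ?sub describe this. A short sketch with two groups, using one of the headers from the thread; the variable names are only for illustration.

```r
x <- "angle 1 [degree]"

# Group 1 is the text before " [", group 2 is the text inside the brackets
pattern <- "^(.*) \\[(.+)\\]$"

name    <- sub(pattern, "\\1", x)       # keep only group 1
unit    <- sub(pattern, "\\2", x)       # keep only group 2
swapped <- sub(pattern, "\\2: \\1", x)  # use both groups, reordered

print(name)
print(unit)
print(swapped)
```

Because the pattern is anchored with ^ and $, the whole string is replaced, which is why only the captured pieces remain in the result.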

