thr3ads.net - R help - [R] regex [Sep 2019]

Ivan Calandra

2019-Sep-17 06:48 UTC

[R] regex

Dear useRs,

I still have problems using regular expressions. I have two problems for 
which I have found workarounds, but I'm sure there are better ways of 
doing it.

1) list CSV files with "_w_" in the name

Here is a sample of the files in the folder:
myfiles <- c("BU-072_1_E1_RE_SEC-01_local_a_0.2_0.2.csv", 
"BU-072_1_E1_RE_SEC-01_local_a_0.2_0.6.csv","BU-072_1_E1_RE_SEC-01_local_a_0.4_1.0.csv",
"BU-072_1_E1_RE_SEC-01_local_a_1.0_0.2.csv","BU-072_1_E1_RE_SEC-01_local_a_1.0_0.6.csv",
"BU-072_1_E1_RE_SEC-01_local_w_0.2_0.2.csv","BU-072_1_E1_RE_SEC-01_local_w_0.2_0.6.csv",
"BU-072_1_E1_RE_SEC-01_local_w_0.4_1.0.csv","BU-072_1_E1_RE_SEC-01_local_w_1.0_0.2.csv",
"BU-072_1_E1_RE_SEC-01_local_w_1.0_0.6.csv","BU-072_1_E1_RE_SEC-01_local_w_1.0_1.0.csv",
"BU-072_1_E1_RE_SEC-01_local_a_0.2_0.2.xls","BU-072_1_E1_RE_SEC-01_local_a_0.2_0.6.xls",
"BU-072_1_E1_RE_SEC-01_local_a_0.4_1.0.xls","BU-072_1_E1_RE_SEC-01_local_a_1.0_0.2.xls",
"BU-072_1_E1_RE_SEC-01_local_a_1.0_0.6.xls","BU-072_1_E1_RE_SEC-01_local_w_0.2_0.2.xls",
"BU-072_1_E1_RE_SEC-01_local_w_0.2_0.6.xls","BU-072_1_E1_RE_SEC-01_local_w_0.4_1.0.xls",
"BU-072_1_E1_RE_SEC-01_local_w_1.0_0.2.xls","BU-072_1_E1_RE_SEC-01_local_w_1.0_0.6.xls",
"BU-072_1_E1_RE_SEC-01_local_w_1.0_1.0.xls")

Here is what I did:
CSVs <- list.files(path=..., pattern="\\.csv$")
w.files <- CSVs[grep(pattern="_w_", CSVs)]

Of course, what I would like to do is list only the interesting files 
from the beginning, rather than subsetting the whole list of files. In 
other words, having a pattern that includes both "\\.csv$" and "_w_" in
the list.files() call. I tried "_w_&\\.csv$" but it returns an empty
vector.

2) The units of the variables are given in the original headers. I would 
like to extract the units. This is what I did:
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
"angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")
units.var <- gsub(pattern="^.*\\[|\\]$", "", headers)

It seems to me to be overly complicated using gsub(). Isn't there a way to 
extract what is interesting rather than deleting what is not?

Thank you for your help! Best, Ivan

-- 
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra

Ivan Krylov

2019-Sep-17 07:14 UTC

[R] regex

On Tue, 17 Sep 2019 08:48:43 +0200
Ivan Calandra <calandra at rgzm.de> wrote:
> CSVs <- list.files(path=..., pattern="\\.csv$") 
> w.files <- CSVs[grep(pattern="_w_", CSVs)]
> 
> Of course, what I would like to do is list only the interesting files 
> from the beginning, rather than subsetting the whole list of files.
One way to express that would be "_w_.*\\.csv$", meaning that the
filename has to have "_w_" in it, followed by anything (any character
repeated any number of times, including 0), followed by ".csv" at the
end of the line.
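
A minimal sketch of that pattern in action, using a shortened subset of the file names from the original post (the variable names `files` and `w_csv` are only for illustration):

```r
# Three of the file names from the original post: one "_a_" CSV,
# one "_w_" CSV and one "_w_" XLS
files <- c("BU-072_1_E1_RE_SEC-01_local_a_0.2_0.2.csv",
           "BU-072_1_E1_RE_SEC-01_local_w_0.2_0.2.csv",
           "BU-072_1_E1_RE_SEC-01_local_w_0.2_0.2.xls")

# "_w_" somewhere in the name, then any characters,
# then a literal ".csv" at the very end
w_csv <- grep("_w_.*\\.csv$", files, value = TRUE)
print(w_csv)
```

The same pattern string can be passed as the pattern argument of list.files(), so no second subsetting step is needed.
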
> 2) The units of the variables are given in the original headers. I
> would like to extract the units. This is what I did:
> headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
> "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")
> units.var <- gsub(pattern="^.*\\[|\\]$", "", headers)
> 
> It seems to me to be overly complicated using gsub(). Isn't there a way
> to extract what is interesting rather than deleting what is not?
Pure-R way: use regmatches() + regexpr(). Both regmatches and regexpr
take the character vector as an argument, so duplication is hard to
avoid:

units <- regmatches(headers, regexpr('\\[.*\\]', headers))

The stringr package has an str_match() function with a nicer interface:
str_match(headers, '\\[.*\\]') -> units.

Such "greedy" patterns containing ".*" present a few pitfalls, e.g.
looking for text in parentheses using the pattern "\\(.*\\)" in
"...(abc)...(def)..." will match the whole "(abc)...(def)" instead of
single groups "(abc)" and "(def)", but with your examples the pattern
should work as presented. One other option would be to ask for "[",
followed by zero or more characters that are not "]", followed by "]":
'\\[[^]]*\\]'.
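
The difference between the greedy pattern and the negated character class can be checked with a small sketch like the following. The test string is the one from the paragraph above; the variable names and the use of gregexpr() to collect all matches are illustrative choices, not part of the original reply:

```r
x <- "...(abc)...(def)..."

# Greedy: ".*" runs on to the last ")" in the string,
# so the single match spans both groups
greedy <- regmatches(x, regexpr("\\(.*\\)", x))

# Negated class: "[^)]*" cannot cross a ")", so each group
# is matched on its own; gregexpr() returns all matches
groups <- regmatches(x, gregexpr("\\([^)]*\\)", x))[[1]]

# The same idea applied to square brackets in two of the headers
headers <- c("dist to origin on curve [mm]", "angle 1 [degree]")
bracketed <- regmatches(headers, regexpr("\\[[^]]*\\]", headers))

print(greedy)
print(groups)
print(bracketed)
```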

-- 
Best regards,
Ivan

Ivan Krylov

2019-Sep-17 07:25 UTC

[R] regex

On Tue, 17 Sep 2019 10:14:24 +0300
Ivan Krylov <krylov.r00t at gmail.com> wrote:
> '\\[.*\\]'
Sorry, I forgot to take it into account that you don't want the [] in
your units, either. That's still doable, but requires so-called
look-around assertions in the regular expression:

'(?<=\\[).*(?=\\])'

This should match any characters that are preceded by "[" and followed
by "]", but without including the brackets in the match. This requires
passing perl = TRUE to regexpr(). stringr::str_match() understands this
pattern without any additional flags.
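
Put together with the headers from the original post, a minimal sketch of the look-around approach could read as follows (the result name units.var is taken from the question):

```r
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
             "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")

# Look-behind for "[" and look-ahead for "]": the brackets must be
# present but are not part of the match. Needs perl = TRUE.
m <- regexpr("(?<=\\[).*(?=\\])", headers, perl = TRUE)
units.var <- regmatches(headers, m)
print(units.var)
```

One thing to keep in mind: regmatches() drops elements that have no match, so the result can be shorter than the input if some header carries no unit.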

-- 
Best regards,
Ivan

Ivan Calandra

2019-Sep-17 13:39 UTC

[R] regex

Thank you Ivan for your help!

Your solution for the first problem is so simple I didn't even think 
about it!
What I find weird is that "_w_|\\.csv$" works as expected ("OR"), but is
there no way to combine two patterns with an "AND"?
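
For reference, regular expressions have no direct "AND" operator, but two common ways to get the same effect are sketched below. The file names are shortened, hypothetical stand-ins for the real ones. Note that the second option relies on perl = TRUE, which grep() accepts but list.files() does not, so it only applies when filtering an existing vector:

```r
# Hypothetical, shortened file names
files <- c("local_a_0.2.csv", "local_w_0.2.csv", "local_w_0.2.xls")

# Option 1: test each pattern separately and combine with a logical AND
both <- files[grepl("_w_", files) & grepl("\\.csv$", files)]

# Option 2: two look-ahead assertions anchored at the start,
# both of which must succeed (Perl-style regex only)
both_la <- grep("^(?=.*_w_)(?=.*\\.csv$)", files, value = TRUE, perl = TRUE)

print(both)
print(both_la)
```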

Your solution to the second problem is actually unfortunately even more 
complicated to me than the gsub() solution. But I'm glad I can learn 
about regmatches() and regexpr()!
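
A possible middle ground between the two approaches, not suggested in the thread itself, is a capture group with sub(): the pattern describes the whole string, and the replacement "\\1" keeps only the part inside the parentheses. A minimal sketch with the headers from the original post:

```r
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
             "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")

# Match the whole header, capture what sits between "[" and the
# final "]", and replace the header by that captured group
units.var <- sub("^.*\\[(.*)\\]$", "\\1", headers)
print(units.var)
```

Unlike regmatches(), sub() returns a header unchanged when the pattern does not match, so the result always has the same length as the input.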

Best,
Ivan

Richard O'Keefe

2019-Sep-18 11:13 UTC

[R] regex

A little note on quoting in regular expressions.
I find writing \\. when I want a quoted . somewhat confusing,
so I would use the pattern "_w_.*[.]csv$".
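
A short sketch of why the dot needs quoting at all, and that both spellings behave the same. The file names are hypothetical and chosen so that one of them ends in "_csv" rather than ".csv":

```r
# Hypothetical file names; the second one has no real ".csv" extension
files <- c("local_w_0.2.csv", "local_w_0.2_csv", "local_w_0.2.xls")

escaped   <- grepl("_w_.*\\.csv$", files)  # backslash-quoted dot
bracketed <- grepl("_w_.*[.]csv$", files)  # dot inside a character class

# An unquoted "." matches any character, so "_csv" slips through
loose <- grepl("_w_.*.csv$", files)

print(identical(escaped, bracketed))
print(files[loose])
```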

Better still, if you want to match file names,
there is a function glob2rx that converts shell ("glob")
patterns into regular expression patterns.  Thus
> grep(glob2rx("*_w_*.csv"), myfiles, value=TRUE)
[1] "BU-072_1_E1_RE_SEC-01_local_w_0.2_0.2.csv"
[2] "BU-072_1_E1_RE_SEC-01_local_w_0.2_0.6.csv"
[3] "BU-072_1_E1_RE_SEC-01_local_w_0.4_1.0.csv"
[4] "BU-072_1_E1_RE_SEC-01_local_w_1.0_0.2.csv"
[5] "BU-072_1_E1_RE_SEC-01_local_w_1.0_0.6.csv"
[6] "BU-072_1_E1_RE_SEC-01_local_w_1.0_1.0.csv"

So the simplest way to get what you want is
CSVs <- list.files(path=..., pattern=glob2rx("*_w_*.csv"))

In fact ?list.files mentions glob2rx.
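
To see what glob2rx() actually hands to list.files(), the generated regular expression can be printed and tried on a few names. This is only a sketch; the three file names are a subset of the ones in the original post:

```r
files <- c("BU-072_1_E1_RE_SEC-01_local_a_0.2_0.2.csv",
           "BU-072_1_E1_RE_SEC-01_local_w_0.2_0.2.csv",
           "BU-072_1_E1_RE_SEC-01_local_w_0.2_0.2.xls")

# glob2rx() lives in the utils package; it translates the shell
# wildcard "*" into the regex ".*" and quotes the literal dot
rx <- utils::glob2rx("*_w_*.csv")
print(rx)

hits <- grep(rx, files, value = TRUE)
print(hits)
```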
