thr3ads.net - R help - [R] Regular expression help [Oct 2017]

Duncan Murdoch

2017-Oct-09 15:02 UTC

[R] Regular expression help

I have a file containing "words" like


a

a/b

a/b/c

where there may be multiple words on a line (separated by spaces). The 
a, b, and c strings can contain non-space, non-slash characters. I'd 
like to use gsub() to extract the c strings (which should be empty if 
there are none).

A real example is

"f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"

which I'd like to transform to

" 587 587 587 587"

Another real example is

"f 1067 28680 24462"

which should transform to "   ".

I've tried a few different regexprs, but am unable to find a way to say 
"transform words by deleting everything up to and including the 2nd 
slash" when there might be zero, one or two slashes. Any suggestions?

Duncan Murdoch

Ulrik Stervbo

2017-Oct-09 15:23 UTC

Hi Duncan,

why not split on / and take the correct elements? It is not as elegant as
regex but could do the trick.

Best,
Ulrik


Eric Berger

2017-Oct-09 15:34 UTC

Hi Duncan,
You can try this:

library(readr)
f <- function(s) {
  t <- unlist(readr::tokenize(paste0(gsub(" ",",",s),"\n",collapse="")))
  i <- grep("[a-zA-Z0-9]*/[a-zA-Z0-9]*/",t)
  u <- sub("[a-zA-Z0-9]*/[a-zA-Z0-9]*/","",t[i])
  paste0(u,collapse=" ")
}

f("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587")
# "587 587 587 587"

f("f 1067 28680 24462")
# ""

HTH,
Eric



peter dalgaard

2017-Oct-09 15:45 UTC

I think you might need something like this:

s <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
l <- strsplit(s, " ")[[1]]
pat <- "[[:alnum:]]*/[[:alnum:]]*/([[:alnum:]]*)"
paste(ifelse(grepl(pat, l), gsub(pat, "\\1", l), ""), collapse = " ")

-pd
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
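
The split-then-match idea above can be wrapped in a function. This is a minimal sketch, not part of the original message: the helper name third_field is hypothetical, and the pattern is anchored and uses [^/] instead of [[:alnum:]] on the assumption that the a, b, and c strings may contain any non-space, non-slash characters, as the question states.

```r
# Hypothetical helper (not from the thread): return the third "/"-separated
# field of each space-separated word, or "" when the word is not of the
# form a/b/c.
third_field <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  # Anchored pattern: exactly two slashes, third field captured.
  pat <- "^[^/]*/[^/]*/([^/]*)$"
  out <- ifelse(grepl(pat, words), sub(pat, "\\1", words), "")
  paste(out, collapse = " ")
}

third_field("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587")
# [1] " 587 587 587 587"
third_field("f 1067 28680 24462")
# [1] "   "
```

Because the pattern is anchored at both ends, a word with more than two slashes (such as a/b/c/d) yields an empty string here, which may or may not be what the data needs.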

William Dunlap

2017-Oct-09 15:50 UTC

> x <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
> gsub("(^| *)([^/ ]*/?){0,2}", "\\1", x)
[1] " 587 587 587 587"
> y <- "aa aa/ aa/bb aa/bb/ aa/bb/cc aa/bb/cc/ aa/bb/cc/dd aa/bb/cc/dd/"
> gsub("(^| *)([^/ ]*/?){0,2}", "\\1", y)
[1] "    cc cc/ cc/dd cc/dd/"


Bill Dunlap
TIBCO Software
wdunlap tibco.com


William Dunlap

2017-Oct-09 16:06 UTC

"(^| +)([^/ ]*/?){0,2}", with the first "*" replaced by "+", would be a bit better.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
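
As an illustration only (this is not output posted in the thread), the revised pattern can be checked against the two targets from the original question. The variable names z, res_x, and res_z are assumptions made for this sketch; the comments describe how each part of the pattern is meant to work.

```r
# Revised pattern from the message above:
#   (^| +)       start of the string or a run of spaces, kept via \\1
#   ([^/ ]*/?)   one field with an optional trailing slash
#   {0,2}        at most two such fields are removed per word
pat <- "(^| +)([^/ ]*/?){0,2}"

x <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
z <- "f 1067 28680 24462"

res_x <- gsub(pat, "\\1", x)
res_z <- gsub(pat, "\\1", z)

# Compare against the targets stated in the original question.
identical(res_x, " 587 587 587 587")
identical(res_z, "   ")
```

Requiring at least one space (or the start of the string) before the fields means a match can only begin at a word boundary, so the pattern cannot start matching in the middle of a word.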


Duncan Murdoch

2017-Oct-09 16:15 UTC

On 09/10/2017 11:23 AM, Ulrik Stervbo wrote:
> Hi Duncan,
> 
> why not split on / and take the correct elements? It is not as elegant 
> as regex but could do the trick.
Thanks for the suggestion.  There are likely many thousands of lines of 
data like the two real examples (which had about 5000 and 60000 lines 
respectively), so I was thinking that would be too slow, as it would 
involve nested strsplit() calls.  But in fact, it's not so bad, so I 
might go with it.  Here's a stab at it:

# `lines` holds the lines to be split, e.g. the lines starting with "f" in
# http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726

l2 <- strsplit(lines, " ")
l3 <- lapply(l2, function(x) {
         y <- strsplit(x, "/")
         sapply(y, function(z) if (length(z) == 3) z[3] else "")
       })

Duncan
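
For a quick check of the nested strsplit() approach, here is a minimal sketch that runs it on the two example lines from the original post and pastes the pieces back into one string per line. The variable names words, thirds, and result are assumptions for this sketch, and vapply() is swapped in for sapply() so the result type is fixed.

```r
# Example input: the two lines quoted in the original question.
lines <- c("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587",
           "f 1067 28680 24462")

# Split each line into words, then each word into "/"-separated fields.
words <- strsplit(lines, " ", fixed = TRUE)
thirds <- lapply(words, function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  # Keep the third field only when the word has exactly three fields.
  vapply(parts, function(z) if (length(z) == 3) z[3] else "", character(1))
})

# Re-join the extracted fields, one output string per input line.
result <- vapply(thirds, paste, character(1), collapse = " ")
result
# [1] " 587 587 587 587" "   "
```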

Georges Monette

2017-Oct-10 01:08 UTC

How about this (I'm showing it as a pipe because it's easier to read 
that way):

library(magrittr)
"f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587" %>%
  strsplit(' ') %>%
  unlist %>%
  sub('^[^/]*/*','',.) %>%
  sub('^[^/]*/*','',.) %>%
  paste(collapse = ' ')

Georges Monette

-- 
Georges Monette, PhD P.Stat.(SSC) | Associate Professor. Faculty of Science,
Department of Mathematics & Statistics | North 626 Ross Building | York
University | 4700 Keele Street, Toronto, ON M3J 1P3 | Telephone: 416-736-5250 |
Fax: 416-736-5757 | E-Mail: georges at yorku.ca
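
The same two-pass idea can also be written without magrittr. This is a minimal sketch in base R, not from the original message: the helper name drop_first_field is hypothetical, and it uses /? rather than /* so that each pass removes at most one slash.

```r
# Hypothetical helper: remove the leading field and at most one following slash.
drop_first_field <- function(w) sub("^[^/]*/?", "", w)

s <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
words <- strsplit(s, " ", fixed = TRUE)[[1]]

# Two passes strip "a/" and then "b/"; words with fewer fields end up empty.
result <- paste(drop_first_field(drop_first_field(words)), collapse = " ")
result
# [1] " 587 587 587 587"
```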



David Winsemius

2017-Oct-10 16:09 UTC

> On Oct 9, 2017, at 6:08 PM, Georges Monette <georges at yorku.ca> wrote:
> 
> How about this (I'm showing it as a pipe because it's easier to read that way):
> 
> library(magrittr)
> "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587" %>%
>   strsplit(' ') %>%
>   unlist %>%
>   sub('^[^/]*/*','',.) %>%
>   sub('^[^/]*/*','',.) %>%
>   paste(collapse = ' ')
I'm old school R, so I don't find that particularly readable. I read the
later specification as saying each line began with an f, so the fourth item
after an strsplit becomes the target.

This seemed more readable to me:

Lines <- readLines(url("http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726"))
lines <- Lines[ grepl("^f", Lines) ]

str(lines)
# chr [1:62908] "f 14327 6959 18747" "f 8258 15598 18980" "f 27662 21871 21939" ...

l2 <- strsplit(lines, " ")  # in that file the separators were spaces
l3 <- sapply(l2[1:3], function(x) { if (length(x) == 4) x[4] else "" })
l3
#[1] "18747" "18980" "21939"

# Remove the `[1:3]` to get the entire result.


Best;
David.
David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.' 
-Gehm's Corollary to Clarke's Third Law
