I have a file containing "words" like

  a
  a/b
  a/b/c

where there may be multiple words on a line (separated by spaces). The a, b, and c strings can contain non-space, non-slash characters. I'd like to use gsub() to extract the c strings (which should be empty if there are none).

A real example is

  "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"

which I'd like to transform to

  " 587 587 587 587"

Another real example is

  "f 1067 28680 24462"

which should transform to "   ".

I've tried a few different regexprs, but am unable to find a way to say "transform words by deleting everything up to and including the 2nd slash" when there might be zero, one or two slashes. Any suggestions?

Duncan Murdoch
Hi Duncan,

why not split on / and take the correct elements? It is not as elegant as regex but could do the trick.

Best,
Ulrik

On Mon, 9 Oct 2017 at 17:03 Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> I've tried a few different regexprs, but am unable to find a way to say
> "transform words by deleting everything up to and including the 2nd
> slash" when there might be zero, one or two slashes. Any suggestions?
>
> Duncan Murdoch
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Hi Duncan,

You can try this:

  library(readr)

  f <- function(s) {
    t <- unlist(readr::tokenize(paste0(gsub(" ",",",s),"\n",collapse="")))
    i <- grep("[a-zA-Z0-9]*/[a-zA-Z0-9]*/",t)
    u <- sub("[a-zA-Z0-9]*/[a-zA-Z0-9]*/","",t[i])
    paste0(u,collapse=" ")
  }

  f("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587")
  # "587 587 587 587"

  f("f 1067 28680 24462")
  # ""

HTH,
Eric

On Mon, Oct 9, 2017 at 6:23 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:
> why not split on / and take the correct elements? It is not as elegant as
> regex but could do the trick.
> On 9 Oct 2017, at 17:02 , Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>
> I've tried a few different regexprs, but am unable to find a way to say "transform words by deleting everything up to and including the 2nd slash" when there might be zero, one or two slashes. Any suggestions?

I think you might need something like this:

  s <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
  l <- strsplit(s, " ")[[1]]
  pat <- "[[:alnum:]]*/[[:alnum:]]*/([[:alnum:]]*)"
  paste(ifelse(grepl(pat,l),gsub(pat, "\\1", l), ""), collapse=" ")

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
  > x <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
  > gsub("(^| *)([^/ ]*/?){0,2}", "\\1", x)
  [1] " 587 587 587 587"
  > y <- "aa aa/ aa/bb aa/bb/ aa/bb/cc aa/bb/cc/ aa/bb/cc/dd aa/bb/cc/dd/"
  > gsub("(^| *)([^/ ]*/?){0,2}", "\\1", y)
  [1] "    cc cc/ cc/dd cc/dd/"

Bill Dunlap
TIBCO Software
wdunlap tibco.com
The pattern "(^| +)([^/ ]*/?){0,2}", i.e. with the first "*" replaced by "+", would be a bit better.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Oct 9, 2017 at 8:50 AM, William Dunlap <wdunlap at tibco.com> wrote:
> > gsub("(^| *)([^/ ]*/?){0,2}", "\\1", x)
> [1] " 587 587 587 587"
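For reference, here is a minimal sketch of the amended pattern applied to both of Duncan's sample lines. With " *" the first group can also match the empty string, so a match is not forced to start at the beginning of a word; " +" requires at least one space (or the start of the string). The variable names and the use of perl = TRUE are assumptions added here, not part of Bill's message; perl = TRUE is used only so that the result follows PCRE's matching rules rather than depending on the default engine.

```r
# Bill's amended pattern: a word must be preceded by the start of the
# string or by one or more spaces; up to two "field/" chunks are dropped.
pat <- "(^| +)([^/ ]*/?){0,2}"

x <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
z <- "f 1067 28680 24462"

# perl = TRUE is an assumption added for this sketch (the thread used
# the default regex engine).
res_x <- gsub(pat, "\\1", x, perl = TRUE)
res_z <- gsub(pat, "\\1", z, perl = TRUE)

print(res_x)  # the third fields, each preceded by a space
print(res_z)  # only the separating spaces remain
```

The replacement "\\1" puts back whatever the first group matched, so the separating spaces survive while the first two fields of each word are deleted.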
On 09/10/2017 11:23 AM, Ulrik Stervbo wrote:
> Hi Duncan,
>
> why not split on / and take the correct elements? It is not as elegant
> as regex but could do the trick.

Thanks for the suggestion. There are likely many thousands of lines of data like the two real examples (which had about 5000 and 60000 lines respectively), so I was thinking that would be too slow, as it would involve nested strsplit() calls. But in fact, it's not so bad, so I might go with it. Here's a stab at it:

  # lines: the lines to be split, e.g. the lines starting with "f" in
  # http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726
  l2 <- strsplit(lines, " ")
  l3 <- lapply(l2, function(x) {
    y <- strsplit(x, "/")
    sapply(y, function(z) if (length(z) == 3) z[3] else "")
  })

Duncan
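As a self-contained illustration, the stab above can be run on the two sample lines from the original question. This is a sketch only: the two-element `lines` vector stands in for the real file, and the final step that pastes each line's fields back into one string is an addition here, not part of Duncan's message.

```r
# The two sample lines from the original question stand in for the file.
lines <- c("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587",
           "f 1067 28680 24462")

# Split each line into words, then each word into its "/"-separated fields.
l2 <- strsplit(lines, " ", fixed = TRUE)
l3 <- lapply(l2, function(x) {
  y <- strsplit(x, "/", fixed = TRUE)
  # Keep the third field when a word has exactly three fields, else "".
  vapply(y, function(z) if (length(z) == 3) z[3] else "", character(1))
})

# Added step (an assumption of this sketch): collapse each line's fields
# back into a single space-separated string.
result <- vapply(l3, paste, character(1), collapse = " ")
print(result)
```

Using vapply() instead of sapply() keeps the return type fixed at character, which matters if a line ever splits into zero words.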
How about this (I'm showing it as a pipe because it's easier to read that way):

  library(magrittr)
  "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587" %>%
    strsplit(' ') %>%
    unlist %>%
    sub('^[^/]*/*','',.) %>%
    sub('^[^/]*/*','',.) %>%
    paste(collapse = ' ')

Georges Monette

--
Georges Monette, PhD P.Stat.(SSC) | Associate Professor. Faculty of Science, Department of Mathematics & Statistics | North 626 Ross Building | York University | 4700 Keele Street, Toronto, ON M3J 1P3 | Telephone: 416-736-5250 | Fax: 416-736-5757 | E-Mail: georges at yorku.ca
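For readers who prefer base R, the same two-pass idea can be sketched without the pipe. The function name `drop_two_fields` and the variable names are hypothetical, introduced only for this illustration; the pattern is the one from the pipe above.

```r
# Hypothetical helper: remove the first two "/"-separated fields of every
# word on a line, using the same sub() pattern applied twice.
drop_two_fields <- function(line) {
  words <- unlist(strsplit(line, " ", fixed = TRUE))
  for (i in 1:2) {
    # Each pass deletes one leading field plus the slash(es) after it.
    words <- sub("^[^/]*/*", "", words)
  }
  paste(words, collapse = " ")
}

samples <- c("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587",
             "f 1067 28680 24462")
res <- vapply(samples, drop_two_fields, character(1), USE.NAMES = FALSE)
print(res)
```

One caveat about the pattern: "/*" consumes a whole run of slashes, so a word with an empty middle field such as a//c would come out empty. Using "/?" instead would keep the c in that case.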
> On Oct 9, 2017, at 6:08 PM, Georges Monette <georges at yorku.ca> wrote:
>
> How about this (I'm showing it as a pipe because it's easier to read that way):
>
> library(magrittr)
> "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587" %>%
>   strsplit(' ') %>%
>   unlist %>%
>   sub('^[^/]*/*','',.) %>%
>   sub('^[^/]*/*','',.) %>%
>   paste(collapse = ' ')

I'm old school R, so I don't find that particularly readable. I read the later specification as saying each line began with an f, so the fourth item after an strsplit becomes the target. This seemed more readable to me:

  Lines <- readLines(url("http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726"))
  lines <- Lines[ grepl("^f", Lines) ]
  str(lines)
  # chr [1:62908] "f 14327 6959 18747" "f 8258 15598 18980" "f 27662 21871 21939" ...

  l2 <- strsplit(lines, " ")  # in that file the separators were spaces
  l3 <- sapply(l2[1:3], function(x) { if (length(x) == 4) x[4] else "" })
  l3
  #[1] "18747" "18980" "21939"

  # Remove the `[1:3]` to get the entire result.

Best;
David.

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.' -Gehm's Corollary to Clarke's Third Law