Omar André Gonzáles Díaz
2018-Feb-21 05:19 UTC
[R] regex for "[2440810] / www.tinyurl.com/hgaco4fha3"
Hi, I need help cleaning this string:

"[2440810] / www.tinyurl.com/hgaco4fha3"

My desired output is:

"[2440810] / tinyurl"

My attempts:

stringa <- "[2440810] / www.tinyurl.com/hgaco4fha3"
b <- sub('^www.', '', stringa)  # wanted to get rid of the "www." part, up to the first dot
b <- sub('[.].*', '', b)        # clean from ".com" to the end
b  # returns "[2440810] / www"

Thank you.
Ulrik Stervbo
2018-Feb-21 06:15 UTC
[R] regex for "[2440810] / www.tinyurl.com/hgaco4fha3"
Hi Omar,

you are almost there... but! Your first substitution looks for 'www' at the start of the line, followed by any character (the unescaped '.'). Your string does not start with 'www', so that call does nothing. Your second substitution then removes everything from the first '.' it finds, which is the one right after 'www'.

What you want to do is

x <- "[2440810] / www.tinyurl.com/hgaco4fha3"
y <- sub('www\\.', '', x)  # Note the escape of '.'
y <- sub('\\..*', '', y)
y

Alternatively, all in one (if all addresses are .com):

gsub("(www\\.|\\.com.*)", "", x)

And the same using stringr:

library(stringr)
x %>% str_replace_all("(www\\.|\\.com.*)", "")

HTH
Ulrik
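To make the point about the '^' anchor concrete, here is a small base-R sketch that checks each step on the string from the question. The variable names (`no_change`, `step1`, `step2`) are only illustrative and do not come from the thread.

```r
x <- "[2440810] / www.tinyurl.com/hgaco4fha3"

# '^' anchors the match at the start of the string, and x starts with "[",
# so the original first pattern never matches and sub() returns x unchanged.
no_change <- sub("^www.", "", x)
identical(no_change, x)

# Without the anchor, and with the dot escaped, "www." is removed.
step1 <- sub("www\\.", "", x)
step1

# Now the first literal dot is the one before "com", so everything from
# there to the end is dropped.
step2 <- sub("\\..*", "", step1)
step2
```

The second call only gives the desired result because the "www." dot has already been removed; run on the original string it would cut at the dot after "www" instead.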
These are always kind of fun, not least because of the variety of different replies that "work" at least somewhat. Here's mine:

> stringa <- "[2440810] / www.tinyurl.com/hgaco4fha3"
> sub("^(.+)www\\.(.+)\\.com.+","\\1\\2",stringa)
[1] "[2440810] / tinyurl"

Note the use of doubled backslashes to escape the regex metacharacters. See ?regexp for details.

Cheers,
Bert
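As a rough illustration of how the capture-group approach behaves on more than one string: sub() is vectorized, and an element that does not match the pattern comes back unchanged. In the sketch below only the first element of `urls` is from the thread; the other two are made-up test inputs, and `general_pattern` is a hypothetical variant that does not rely on ".com", not something proposed by the posters.

```r
urls <- c(
  "[2440810] / www.tinyurl.com/hgaco4fha3",  # from the original question
  "[1234567] / www.example.org/abc",         # made-up: not a .com address
  "[7654321] / www.tinyurl.com"              # made-up: nothing after ".com"
)

# Pattern from the reply above: it requires ".com" followed by at least
# one more character, so only the first element matches.
com_pattern <- "^(.+)www\\.(.+)\\.com.+"
res_com <- sub(com_pattern, "\\1\\2", urls)
res_com

# Hypothetical variant: keep the prefix and the first label after "www.",
# then drop everything from the next dot onward.
general_pattern <- "^(.*)www\\.([^./]+)\\..*$"
res_general <- sub(general_pattern, "\\1\\2", urls)
res_general
```

Which variant is preferable depends on the real data: the first is stricter and leaves unexpected inputs visibly untouched, while the second accepts any top-level domain.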