Let's say that I have the following character vector with a series of URL
strings. I'm interested in extracting some information from each string.

url = c("http://www.mdd.com/food/pizza/index.html",
        "http://www.mdd.com/build-your-own/index.html",
        "http://www.mdd.com/special-deals.html",
        "http://www.genius.com/find-a-location.html",
        "http://www.google.com/hello.html")

First, I want to extract the domain name followed by .com. After
struggling with this for a while, reading some regular expression
tutorials, and reading through Stack Overflow, I came up with the
following solution. Perfect!

> parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
> parser(url)
[1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"

Second, I want to extract everything after .com in the original URL.
Unfortunately, I don't know the proper regular expression to get the
desired result. Can anyone help?

The output should be:
/food/pizza/index.html
/build-your-own/index.html
/special-deals.html

If anyone has a solution using the stringr package, that would be of
interest as well.

Thanks.

--
Abraham Mathew
Analytics Strategist
Minneapolis, MN
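Since a stringr-based solution was requested, here is one possible sketch. It assumes the stringr package is installed; str_extract() returns the first match of a pattern in each string, and the fixed-length lookbehind patterns below are my own choice, not something from the thread:

```r
library(stringr)  # assumed to be installed

url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.genius.com/find-a-location.html",
         "http://www.google.com/hello.html")

# domain: the run of non-slash characters immediately after "www."
str_extract(url, "(?<=www\\.)[^/]+")

# path: everything from the first "/" after ".com"
str_extract(url, "(?<=\\.com)/.*")
```

stringr's ICU regex engine supports these fixed-length lookbehinds, so each call returns one matched piece per URL without any splitting.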
Hi,

The XML package has a nice function, parseURI(), that nicely slices and
dices the URL.

library(XML)
parseURI('http://www.mdd.com/food/pizza/index.html')

Might that help?

Cheers,
Ben

On Mar 6, 2014, at 12:23 PM, Abraham Mathew <abmathewks@gmail.com> wrote:
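For anyone without the XML package, the same scheme/server/path decomposition can be sketched in base R with regexec(). The field names below are illustrative choices of mine, not the actual structure XML::parseURI() returns:

```r
# Minimal base-R URL splitter: scheme, server, and path via one
# regex with three capture groups (field names are illustrative)
parse_url_base <- function(x) {
  x <- trimws(x)  # tolerate the stray leading spaces in the example vector
  m <- regexec("^(https?)://([^/]+)(/.*)?$", x)
  parts <- regmatches(x, m)[[1]]
  list(scheme = parts[2], server = parts[3], path = parts[4])
}

p <- parse_url_base("http://www.mdd.com/food/pizza/index.html")
p$server  # "www.mdd.com"
p$path    # "/food/pizza/index.html"
```

regexec() records the positions of the full match and each capture group, and regmatches() extracts them, so the three pieces come out of a single pass over the string.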
Oh, that's perfect. I can just use one of the apply functions to run that
on each URL and then extract the pieces I need.

Thanks!

On Thu, Mar 6, 2014 at 11:52 AM, Ben Tupper <ben.bighair@gmail.com> wrote:
Try:

gsub(".*\\.com", "", url)
[1] "/food/pizza/index.html"     "/build-your-own/index.html"
[3] "/special-deals.html"        "/find-a-location.html"
[5] "/hello.html"

gsub(".*www\\.([[:alpha:]]+\\.com).*", "\\1", url)
#[1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"

A.K.

On Thursday, March 6, 2014 12:37 PM, Abraham Mathew <abmathewks@gmail.com> wrote:
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
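The gsub() patterns above assume the host always ends in ".com". A sketch that instead splits at the first "/" after the host works for any TLD; these patterns are my own, not from the thread:

```r
# Domain (minus "www.") and path for any TLD, fully vectorized.
# Leading whitespace and an optional https scheme are tolerated.
domain_of <- function(x) sub("^\\s*https?://(www\\.)?([^/]+).*$", "\\2", x)
path_of   <- function(x) sub("^\\s*https?://[^/]*", "", x)

u <- c("http://www.mdd.com/food/pizza/index.html",
       "http://www.genius.gov/find-a-location.html")
domain_of(u)  # "mdd.com"    "genius.gov"
path_of(u)    # "/food/pizza/index.html" "/find-a-location.html"
```

Because [^/]+ cannot cross a slash, the second capture group stops at the end of the hostname regardless of its extension.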
There are many ways to do this. Here's a simple version and a slightly
fancier version:

url = c("http://www.mdd.com/food/pizza/index.html",
        "http://www.mdd.com/build-your-own/index.html",
        "http://www.mdd.com/special-deals.html",
        "http://www.genius.com/find-a-location.html",
        "http://www.google.com/hello.html")

url2 = c("http://www.mdd.com/food/pizza/index.html",
         "https://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.edu/special-deals.html",
         "http://www.genius.gov/find-a-location.html",
         "http://www.google.com/hello.html")

parse1 <- function(x) {
  # will work for https as well as http
  x <- sub("^http[s]*:\\/\\/", "", x)
  x <- sub("^www\\.", "", x)
  strsplit(x, "/")[[1]][1]
}

parse2 <- function(x) {
  # if you're sure it will always be .com
  strsplit(x, "\\.com")[[1]][2]
}

parse2a <- function(x) {
  # one way to split at any three-letter extension
  # assumes !S! won't appear in the URLs
  x <- sub("\\.[a-z]{3,3}\\/", "!S!\\/", x)
  strsplit(x, "!S!")[[1]][2]
}

sapply(url, parse1)
sapply(url, parse2)
sapply(url2, parse1)
sapply(url2, parse2a)

Sarah

On Thu, Mar 6, 2014 at 12:23 PM, Abraham Mathew <abmathewks@gmail.com> wrote:
--
Sarah Goslee
http://www.functionaldiversity.org
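Since sub() and gsub() are already vectorized, the parse1() approach above can also be written without the sapply() loop (the url2 vector is repeated here so the snippet stands alone):

```r
# Vectorized variant of parse1(): no per-element sapply() needed,
# because sub() operates on the whole character vector at once
parse1_vec <- function(x) {
  x <- sub("^https?://", "", x)   # drop the scheme (http or https)
  x <- sub("^www\\.", "", x)      # drop a leading "www."
  sub("/.*$", "", x)              # keep everything before the first "/"
}

url2 <- c("http://www.mdd.com/food/pizza/index.html",
          "https://www.mdd.com/build-your-own/index.html",
          "http://www.mdd.edu/special-deals.html",
          "http://www.genius.gov/find-a-location.html",
          "http://www.google.com/hello.html")

parse1_vec(url2)
# [1] "mdd.com"    "mdd.com"    "mdd.edu"    "genius.gov" "google.com"
```

The vectorized form also avoids the strsplit(x, "/")[[1]] pattern, which silently looks at only the first element when given a vector.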