Let's say that I have the following character vector with a series of URL
strings. I'm interested in extracting some information from each string.

url = c("http://www.mdd.com/food/pizza/index.html",
        "http://www.mdd.com/build-your-own/index.html",
        "http://www.mdd.com/special-deals.html",
        "http://www.genius.com/find-a-location.html",
        "http://www.google.com/hello.html")

First, I want to extract the domain name followed by .com. After
struggling with this for a while, reading some regular expression
tutorials, and reading through Stack Overflow, I came up with the
following solution. Perfect!

> parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
> parser(url)
[1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"

Second, I want to extract everything after .com in the original URL.
Unfortunately, I don't know the proper regular expression to get the
desired result. Can anyone help?

The output should be:
/food/pizza/index.html
/build-your-own/index.html
/special-deals.html

If anyone has a solution using the stringr package, that would be of
interest as well.

Thanks.

--
Abraham Mathew
Analytics Strategist
Minneapolis, MN
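Since a stringr-based solution was requested, here is one possible sketch. It assumes the stringr package is installed; str_extract() returns the first match of a pattern in each string, and the fixed-length lookbehind patterns below are my own choice, not something from the thread:

```r
library(stringr)  # assumed to be installed

url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.genius.com/find-a-location.html",
         "http://www.google.com/hello.html")

# domain: the run of non-slash characters immediately after "www."
str_extract(url, "(?<=www\\.)[^/]+")

# path: everything from the first "/" after ".com"
str_extract(url, "(?<=\\.com)/.*")
```

stringr's ICU regex engine supports these fixed-length lookbehinds, so each call returns one matched piece per URL without any splitting.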
Hi,

The XML package has a nice function, parseURI(), that nicely slices and
dices the URL.

library(XML)
parseURI('http://www.mdd.com/food/pizza/index.html')

Might that help?

Cheers,
Ben

On Mar 6, 2014, at 12:23 PM, Abraham Mathew <abmathewks@gmail.com> wrote:
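For anyone without the XML package, the same scheme/server/path decomposition can be sketched in base R with regexec(). The field names below are illustrative choices of mine, not the actual structure XML::parseURI() returns:

```r
# Minimal base-R URL splitter: scheme, server, and path via one
# regex with three capture groups (field names are illustrative)
parse_url_base <- function(x) {
  x <- trimws(x)  # tolerate the stray leading spaces in the example vector
  m <- regexec("^(https?)://([^/]+)(/.*)?$", x)
  parts <- regmatches(x, m)[[1]]
  list(scheme = parts[2], server = parts[3], path = parts[4])
}

p <- parse_url_base("http://www.mdd.com/food/pizza/index.html")
p$server  # "www.mdd.com"
p$path    # "/food/pizza/index.html"
```

regexec() records the positions of the full match and each capture group, and regmatches() extracts them, so the three pieces come out of a single pass over the string.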
Oh, that's perfect. I can just use one of the apply functions to run that
on each URL and then extract the pieces I need.

Thanks!

On Thu, Mar 6, 2014 at 11:52 AM, Ben Tupper <ben.bighair@gmail.com> wrote:
Try:

gsub(".*\\.com", "", url)
[1] "/food/pizza/index.html"     "/build-your-own/index.html"
[3] "/special-deals.html"        "/find-a-location.html"
[5] "/hello.html"

gsub(".*www\\.([[:alpha:]]+\\.com).*", "\\1", url)
#[1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"

A.K.

On Thursday, March 6, 2014 12:37 PM, Abraham Mathew <abmathewks@gmail.com> wrote:
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
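The gsub() patterns above assume the host always ends in ".com". A sketch that instead splits at the first "/" after the host works for any TLD; these patterns are my own, not from the thread:

```r
# Domain (minus "www.") and path for any TLD, fully vectorized.
# Leading whitespace and an optional https scheme are tolerated.
domain_of <- function(x) sub("^\\s*https?://(www\\.)?([^/]+).*$", "\\2", x)
path_of   <- function(x) sub("^\\s*https?://[^/]*", "", x)

u <- c("http://www.mdd.com/food/pizza/index.html",
       "http://www.genius.gov/find-a-location.html")
domain_of(u)  # "mdd.com"    "genius.gov"
path_of(u)    # "/food/pizza/index.html" "/find-a-location.html"
```

Because [^/]+ cannot cross a slash, the second capture group stops at the end of the hostname regardless of its extension.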
There are many ways to do this. Here's a simple version and a slightly
fancier version:

url = c("http://www.mdd.com/food/pizza/index.html",
        "http://www.mdd.com/build-your-own/index.html",
        "http://www.mdd.com/special-deals.html",
        "http://www.genius.com/find-a-location.html",
        "http://www.google.com/hello.html")

url2 = c("http://www.mdd.com/food/pizza/index.html",
         "https://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.edu/special-deals.html",
         "http://www.genius.gov/find-a-location.html",
         "http://www.google.com/hello.html")

parse1 <- function(x) {
  # will work for https as well as http
  x <- sub("^http[s]*:\\/\\/", "", x)
  x <- sub("^www\\.", "", x)
  strsplit(x, "/")[[1]][1]
}

parse2 <- function(x) {
  # if you're sure it will always be .com
  strsplit(x, "\\.com")[[1]][2]
}

parse2a <- function(x) {
  # one way to split at any three-letter extension
  # assumes !S! won't appear in the URLs
  x <- sub("\\.[a-z]{3,3}\\/", "!S!\\/", x)
  strsplit(x, "!S!")[[1]][2]
}

sapply(url, parse1)
sapply(url, parse2)
sapply(url2, parse1)
sapply(url2, parse2a)

Sarah

On Thu, Mar 6, 2014 at 12:23 PM, Abraham Mathew <abmathewks@gmail.com> wrote:
--
Sarah Goslee
http://www.functionaldiversity.org
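Since sub() and gsub() are already vectorized, the parse1() approach above can also be written without the sapply() loop (the url2 vector is repeated here so the snippet stands alone):

```r
# Vectorized variant of parse1(): no per-element sapply() needed,
# because sub() operates on the whole character vector at once
parse1_vec <- function(x) {
  x <- sub("^https?://", "", x)   # drop the scheme (http or https)
  x <- sub("^www\\.", "", x)      # drop a leading "www."
  sub("/.*$", "", x)              # keep everything before the first "/"
}

url2 <- c("http://www.mdd.com/food/pizza/index.html",
          "https://www.mdd.com/build-your-own/index.html",
          "http://www.mdd.edu/special-deals.html",
          "http://www.genius.gov/find-a-location.html",
          "http://www.google.com/hello.html")

parse1_vec(url2)
# [1] "mdd.com"    "mdd.com"    "mdd.edu"    "genius.gov" "google.com"
```

The vectorized form also avoids the strsplit(x, "/")[[1]] pattern, which silently looks at only the first element when given a vector.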