thr3ads.net - R help - [R] regex - extracting src url [Mar 2016]


Omar André Gonzáles Díaz

2016-Mar-22 04:44 UTC

[R] regex - extracting src url

Hi, I have a data frame with a column containing HTML, like this:

<IMG SRC="
https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
BORDER="0" HEIGHT="1" WIDTH="1"
ALT="Advertisement">


I need to get this:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?


I've got this so far:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"
BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\"
ALT=\"Advertisement


This is the code I've used:

carreras_normal$Impression.Tag..image. <-
  gsub("<img.+?src=[\"'](.*?)[\"'].*?>", "\\1",
       carreras_normal$Impression.Tag..image., ignore.case = TRUE)



But I still need to get rid of the part marked with asterisks:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment?*\"
BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\"
ALT=\"Advertisement*


Thank you for your help.

Omar Gonzáles.
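
For comparison, here is a regex-only variant of the attempt above. It is a minimal sketch, not a tested fix for the original data frame: the vector `tags` is made-up sample data standing in for `carreras_normal$Impression.Tag..image.`, and it assumes one IMG tag per element and no quote characters inside the src value. Capturing the URL with a negated character class instead of `.*?` stops the capture at the first closing quote, and `perl = TRUE` with the `(?s)` flag lets `.` span the line breaks inside the tag.

```r
# Hypothetical sample data; the real column is not available here.
tags <- c(
  "<IMG SRC=\"\nhttps://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"\nBORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\"\nALT=\"Advertisement\">",
  "<img width='1' src='http://example.com/pixel.gif'>"
)

# (?i) ignores case, (?s) lets "." match newlines.
# The capture group takes everything up to the first closing quote,
# with surrounding whitespace left outside the group.
pattern <- "(?is).*?<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']\\s*([^\"']*?)\\s*[\"'].*"

# sub() replaces the whole string with the captured URL.
src <- sub(pattern, "\\1", tags, perl = TRUE)
print(src)
```

Elements that contain no IMG tag are returned unchanged by `sub()`, so they would need a separate check (for example with `grepl()`) before being treated as URLs.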


Bert Gunter

2016-Mar-22 05:13 UTC


?strsplit  #I think
My "solution" assumes a fixed format for the URLs, as shown in your example. If that is not the case, it doesn't work.

> y <- '<IMG SRC="https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
+ BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">'
> y  ## checking that the URL is as expected
[1] "<IMG SRC=\"https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"\nBORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement\">"

> lapply(strsplit(y,"\""),"[",2) ## should work on a vector of URLs, y
[[1]]
[1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"



Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
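
The `strsplit()` approach extends to a whole column. The following is a small sketch under the same fixed-format assumption, namely that the URL is the first double-quoted value in each string. The vector `y` is made-up sample data, and `trimws()` is added only because the strings in the original question have a line break right after `SRC="`.

```r
# Hypothetical sample data in the same shape as the question's column.
y <- c(
  "<IMG SRC=\"\nhttps://example.com/a;ord=[timestamp]?\"\nBORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\"\nALT=\"Advertisement\">",
  "<IMG SRC=\"https://example.com/b\" BORDER=\"0\">"
)

# Split each string on double quotes; the second piece is the src value.
parts <- strsplit(y, "\"", fixed = TRUE)

# vapply() returns a plain character vector (NA where no quote is found),
# and trimws() drops the stray newline around the URL.
urls <- trimws(vapply(parts, function(p) p[2], character(1)))
print(urls)
```

Using `vapply()` instead of `lapply()` gives a character vector that can be assigned straight back to a data frame column.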



Martin Morgan

2016-Mar-22 10:27 UTC


You're querying an XML string, so use XPath, e.g., via the XML package:

 > as.character(xmlParse(y)[["//IMG/@SRC"]])
[1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"

`xmlParse()` translates the character string into an XML document. `[[`
subsets the document to extract a single element. "//IMG/@SRC" follows
the XPath specification (the section
https://www.w3.org/TR/xpath-31/#abbrev of the specification provides a
quick guide): starting from the root of the document, it finds a node
labeled IMG, at any depth, and selects its attribute labeled SRC.

A variation, if there were several IMG tags to be extracted, would be

   xpathSApply(xmlParse(y), "//IMG/@SRC", as.character)
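
Applied to a whole column, the XPath idea might look like the sketch below. Two things here are assumptions rather than part of the reply above: the helper name `extract_src()` is invented for illustration, and `htmlParse()` is swapped in for `xmlParse()` because the HTML parser is more forgiving of markup such as an unclosed `<IMG ...>` tag. The HTML parser usually lower-cases element and attribute names, so the XPath expression tries both spellings. The vector `y` is made-up sample data.

```r
# Hypothetical helper: returns the src of the first IMG tag in each string,
# or NA when no IMG tag with a src attribute is found.
extract_src <- function(html) {
  vapply(html, function(h) {
    doc <- XML::htmlParse(h, asText = TRUE)
    # Union of both spellings, in case names are lower-cased on parsing.
    src <- XML::xpathSApply(doc, "//img/@src | //IMG/@SRC", as.character)
    if (length(src) == 0) NA_character_ else trimws(src[[1]])
  }, character(1), USE.NAMES = FALSE)
}

# Only run the demo when the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  y <- c(
    "<IMG SRC=\"\nhttps://example.com/a;ord=[timestamp]?\"\nBORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\"\nALT=\"Advertisement\">",
    "<IMG SRC=\"https://example.com/b\" BORDER=\"0\">"
  )
  print(extract_src(y))
}
```

Parsing one string at a time keeps each result aligned with its row, at the cost of one parser call per element.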

