thr3ads.net - R help - [R] Question about XML package (accurately access one attribute in an multi-attribution node on the web page) [Jun 2015]

Humphrey Zhao

2015-Jun-16 13:01 UTC

[R] Question about XML package (accurately access one attribute in an multi-attribution node on the web page)

Dear Sir/Madam:

Thank you for your attention to my question. I have downloaded the source code
of some web pages with RCurl, and I am trying to extract URLs from them. In
these web pages, many nodes contain the same URL, such as the following:

<a href=\"http://cos.name/2015/05/the-data-wisdom-for-data-science/\" rel=\"bookmark\">

<a href=\"http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/\" target=\"_blank\">

<a href=\"http://cos.name/2015/05/the-data-wisdom-for-data-science/#more-10947\" class=\"more-link\">

I want to select exactly the URL I need (the "href" in the first one).
I have tried many ways, and the most accurate is the following:

library(XML)

#links<-getHTMLLinks(base.html, xpQuery = "//a/@href")

links<-getHTMLLinks(base.html, xpQuery = c("//a/href[@rel='bookmark']"))

However, I still believe that there is a correct method to do this well, but I
could not find it. I wonder if you could give me some advice on solving this
problem. I would be most grateful if you could reply at your earliest
convenience. Looking forward to hearing from you. Thank you very much.

                                     Sincerely yours

                                     Humphrey Zhao

Boris Steipe

2015-Jun-16 17:17 UTC

[R] Question about XML package (accurately access one attribute in an multi-attribution node on the web page)

Humphrey -

Any "correct" method requires you to specify _uniquely_ what you are
looking for. If the bookmark keyword is necessary and unique, it appears you
have a working solution. Or what else were you trying to accomplish?

Cheers,
Boris
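
For readers following the thread, here is a minimal sketch of one way to pin down the bookmark link, assuming the page source resembles the three anchors quoted in the question. The `base.html` string below is a hypothetical stand-in built from those anchors, not the poster's actual page. In this sketch the `rel` predicate is attached to the `a` element and the path then steps to `@href`. An expression of the form `//a/href[...]` asks for a child element named `href`, which HTML anchors do not have, so it would be expected to match nothing.

```r
library(XML)

# Hypothetical stand-in for the downloaded page source (base.html in the thread),
# assembled from the three anchors quoted in the question.
base.html <- paste0(
  '<html><body>',
  '<a href="http://cos.name/2015/05/the-data-wisdom-for-data-science/" rel="bookmark">post</a>',
  '<a href="http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/" target="_blank">ref</a>',
  '<a href="http://cos.name/2015/05/the-data-wisdom-for-data-science/#more-10947" class="more-link">more</a>',
  '</body></html>'
)

# Parse the string explicitly as HTML text rather than as a file name or URL.
doc <- htmlParse(base.html, asText = TRUE)

# Option 1: select <a> elements whose rel attribute is "bookmark",
# then read the href attribute of each matching node.
links <- xpathSApply(doc, "//a[@rel='bookmark']", xmlGetAttr, "href")

# Option 2: the same selection through getHTMLLinks(), with the XPath
# stepping from the filtered <a> element to its href attribute.
links2 <- getHTMLLinks(doc, xpQuery = "//a[@rel='bookmark']/@href")

print(links)
print(links2)
```

Both calls rely on `rel="bookmark"` being unique to the wanted link, which is the condition raised in the reply above. If other anchors on the real page also carry that attribute, the predicate would need a further restriction.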


On Jun 16, 2015, at 9:01 AM, Humphrey Zhao <humphrey.zhao at yahoo.com> wrote:
> Dear Sir/Madam:
>
> Thank you for your attention to my question. I have downloaded the source
> code of some web pages with RCurl, and I am trying to extract URLs from them.
> In these web pages, many nodes contain the same URL, such as the following:
>
> <a href=\"http://cos.name/2015/05/the-data-wisdom-for-data-science/\" rel=\"bookmark\">
>
> <a href=\"http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/\" target=\"_blank\">
>
> <a href=\"http://cos.name/2015/05/the-data-wisdom-for-data-science/#more-10947\" class=\"more-link\">
>
> I want to select exactly the URL I need (the "href" in the first one).
> I have tried many ways, and the most accurate is the following:
>
> library(XML)
>
> #links<-getHTMLLinks(base.html, xpQuery = "//a/@href")
>
> links<-getHTMLLinks(base.html, xpQuery = c("//a/href[@rel='bookmark']"))
>
> However, I still believe that there is a correct method to do this well, but
> I could not find it. I wonder if you could give me some advice on solving
> this problem. I would be most grateful if you could reply at your earliest
> convenience. Looking forward to hearing from you. Thank you very much.
>
>                                      Sincerely yours
>
>                                      Humphrey Zhao
