thr3ads.net - R help - [R] retrieve certain part from html [Sep 2009]


Rene

2009-Sep-23 12:29 UTC

[R] retrieve certain part from html

Dear All,

Can someone please guide me on how to extract a certain part of a long
HTML string?

e.g.

"<td><a
href='2005-01.html'>2005-01</a></td><td><a
href='2006-01.html'>2006-01</a></td><td><a
href='2007-01.html'>2007-01</a></td><td><a
href='2008-01.html'>2008-01</a></td><td><a
href='2009-01.html'>2009-01</a></td>"

How can I get only the values "2005-01.html", "2006-01.html",
"2007-01.html", "2008-01.html", "2009-01.html" from the above HTML code?
I have tried to use the gsub function, but it is not working.

Please guide me on this.

Thanks a lot.

Rene.
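
A note on the gsub attempt: gsub substitutes text inside each input string and returns one string per input element, so by itself it does not hand back each match separately. A minimal base R sketch, assuming the fragment is held in one string (the names html, matches and links are illustrative and not from the original post), uses gregexpr and regmatches to pull out every href value:

```r
# The HTML fragment from the question, written here as one string.
html <- paste0(
  "<td><a href='2005-01.html'>2005-01</a></td>",
  "<td><a href='2006-01.html'>2006-01</a></td>",
  "<td><a href='2007-01.html'>2007-01</a></td>",
  "<td><a href='2008-01.html'>2008-01</a></td>",
  "<td><a href='2009-01.html'>2009-01</a></td>"
)

# Locate every href='...' occurrence; [^']* matches up to the closing quote.
matches <- regmatches(html, gregexpr("href='[^']*'", html))[[1]]

# Strip the leading href=' and the trailing ' to keep only the value.
links <- gsub("^href='|'$", "", matches)
links
```

This works for a small, regular fragment like the one above. For real web pages a proper HTML parser, as suggested in the replies, is usually more robust than a regular expression.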

 



Henrique Dallazuanna

2009-Sep-23 12:39 UTC


Try using the XML package:

Lines <- "<td><a
href='2005-01.html'>2005-01</a></td><td><a
href='2006-01.html'>2006-01</a></td><td><a
href='2007-01.html'>2007-01</a></td><td><a
href='2008-01.html'>2008-01</a></td><td><a
href='2009-01.html'>2009-01</a></td>"

library(XML)
xpathApply(htmlParse(Lines), "//a", xmlAttrs)
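
The call above returns a list with one attribute vector per link. If a plain character vector is wanted, a small variation along the following lines should work. This is a sketch, assuming the XML package is installed; asText = TRUE is added here to state explicitly that the input is HTML content rather than a file name, and the names doc and hrefs are illustrative.

```r
library(XML)

# Same fragment as above, written as one string.
Lines <- paste0(
  "<td><a href='2005-01.html'>2005-01</a></td>",
  "<td><a href='2006-01.html'>2006-01</a></td>",
  "<td><a href='2007-01.html'>2007-01</a></td>",
  "<td><a href='2008-01.html'>2008-01</a></td>",
  "<td><a href='2009-01.html'>2009-01</a></td>"
)

# Parse the string as HTML content, not as a path or URL.
doc <- htmlParse(Lines, asText = TRUE)

# For each <a> node, fetch the value of its href attribute;
# xpathSApply simplifies the result to a character vector.
hrefs <- xpathSApply(doc, "//a", xmlGetAttr, "href")
hrefs
```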

> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

Romain Francois

2009-Sep-23 12:39 UTC


Hi,

The R4X package can help you. (I have wrapped your td's into one tr.)

 > library(R4X)
 > x <- xml( "<tr><td><a
+ href='2005-01.html'>2005-01</a></td><td><a
+ href='2006-01.html'>2006-01</a></td><td><a
+ href='2007-01.html'>2007-01</a></td><td><a
+ href='2008-01.html'>2008-01</a></td><td><a
+ href='2009-01.html'>2009-01</a></td></tr>" )

 > x["td/a/#"]
       td        td        td        td        td
"2005-01" "2006-01" "2007-01" "2008-01" "2009-01"
 > x["td/a/@href"]
            td             td             td             td             td
"2005-01.html" "2006-01.html" "2007-01.html" "2008-01.html" "2009-01.html"

Romain

-- 
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://tr.im/ztCu : RGG #158:161: examples of package IDPmisc
|- http://tr.im/yw8E : New R package : sos
`- http://tr.im/y8y0 : search the graph gallery from R

Tony Breyal

2009-Sep-23 12:43 UTC


Maybe you could modify the following to suit your situation (I use
this XPath expression to get links from Google):

?htmlTreeParse
?getNodeSet
> library(XML)
> link <- 'http://www.google.co.uk/search?hl=en&client=firefox-a&rls=org.mozilla:en-GB:official&hs=2XR&ei=mxa6SojjOeaMjAfJkcDuBQ&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=Doctor+Who&spell=1'
> html <- htmlTreeParse(link, useInternalNodes = TRUE, error=function(...){})
> nodes <- getNodeSet(html, "//a[@href][@class='l']")
> sapply(nodes, function(x) x <- xmlAttrs(x)[[1]])
 [1] "http://www.bbc.co.uk/doctorwho/"
 [2] "http://www.bbc.co.uk/doctorwho/classic/"
 [3] "http://en.wikipedia.org/wiki/Doctor_Who"
 [4] "http://www.youtube.com/watch?v=LF2x5IKxmAQ"
 [5] "http://www.youtube.com/watch?v=DnKNupdSH8g"
 [6] "http://www.telegraph.co.uk/culture/tvandradio/doctor-who/6199603/Doctor-Who-Top-10-fans-vote-for-all-time-best-episode.html"
 [7] "http://www.google.com/hostednews/ap/article/ALeqM5i17A4FXTLhJX10-sCbhhnhdqY9HwD9ASO6A00"
 [8] "http://www.telegraph.co.uk/news/newstopics/celebritynews/6200053/Doctor-Who-star-David-Tennant-voted-pupils-dream-head-teacher.html"
 [9] "http://www.imdb.com/title/tt0436992/"
[10] "http://www.imdb.com/title/tt0056751/"
[11] "http://www.gallifreyone.com/"
[12] "http://www.doctorwho.co.uk/"
[13] "http://www.drwhoguide.com/"
[14] "http://www.bbcamerica.com/content/123/index.jsp"
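
The same htmlTreeParse and getNodeSet pattern can be pointed at the fragment from the question instead of a live URL. The sketch below is one possible adaptation, assuming the XML package is installed. The XPath is loosened to "//a[@href]" because the fragment has no class attributes, asText = TRUE marks the input as HTML content, and the names snippet, doc, nodes and links are illustrative. It collects both the link text and the href into a data frame.

```r
library(XML)

# Fragment from the original question, written as one string.
snippet <- paste0(
  "<td><a href='2005-01.html'>2005-01</a></td>",
  "<td><a href='2006-01.html'>2006-01</a></td>",
  "<td><a href='2007-01.html'>2007-01</a></td>",
  "<td><a href='2008-01.html'>2008-01</a></td>",
  "<td><a href='2009-01.html'>2009-01</a></td>"
)

# Parse the string itself and keep internal nodes so XPath can be used.
doc <- htmlTreeParse(snippet, asText = TRUE, useInternalNodes = TRUE)

# Select every <a> element that carries an href attribute.
nodes <- getNodeSet(doc, "//a[@href]")

# One row per link: the visible text and the target it points to.
links <- data.frame(
  text = sapply(nodes, xmlValue),
  href = sapply(nodes, xmlGetAttr, "href"),
  stringsAsFactors = FALSE
)
links
```

Looking attributes up by name with xmlGetAttr avoids relying on href being the first attribute, which the positional xmlAttrs(x)[[1]] form assumes.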



