Dear All,

Can someone please show me how to extract certain parts of a long HTML string? For example:

"<td><a href='2005-01.html'>2005-01</a></td><td><a href='2006-01.html'>2006-01</a></td><td><a href='2007-01.html'>2007-01</a></td><td><a href='2008-01.html'>2008-01</a></td><td><a href='2009-01.html'>2009-01</a></td>"

How can I get only "2005-01.html", "2006-01.html", "2007-01.html", "2008-01.html" and "2009-01.html" from the HTML above? I have tried the gsub function, but could not get it to work.

Please guide me on this.

Thanks a lot.

Rene
Try using the XML package:

Lines <- "<td><a href='2005-01.html'>2005-01</a></td><td><a href='2006-01.html'>2006-01</a></td><td><a href='2007-01.html'>2007-01</a></td><td><a href='2008-01.html'>2008-01</a></td><td><a href='2009-01.html'>2009-01</a></td>"
library(XML)
xpathApply(htmlParse(Lines), "//a", xmlAttrs)

On Wed, Sep 23, 2009 at 9:29 AM, Rene <kaixinmalea at gmail.com> wrote:
> Dear All,
>
> Can someone please guide me how to get the certain part from a long html
> language?
>
> e.g.
>
> "<td><a href='2005-01.html'>2005-01</a></td><td><a
> href='2006-01.html'>2006-01</a></td><td><a
> href='2007-01.html'>2007-01</a></td><td><a
> href='2008-01.html'>2008-01</a></td><td><a
> href='2009-01.html'>2009-01</a></td>"
>
> How to get only the wording of "2005-01.html", "2006-01.html",
> "2007-01.html", "2008-01.html", "2009-01.html" from the above html code? I
> have tried to use gsub function, but not working.
>
> Please guide me on this.
>
> Thanks a lot.
>
> Rene.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
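The xpathApply call above returns a list with one attribute vector per <a> node. If a plain character vector of file names is wanted instead, something along these lines may be more convenient. This is only a sketch: it assumes the XML package is installed, passes asText = TRUE explicitly so the string is not mistaken for a file name, and the variable name hrefs is made up for illustration.

```r
library(XML)

Lines <- "<td><a href='2005-01.html'>2005-01</a></td><td><a href='2006-01.html'>2006-01</a></td><td><a href='2007-01.html'>2007-01</a></td><td><a href='2008-01.html'>2008-01</a></td><td><a href='2009-01.html'>2009-01</a></td>"

# Parse the string as HTML content rather than as a path to a file.
doc <- htmlParse(Lines, asText = TRUE)

# xpathSApply simplifies the result to a vector;
# xmlGetAttr picks out the single attribute named "href" from each <a> node.
hrefs <- xpathSApply(doc, "//a", xmlGetAttr, "href")
print(hrefs)
```

Asking for the attribute by name avoids relying on href being the first (or only) attribute of each link.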
Hi,

The R4X package can help you. (I have wrapped your td's into one tr.)

> x <- xml( "<tr><td><a href='2005-01.html'>2005-01</a></td><td><a
+ href='2006-01.html'>2006-01</a></td><td><a
+ href='2007-01.html'>2007-01</a></td><td><a
+ href='2008-01.html'>2008-01</a></td><td><a
+ href='2009-01.html'>2009-01</a></td></tr>" )
> x["td/a/#"]
       td        td        td        td        td
"2005-01" "2006-01" "2007-01" "2008-01" "2009-01"
> x["td/a/@href"]
            td             td             td             td             td
"2005-01.html" "2006-01.html" "2007-01.html" "2008-01.html" "2009-01.html"

Romain

-- 
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://tr.im/ztCu : RGG #158:161: examples of package IDPmisc
|- http://tr.im/yw8E : New R package : sos
`- http://tr.im/y8y0 : search the graph gallery from R
Maybe you could modify the following to suit your situation (I use this XPath expression to get links from Google):

?htmlTreeParse
?getNodeSet

> library(XML)
> link <- 'http://www.google.co.uk/search?hl=en&client=firefox-a&rls=org.mozilla:en-GB:official&hs=2XR&ei=mxa6SojjOeaMjAfJkcDuBQ&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=Doctor+Who&spell=1'
> html <- htmlTreeParse(link, useInternalNodes = TRUE, error = function(...){})
> nodes <- getNodeSet(html, "//a[@href][@class='l']")
> sapply(nodes, function(x) xmlAttrs(x)[[1]])
 [1] "http://www.bbc.co.uk/doctorwho/"
 [2] "http://www.bbc.co.uk/doctorwho/classic/"
 [3] "http://en.wikipedia.org/wiki/Doctor_Who"
 [4] "http://www.youtube.com/watch?v=LF2x5IKxmAQ"
 [5] "http://www.youtube.com/watch?v=DnKNupdSH8g"
 [6] "http://www.telegraph.co.uk/culture/tvandradio/doctor-who/6199603/Doctor-Who-Top-10-fans-vote-for-all-time-best-episode.html"
 [7] "http://www.google.com/hostednews/ap/article/ALeqM5i17A4FXTLhJX10-sCbhhnhdqY9HwD9ASO6A00"
 [8] "http://www.telegraph.co.uk/news/newstopics/celebritynews/6200053/Doctor-Who-star-David-Tennant-voted-pupils-dream-head-teacher.html"
 [9] "http://www.imdb.com/title/tt0436992/"
[10] "http://www.imdb.com/title/tt0056751/"
[11] "http://www.gallifreyone.com/"
[12] "http://www.doctorwho.co.uk/"
[13] "http://www.drwhoguide.com/"
[14] "http://www.bbcamerica.com/content/123/index.jsp"
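Since the original question mentions trying gsub, here is a minimal base-R sketch that needs no extra packages. It assumes every link is written exactly as in the sample, i.e. href followed by a single-quoted value; for arbitrary HTML a real parser, such as the XML package used in the replies, is the safer choice. The variable names matches and hrefs are made up for illustration.

```r
Lines <- "<td><a href='2005-01.html'>2005-01</a></td><td><a href='2006-01.html'>2006-01</a></td><td><a href='2007-01.html'>2007-01</a></td><td><a href='2008-01.html'>2008-01</a></td><td><a href='2009-01.html'>2009-01</a></td>"

# gregexpr locates every href='...' occurrence; regmatches extracts the
# matched substrings (one list element per input string, hence [[1]]).
matches <- regmatches(Lines, gregexpr("href='[^']*'", Lines))[[1]]

# Strip the leading href=' and the trailing quote from each match.
hrefs <- gsub("^href='|'$", "", matches)
print(hrefs)
```

gsub on its own replaces text inside the string and always returns a string of the same length as its input, which is why it is awkward for pulling out several matches; gregexpr with regmatches returns each match as a separate element.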