Tony Breyal
2008-Oct-06 15:45 UTC
[R] Extracting text from html code using the RCurl package.
Dear R-help,

I want to download the text from a web page; however, what I end up with is the HTML code. Is there some option that I am missing in the RCurl package? Or is there another way to achieve this? This is the code I am using:

> library(RCurl)
> my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help'
> html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
> print(html.file)

I thought perhaps the htmlTreeParse() function from the XML package might help, but I just don't know what to do next with it:

> library(XML)
> htmlTreeParse(html.file)

Many thanks for any help you can provide,
Tony Breyal

> sessionInfo()
R version 2.7.2 (2008-08-25)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] XML_1.94-0 RCurl_0.9-4
Martin Morgan
2008-Oct-07 15:57 UTC
[R] Extracting text from html code using the RCurl package.
Hi Tony --

Tony Breyal <tony.breyal at googlemail.com> writes:

> I want to download the text from a web page; however, what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this?
> [...]
> I thought perhaps the htmlTreeParse() function from the XML package
> might help, but I just don't know what to do next with it.

Sounds like you're on the right track. One way is to parse the HTML file into its 'internal' representation, and then use xpathApply() to extract the relevant information (e.g., the third 'p' (paragraph) element from the mark-up):

> html = htmlTreeParse(getURL(my.url), useInternal=TRUE)
Opening and ending tag mismatch: td and font
Unexpected end tag : p
Unexpected end tag : form
> xpathApply(html, "//p[3]", xmlValue)[[1]]
[1] "You can subscribe to the list, or change your existing\r\n\t subscription, in the sections below.\r\n\t"

The 'xpath' is the path from the root of the document, through various nested tags, to tags of the specified type. "//p" says 'start at the root ('/') and look in all sub-nodes (that's the '//') for a 'p' tag'. ?xpathApply is a good starting place, as is http://www.w3.org/TR/xpath, especially http://www.w3.org/TR/xpath#path-abbrev

Martin

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
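A minimal follow-on sketch, using only the functions named above (getURL, htmlTreeParse, xpathApply, xmlValue) and the URL from the original post, collects the text of every paragraph rather than just the third; the exact output will of course depend on the page's current mark-up:

library(RCurl)
library(XML)

my.url <- "https://stat.ethz.ch/mailman/listinfo/r-help"

## parse the downloaded HTML into its 'internal' representation
html <- htmlTreeParse(getURL(my.url), useInternal = TRUE)

## xpathApply() returns one list element per matching 'p' node;
## unlist() collapses these into a plain character vector
paras <- unlist(xpathApply(html, "//p", xmlValue))
head(paras)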
Gabor Grothendieck
2008-Oct-07 16:52 UTC
[R] Extracting text from html code using the RCurl package.
I gather you are using Windows, in which case you could use RDCOMClient or rcom to get it via Internet Explorer, e.g.

library(RDCOMClient)
ie <- COMCreate("InternetExplorer.Application")
URL <- "https://stat.ethz.ch/mailman/listinfo/r-help"
ie$Navigate(URL)
while(ie[["Busy"]]) Sys.sleep(1)
txt <- ie[["document"]][["body"]][["innerText"]]
ie$Quit()

You may need to run this in elevated mode if you are on Vista.

On Mon, Oct 6, 2008 at 11:45 AM, Tony Breyal <tony.breyal at googlemail.com> wrote:

> I want to download the text from a web page; however, what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this?
> [...]
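A small variation on the same approach, assuming RDCOMClient and Internet Explorer are available as above (the helper name and the timeout value here are illustrative, not part of the original reply), wraps the calls in a function so that a page which never finishes loading cannot hang the loop forever:

library(RDCOMClient)

get_page_text <- function(url, timeout = 30) {
  ie <- COMCreate("InternetExplorer.Application")
  on.exit(ie$Quit())                   # close IE even if something fails
  ie$Navigate(url)
  waited <- 0
  ## poll until the page has loaded or the timeout is reached
  while (ie[["Busy"]] && waited < timeout) {
    Sys.sleep(1)
    waited <- waited + 1
  }
  ie[["document"]][["body"]][["innerText"]]
}

txt <- get_page_text("https://stat.ethz.ch/mailman/listinfo/r-help")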
Tony Breyal
2008-Oct-21 15:42 UTC
[R] Extracting text from html code using the RCurl package.
Thank you for your responses, Martin and Gabor, very much appreciated! In case anyone does a search for this topic, I thought I'd write a few comments below on what I have ended up doing.

re: Internet Explorer (IE) - Finding out that R can access IE was a very pleasant surprise! This works very well at extracting the plain text from an HTML-formatted page. The only downsides for me were (1) it is rather slow if you wish to convert lots of HTML files into plain text files, even if the HTML files are already on your computer, and (2) when trying to convert some HTML files, an IE 'pop-up' window may show up and execution cannot continue until that pop-up has been dealt with. There may be ways around this, but I am not aware of them.

## This is an example of the code I used:
library(RDCOMClient)
urls <- c("https://stat.ethz.ch/mailman/listinfo/r-help",
          "http://wiki.r-project.org/rwiki/doku.php?id=getting-started:what-is-r:what-is-r")
ie <- COMCreate("InternetExplorer.Application")
txt <- list()
for(u in urls) {
  ie$Navigate(u)
  while(ie[["Busy"]]) Sys.sleep(1)
  txt[[u]] <- ie[["document"]][["body"]][["innerText"]]
}
ie$Quit()
print(txt)

re: xpathApply() - I must admit that this was a little confusing when I first encountered it after reading your post, but after some reading I think I have found out how to get what I want. This seems to work almost as well as IE above, but I have found it to be faster for my purposes, probably because there is no need to wait for an external application, plus there is no danger of a 'pop-up' window showing. As far as I can tell, all plain text is extracted.

library(RCurl)
library(XML)
urls <- c("https://stat.ethz.ch/mailman/listinfo/r-help",
          "http://wiki.r-project.org/rwiki/doku.php?id=getting-started:what-is-r:what-is-r")
html.files <- txt <- list()
html.files <- getURL(urls, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
for(u in urls) {
  html <- htmlTreeParse(html.files[[u]], useInternal=TRUE)
  txt[[u]] <- toString(xpathApply(html,
      "//body//text()[not(ancestor::script)][not(ancestor::style)]",
      xmlValue))
}
print(txt)

Cheers,
Tony Breyal

On 6 Oct, 16:45, Tony Breyal <tony.bre... at googlemail.com> wrote:

> I want to download the text from a web page; however, what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this?
> [...]
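Since the slow step mentioned above is converting HTML files that are already on disk, a minimal sketch in the same XPath style (the local file names are hypothetical, purely for illustration) applies htmlTreeParse() directly to local files and skips the network entirely:

library(XML)

html.paths <- c("page1.html", "page2.html")   # hypothetical local file names
txt <- lapply(html.paths, function(p) {
  ## parse a file already on disk into its internal representation
  doc <- htmlTreeParse(p, useInternal = TRUE)
  ## keep only text nodes in the body that are not inside script or style tags
  toString(xpathApply(doc,
      "//body//text()[not(ancestor::script)][not(ancestor::style)]",
      xmlValue))
})
names(txt) <- html.paths
print(txt)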