Tony Breyal
2008-Oct-06 15:45 UTC
[R] Extracting text from html code using the RCurl package.
Dear R-help,

I want to download the text from a web page; however, what I end up with is the HTML code. Is there some option that I am missing in the RCurl package? Or is there another way to achieve this? This is the code I am using:

> library(RCurl)
> my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help'
> html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
> print(html.file)

I thought perhaps the htmlTreeParse() function from the XML package might help, but I just don't know what to do next with it:

> library(XML)
> htmlTreeParse(html.file)

Many thanks for any help you can provide,
Tony Breyal

> sessionInfo()
R version 2.7.2 (2008-08-25)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] XML_1.94-0 RCurl_0.9-4
Martin Morgan
2008-Oct-07 15:57 UTC
[R] Extracting text from html code using the RCurl package.
Hi Tony --

Tony Breyal <tony.breyal at googlemail.com> writes:

> I want to download the text from a web page; however, what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this?
> [...]
> I thought perhaps the htmlTreeParse() function from the XML package
> might help, but I just don't know what to do next with it.

Sounds like you're on the right track. One way is to parse the HTML file into its 'internal' representation, and then use xpathApply() to extract the relevant information (e.g., the third 'p' (paragraph) element from the mark-up):

> html = htmlTreeParse(getURL(my.url), useInternal=TRUE)
Opening and ending tag mismatch: td and font
Unexpected end tag : p
Unexpected end tag : form
> xpathApply(html, "//p[3]", xmlValue)[[1]]
[1] "You can subscribe to the list, or change your existing\r\n\t subscription, in the sections below.\r\n\t"

The 'xpath' is the path from the root of the document, through various nested tags, to tags of the specified type. "//p" says 'start at the root ('/') and look in all sub-nodes (that's the '//') for a 'p' tag'. ?xpathApply is a good starting place, as is http://www.w3.org/TR/xpath, especially http://www.w3.org/TR/xpath#path-abbrev

Martin

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
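A minimal follow-on sketch, using only the functions named above (getURL, htmlTreeParse, xpathApply, xmlValue) and the URL from the original post, collects the text of every paragraph rather than just the third; the exact output will of course depend on the page's current mark-up:

library(RCurl)
library(XML)

my.url <- "https://stat.ethz.ch/mailman/listinfo/r-help"

## parse the downloaded HTML into its 'internal' representation
html <- htmlTreeParse(getURL(my.url), useInternal = TRUE)

## xpathApply() returns one list element per matching 'p' node;
## unlist() collapses these into a plain character vector
paras <- unlist(xpathApply(html, "//p", xmlValue))
head(paras)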
Gabor Grothendieck
2008-Oct-07 16:52 UTC
[R] Extracting text from html code using the RCurl package.
I gather you are using Windows, in which case you could use RDCOMClient or rcom to get it via Internet Explorer, e.g.

library(RDCOMClient)
ie <- COMCreate("InternetExplorer.Application")
URL <- "https://stat.ethz.ch/mailman/listinfo/r-help"
ie$Navigate(URL)
while(ie[["Busy"]]) Sys.sleep(1)
txt <- ie[["document"]][["body"]][["innerText"]]
ie$Quit()

You may need to run this in elevated mode if you are on Vista.

On Mon, Oct 6, 2008 at 11:45 AM, Tony Breyal <tony.breyal at googlemail.com> wrote:

> I want to download the text from a web page; however, what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this?
> [...]
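A small variation on the same approach, assuming RDCOMClient and Internet Explorer are available as above (the helper name and the timeout value here are illustrative, not part of the original reply), wraps the calls in a function so that a page which never finishes loading cannot hang the loop forever:

library(RDCOMClient)

get_page_text <- function(url, timeout = 30) {
  ie <- COMCreate("InternetExplorer.Application")
  on.exit(ie$Quit())                   # close IE even if something fails
  ie$Navigate(url)
  waited <- 0
  ## poll until the page has loaded or the timeout is reached
  while (ie[["Busy"]] && waited < timeout) {
    Sys.sleep(1)
    waited <- waited + 1
  }
  ie[["document"]][["body"]][["innerText"]]
}

txt <- get_page_text("https://stat.ethz.ch/mailman/listinfo/r-help")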
Tony Breyal
2008-Oct-21 15:42 UTC
[R] Extracting text from html code using the RCurl package.
Thank you for your responses, Martin and Gabor, very much appreciated! In case anyone does a search for this topic, I thought I'd write a few comments below on what I have ended up doing.

re: Internet Explorer (IE) - Finding out that R can access IE was a very pleasant surprise! This works very well at extracting the plain text from an HTML-formatted page. The only downsides for me were (1) it is rather slow if you wish to convert lots of HTML files into plain text files, even if the HTML files are already on your computer, and (2) when trying to convert some HTML files, an IE 'pop-up' window may show up and execution cannot continue until that pop-up has been dealt with. There may be ways around this, but I am not aware of them.

## This is an example of the code I used:
library(RDCOMClient)
urls <- c("https://stat.ethz.ch/mailman/listinfo/r-help",
          "http://wiki.r-project.org/rwiki/doku.php?id=getting-started:what-is-r:what-is-r")
ie <- COMCreate("InternetExplorer.Application")
txt <- list()
for(u in urls) {
  ie$Navigate(u)
  while(ie[["Busy"]]) Sys.sleep(1)
  txt[[u]] <- ie[["document"]][["body"]][["innerText"]]
}
ie$Quit()
print(txt)

re: xpathApply() - I must admit that this was a little confusing when I first encountered it after reading your post, but after some reading I think I have found out how to get what I want. This seems to work almost as well as IE above, but I have found it to be faster for my purposes, probably because there is no need to wait for an external application, plus there is no danger of a 'pop-up' window showing. As far as I can tell, all plain text is extracted.

library(RCurl)
library(XML)
urls <- c("https://stat.ethz.ch/mailman/listinfo/r-help",
          "http://wiki.r-project.org/rwiki/doku.php?id=getting-started:what-is-r:what-is-r")
html.files <- txt <- list()
html.files <- getURL(urls, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
for(u in urls) {
  html <- htmlTreeParse(html.files[[u]], useInternal=TRUE)
  txt[[u]] <- toString(xpathApply(html,
      "//body//text()[not(ancestor::script)][not(ancestor::style)]",
      xmlValue))
}
print(txt)

Cheers,
Tony Breyal

On 6 Oct, 16:45, Tony Breyal <tony.bre... at googlemail.com> wrote:

> I want to download the text from a web page; however, what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this?
> [...]
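Since the slow step mentioned above is converting HTML files that are already on disk, a minimal sketch in the same XPath style (the local file names are hypothetical, purely for illustration) applies htmlTreeParse() directly to local files and skips the network entirely:

library(XML)

html.paths <- c("page1.html", "page2.html")   # hypothetical local file names
txt <- lapply(html.paths, function(p) {
  ## parse a file already on disk into its internal representation
  doc <- htmlTreeParse(p, useInternal = TRUE)
  ## keep only text nodes in the body that are not inside script or style tags
  toString(xpathApply(doc,
      "//body//text()[not(ancestor::script)][not(ancestor::style)]",
      xmlValue))
})
names(txt) <- html.paths
print(txt)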