thr3ads.net - R help - [R] XML htmlTreeParse fails with no obvious error [Jun 2012]

Nicolas Delhomme

2012-Jun-08 12:34 UTC

[R] XML htmlTreeParse fails with no obvious error

Hi all,

Sorry for the rather uninformative subject, but the error I get is not very
informative either.

When I use the XML and RCurl packages to retrieve the content of an HTML page,
htmlTreeParse fails and prints the beginning of the HTML as the error message:

Error in htmlTreeParse(getURL(url)) : 
  File   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="de" lang="de">
    <head>
      <title>Deutsches Krebsforschungszentrum</title>
      <meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1" />
      <meta http-equiv="Content-Style-Type"
content="text/css" />
      <meta http-equiv="imagetoolbar" content="no" />
      <meta name="MSSmartTagsPreventParsing"
content="true" />
      <meta name="revisit-after" content="5 days" />
      <meta name="language" content="de" />
      <meta lang="de" content="" xml:lang="de"
name="keywords">
      <meta lang="de" xml:lang="de"
name="description" content="Das Deutsche Krebsforschungszentrum
hat die Aufgabe, die Mechanismen der Krebsentstehung systematisch zu erforschen
und Risikofaktoren für Krebserkrankungen zu erfassen. Aus den Ergebnissen
dieser grundlegenden Arbeiten sollen neue Ans?

This code reproduces the error:

library(RCurl)
library(XML)
url <-
"www.dkfz.de/en/genetics/pages/projects/bioinformatics/Custom_Chip_Definition_File.html"
htmlTreeParse(getURL(url))

The issue seems to originate in htmlTreeParse, as getURL alone works and returns
the expected content. I checked whether it could be an encoding issue, and as
far as I can tell it is not.

Moreover, using htmlParse(paste("http://",url,sep="")) works.
Note that htmlTreeParse(getURL(paste("http://",url,sep="")))
fails too; the "http://" matters only for htmlParse, so that it
identifies the string as a URL.
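
A possible explanation, offered as an assumption and not confirmed anywhere in this post: htmlTreeParse accepts a file name, a URL, or the document text itself as its first argument, and the "File ..." prefix in the error suggests the downloaded string was taken for a file name. The documented asText argument states explicitly that the argument is content. The sketch below uses a made-up HTML string with leading whitespace as a stand-in for the result of getURL(), so it runs without network access.

```r
library(XML)

# Hypothetical stand-in for the string returned by getURL(url).
# The leading newline and spaces mimic the start of the page in the error above.
html <- paste0(
  "\n  <!DOCTYPE html>\n",
  "<html><head><title>Deutsches Krebsforschungszentrum</title></head>",
  "<body><p>test</p></body></html>"
)

# asText = TRUE tells the parser that the first argument is the document
# itself, not a file name or URL. useInternalNodes = TRUE makes XPath usable.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Extract the page title to confirm that the parse succeeded.
title <- xpathSApply(doc, "//title", xmlValue)
print(title)
```

If this assumption holds, htmlTreeParse(getURL(url), asText = TRUE) would be the corresponding change to the failing call.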

This issue is rather new, and as I've been using the same version of XML and
RCurl, I suppose it might have to do with some of the content of the website
having been updated, but given the error, I can't quite figure out what is
raising it.

Although it works on that simple example, using htmlParse is not really a
workaround, as I need to pass additional arguments to the getURL call (such as
userpwd), which I can't provide to htmlParse.
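
One way to keep the getURL options and still parse the result, sketched under the assumption that the failure comes from the downloaded string being treated as a file name: download with getURL, then hand the text to the parser with asText = TRUE. The function name fetch_and_parse and the credentials in the usage comment are hypothetical; getURL's .opts argument, the userpwd curl option, and htmlParse's asText argument are existing parts of RCurl and XML.

```r
library(RCurl)
library(XML)

# Hypothetical helper: download a page with optional basic authentication,
# then parse the returned text instead of letting the parser fetch the URL.
fetch_and_parse <- function(url, userpwd = NULL) {
  opts <- list()
  if (!is.null(userpwd)) {
    opts$userpwd <- userpwd  # curl option, format "user:password"
  }
  txt <- getURL(url, .opts = opts)
  # asText = TRUE marks txt as document content, not a file name or URL.
  htmlParse(txt, asText = TRUE)
}

# Usage (placeholder credentials, requires network access):
# doc <- fetch_and_parse(
#   "www.dkfz.de/en/genetics/pages/projects/bioinformatics/Custom_Chip_Definition_File.html",
#   userpwd = "user:secret"
# )
```

Keeping the download and the parse as separate steps means any other curl option can be added to opts without touching the parsing call.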

Any hints would be greatly appreciated,

Cheers,

Nico

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] XML_3.9-4      RCurl_1.91-1   bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.15.0

---------------------------------------------------------------
Nicolas Delhomme

Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center

Tel: +46 90 786 7989
Email: nicolas.delhomme at plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------
