Hi all,
Sorry for the rather uninformative subject, but the error I get is not very
informative either.
When using the XML and RCurl packages to retrieve the content of an HTML page,
htmlTreeParse fails, printing out the beginning of the HTML:
Error in htmlTreeParse(getURL(url)) :
File <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="de" lang="de">
<head>
<title>Deutsches Krebsforschungszentrum</title>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1" />
<meta http-equiv="Content-Style-Type"
content="text/css" />
<meta http-equiv="imagetoolbar" content="no" />
<meta name="MSSmartTagsPreventParsing"
content="true" />
<meta name="revisit-after" content="5 days" />
<meta name="language" content="de" />
<meta lang="de" content="" xml:lang="de"
name="keywords">
<meta lang="de" xml:lang="de"
name="description" content="Das Deutsche Krebsforschungszentrum
hat die Aufgabe, die Mechanismen der Krebsentstehung systematisch zu erforschen
und Risikofaktoren für Krebserkrankungen zu erfassen. Aus den Ergebnissen
dieser grundlegenden Arbeiten sollen neue Ans?
This code reproduces the error:
library(RCurl)
library(XML)
url <-
"www.dkfz.de/en/genetics/pages/projects/bioinformatics/Custom_Chip_Definition_File.html"
htmlTreeParse(getURL(url))
The issue seems to originate in htmlTreeParse, as getURL alone works and returns
the expected content. I checked whether it could be an encoding issue, and as
far as I can tell it is not.
Moreover, using htmlParse(paste("http://",url,sep="")) works.
Note that htmlTreeParse(getURL(paste("http://",url,sep="")))
fails too; the "http://" matters only for htmlParse, so that it
identifies the string as a URL.
This issue is rather new, and as I've been using the same version of XML and
RCurl, I suppose it might have to do with some of the content of the website
having been updated, but given the error, I can't quite figure out what is
raising it.
Although it works on that simple example, using htmlParse is not really a
workaround, as I need to pass additional arguments to the getURL call (such as
userpwd), which I can't provide to htmlParse.
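One thing that may be worth trying is sketched below. It rests on an unconfirmed assumption: that the "File <!DOCTYPE ..." error means htmlTreeParse is taking the downloaded string for a file name rather than for HTML content. If so, passing asText = TRUE should force it to parse the string directly, and the getURL call with its extra arguments can stay as it is. The html_text string here is only a stand-in for whatever getURL(url, userpwd = ...) returns, so that the example runs without network access.

```r
library(XML)

# Stand-in for the character string returned by getURL(url, userpwd = ...).
# This content is a shortened placeholder, not the real page.
html_text <- paste0(
  "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" ",
  "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n",
  "<html><head><title>Deutsches Krebsforschungszentrum</title></head>",
  "<body><p>placeholder</p></body></html>"
)

# asText = TRUE tells the parser that the first argument is the document
# itself, not a file name or URL.
doc <- htmlTreeParse(html_text, asText = TRUE, useInternalNodes = TRUE)

# Quick check that the document was parsed: pull out the <title> text.
page_title <- xpathSApply(doc, "//title", xmlValue)
```

With the real page, the same call would read htmlTreeParse(getURL(url, ...), asText = TRUE), keeping any getURL options such as userpwd.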
Any hints would be greatly appreciated,
Cheers,
Nico
sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.9-4 RCurl_1.91-1 bitops_1.0-4.1
loaded via a namespace (and not attached):
[1] tools_2.15.0
---------------------------------------------------------------
Nicolas Delhomme
Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center
Tel: +46 90 786 7989
Email: nicolas.delhomme at plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------