Andre Zege
2013-Mar-20 17:07 UTC
[R] htmlParse (from XML library) working sporadically in the same code
I am using htmlParse from the XML package on a particular website. Sometimes the code works, but most of the time it fails, and I cannot see why. The page I am trying to parse is

http://www.londonstockexchange.com/exchange/prices-and-markets/international-markets/indices/home/sp-500.html?page=0

Sometimes the following code works:

n <- readHTMLTable(htmlParse(url))

But most of the time it returns the following error from htmlParse:

Error: failed to load HTTP resource

The error comes from the following line in the htmlParse code:

ans <- .Call("RS_XML_ParseTree", as.character(file), handlers, as.logical(ignoreBlanks), as.logical(replaceEntities), as.logical(asText), as.logical(trim), as.logical(validate), as.logical(getDTD), as.logical(isURL), as.logical(addAttributeNamespaces), as.logical(useInternalNodes), as.logical(isHTML), as.logical(isSchema), as.logical(fullNamespaceInfo), as.character(encoding), as.logical(useDotNames), xinclude, error, addFinalizer, as.integer(options), PACKAGE = "XML")

By the way, readHTMLTable(htmlParse(url)) works fine on other pages, so the problem is somehow related to this page.

I am using the 64-bit R.15.3 version on a Windows machine.

Thanks much,
Andre
Duncan Temple Lang
2013-Mar-20 18:18 UTC
[R] htmlParse (from XML library) working sporadically in the same code
When readHTMLTable(), or more generally the HTML/XML parser, fails to retrieve a URL, I suggest you check whether a different approach will work. You can use the download.file() function, readLines(url()), or getURLContent() from the RCurl package to get the content of the URL. Then you can pass that content to readHTMLTable() via

readHTMLTable(htmlParse(text, asText = TRUE))

or

readHTMLTable(text, asText = TRUE)

D.
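The fetch-then-parse approach suggested in the reply can be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the helper names tables_from_text() and tables_from_url() and the inline sample_html document are hypothetical, and readLines(url()) is just one of the three download options mentioned (download.file() or RCurl's getURLContent() could be swapped in).

```r
library(XML)

# Hypothetical helper (not part of the XML package): parse HTML that is
# already held in memory and return every <table> as a data frame.
tables_from_text <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Hypothetical helper: download the page with base R rather than letting
# the parser fetch the URL itself, then hand the text to the parser.
tables_from_url <- function(u) {
  con <- url(u)
  on.exit(close(con))
  page_lines <- readLines(con, warn = FALSE)
  tables_from_text(paste(page_lines, collapse = "\n"))
}

# Offline check using a small made-up document, so no network is needed.
sample_html <- paste0(
  "<html><body><table>",
  "<tr><th>Code</th><th>Price</th></tr>",
  "<tr><td>ABC</td><td>10.5</td></tr>",
  "</table></body></html>"
)
tabs <- tables_from_text(sample_html)
```

For the page in question, the call would be tables_from_url() with the URL from the original post. Separating the download from the parsing also makes it easier to see which of the two steps is failing when the error is intermittent.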