Hi,

I'm trying to download some data from the web and am running into problems with 'embedded null' characters. These seem to indicate to R that it should stop processing the page so I'd like to remove them. I've been looking around and can't seem to identify exactly what the character is and consequently how to remove it.

# THE CODE WORKS ON THIS PAGE
library(RCurl)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)

# BUT DOES NOT WORK HERE DUE TO EMBEDDED NULL CHARACTERS
theurl <- "http://screen.yahoo.com/b?pr=1/&s=nm&db=stocks&vw=0&b=21"
webpage <- getURL(theurl)

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  Failed writing body (1371 != 1461)
In addition: Warning messages:
1: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  truncating string with embedded nul: 'ttp://finance. ## I DELETED SOME HERE FOR BREVITY## al>\nData and [... truncated]
2: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  only read 1371 of the 1461 input bytes/characters

# THIS CODE COPIES THE PROBLEMATIC PAGE TO MY COMPUTER
destfile<-"file:///C:/projects/stock data/data/test.htm"
download.file ( theurl , destfile , quiet = TRUE )

# WHICH LEAVES ME WITH JUST IDENTIFYING WHAT CHARACTER IS CAUSING THE
# PROBLEM AND THEN GETTING RID OF IT.

I'd appreciate any advice.

--
Best regards,
David Young
Marketing and Statistical Consultant
Madrid, Spain
+34 913 540 381
http://www.linkedin.com/in/europedavidyoung
mailto:dyoung at telefonica.net

Duncan Temple Lang
2009-Oct-16 15:58 UTC
[R] Removing Embedded Null characters from text/html
[David contacted me directly, so I am sending my off-line reply to the list just for the record in case others encounter a similar problem.]

Hi David.

No problem contacting me at all. I saw your mail at one point on the mailing list, but didn't have a chance to respond.

Indeed, it seems like there is some embedded null in the string. I need to investigate more about what is happening with the encoding, etc. and whether it is on the RCurl or R side.

But for the meantime, the following two approaches seem to get around the problem:

1) Just use htmlParse(url) on the URL directly, i.e. don't use RCurl. We only need basic HTTP facilities, and htmlParse() (or more specifically libxml2) provides these for us.

2) If you need RCurl to manage the connection and communication for the HTTP request, use

   txt = rawToChar(getURLContent(url, binary = TRUE))
   # You'll see a warning about truncation
   htmlParse(txt, asText = TRUE)

BTW, use htmlTreeParse() or htmlParse(). I use the latter and then XPath expressions via getNodeSet() or xpathApply() to extract content from the document.

HTH,
D.
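
On the original question of identifying the offending character and removing it: the "embedded nul" in the warning is the zero byte (0x00). The following is only a minimal sketch in base R, not something taken from the replies above. The helper name strip_nuls() and the sample bytes are made up for illustration. It locates the zero bytes in a raw vector and drops them before converting to a character string.

```r
# Hypothetical helper (not part of RCurl or XML): remove zero bytes from a
# raw vector and return the remainder as a single character string.
strip_nuls <- function(bytes) {
  stopifnot(is.raw(bytes))
  rawToChar(bytes[bytes != as.raw(0)])
}

# Made-up sample: the bytes of "<p>hi</p>" with two zero bytes inserted.
bytes <- as.raw(c(0x3c, 0x70, 0x3e, 0x00, 0x68, 0x69, 0x00,
                  0x3c, 0x2f, 0x70, 0x3e))

# Positions of the embedded nul bytes within the raw vector.
nul_positions <- which(bytes == as.raw(0))

# Text with the nul bytes dropped.
clean <- strip_nuls(bytes)

# Assumed usage with RCurl and XML (not run here; requires network access,
# and assumes getURLContent(..., binary = TRUE) returns a raw vector):
#   raw_page <- getURLContent(theurl, binary = TRUE)
#   doc <- htmlParse(strip_nuls(raw_page), asText = TRUE)
```

Working on the raw bytes avoids handing R a string that contains a nul in the first place, which is what appears to trigger the truncation. Whether discarding those bytes is safe for a given page is a judgment call, since they may indicate an encoding problem rather than stray padding.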