Hello,

Below is some output that shows my issue.

I have a variable x that I read from a file (more on this below).

> x
[1] "NEW YORK NEW ENGLAND"
> gsub(" -", "-", x)   # this does not work!
[1] "NEW YORK NEW ENGLAND"
> Encoding(x)   # is x in a special encoding? no
[1] "unknown"
> y = "NEW YORK -NEW ENGLAND"   # I type in variable y
> gsub(" -", "-", y)   # and gsub works as expected
[1] "NEW YORK-NEW ENGLAND"
>

I'm sure the problem has to do with the way I read the variable x. But even if I change the encoding for x to ASCII, I still cannot do the sub.

I get x by reading a pdf file with pdftotext, so you will not be able to replicate my issue.

Thanks for any suggestions,
Adrian

On 10/14/2009 1:30 PM, Adrian Dragulescu wrote:
> Hello,
>
> Below is some output that shows my issue.
>
> I have a variable x that I read from a file (more on this below)
>
> > x
> [1] "NEW YORK NEW ENGLAND"
> > gsub(" -", "-", x) # this does not work!
> [1] "NEW YORK NEW ENGLAND"

It looks as though it worked, presumably because something got lost in your email. Could you post charToRaw(x) so we can see what's in x?

Duncan Murdoch

> > Encoding(x) # is x in a special encoding? no
> [1] "unknown"
> > y = "NEW YORK -NEW ENGLAND" # I type in variable y
> > gsub(" -", "-", y) # and gsub works as expected
> [1] "NEW YORK-NEW ENGLAND"
> >
>
> I'm sure the problem has to do with the way I read the variable x. But even if
> I change the encoding for x to ASCII, I still cannot do the sub.
> I get x by reading a pdf file with pdftotext so you will not be able to
> replicate my issue.
>
> Thanks for any suggestions,
> Adrian

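A minimal sketch of the check suggested above. The real contents of x were lost in the email, so the en dash (U+2013) used here is purely an assumption standing in for whatever pdftotext actually produced; the point is only that charToRaw() and utf8ToInt() expose characters that print like a plain "-" but are not one.

```r
# Hypothetical stand-in for x: the en dash is an assumption, not the
# character from the original post (which is unknown).
x <- "NEW YORK \u2013NEW ENGLAND"
y <- "NEW YORK -NEW ENGLAND"      # typed by hand, plain ASCII hyphen

# Raw bytes: the ASCII hyphen is 2d; the en dash shows up as e2 80 93.
charToRaw(y)
charToRaw(x)

# Code points above 127 mark the non-ASCII characters and their positions.
non_ascii <- which(utf8ToInt(x) > 127)

# The ASCII pattern does not match the look-alike character ...
unchanged <- gsub(" -", "-", x)

# ... but a pattern containing the character actually present does.
fixed <- gsub(" \u2013", "-", x)
```

Once the bytes are known, the pattern passed to gsub() can be written with the matching \u escape instead of the character typed from the keyboard.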
Hi,

I'm trying to download some data from the web and am running into problems with 'embedded null' characters. These seem to indicate to R that it should stop processing the page, so I'd like to remove them. I've been looking around and can't seem to identify exactly what the character is and consequently how to remove it.

# THE CODE WORKS ON THIS PAGE
library(RCurl)
library(XML)
theurl <- "en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)

# BUT DOES NOT WORK HERE DUE TO EMBEDDED NULL CHARACTERS
theurl <- "screen.yahoo.com/b?pr=1/&s=nm&db=stocks&vw=0&b=21"
webpage <- getURL(theurl)

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  Failed writing body (1371 != 1461)
In addition: Warning messages:
1: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  truncating string with embedded nul: 'ttp://finance. ## I DELETED SOME HERE FOR BREVITY## al>\nData and [... truncated]
2: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  only read 1371 of the 1461 input bytes/characters

# THIS CODE COPIES THE PROBLEMATIC PAGE TO MY COMPUTER
destfile <- "file:///C:/projects/stock data/data/test.htm"
download.file(theurl, destfile, quiet = TRUE)

# WHICH LEAVES ME WITH JUST IDENTIFYING WHAT CHARACTER IS CAUSING THE
# PROBLEM AND THEN GETTING RID OF IT.

I'd appreciate any advice.

--
Best regards,
David Young
Marketing and Statistical Consultant
Madrid, Spain
+34 913 540 381
linkedin.com/in/europedavidyoung
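
One possible way to find and drop the embedded nul bytes, sketched in base R only. It assumes the page has already been saved to disk; the temporary file and the byte values below are made-up stand-ins for the downloaded test.htm, not data from the real page. The idea is to read the file as raw bytes (R character strings cannot hold a nul), locate the 00 bytes, remove them, and only then convert to text.

```r
# Hypothetical stand-in for the downloaded page: "<p>hi</p>" with two
# nul bytes inserted. The real file would be the one saved by download.file().
tmp <- tempfile(fileext = ".htm")
writeBin(as.raw(c(0x3c, 0x70, 0x3e, 0x00, 0x68, 0x69, 0x00,
                  0x3c, 0x2f, 0x70, 0x3e)), tmp)

# Read the whole file as raw bytes rather than as a character string.
raw_page <- readBin(tmp, what = "raw", n = file.size(tmp))

# Positions of the embedded nul bytes.
nul_pos <- which(raw_page == as.raw(0))

# Drop the nuls, then convert the remaining bytes to a string.
clean_page <- rawToChar(raw_page[raw_page != as.raw(0)])

unlink(tmp)
```

The cleaned string could then be handed to an HTML parser. Whether silently dropping the nuls is appropriate depends on why the server sends them, which this sketch does not address.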