thr3ads.net - R help - [R] puzzle using gsub (and encodings maybe) [Oct 2009]

Adrian Dragulescu

2009-Oct-14 17:30 UTC

[R] puzzle using gsub (and encodings maybe)

Hello,

Below is some output that shows my issue.

I have a variable x that I read from a file (more on this below)
> x
[1] "NEW YORK NEW ENGLAND"
> gsub(" -", "-", x)            # this does not work!
[1] "NEW YORK NEW ENGLAND"
> Encoding(x)                   # is x in a special encoding? no
[1] "unknown"
> y = "NEW YORK -NEW ENGLAND"   # I type in variable y
> gsub(" -", "-", y)            # and gsub works as expected
[1] "NEW YORK-NEW ENGLAND"
>

I'm sure the problem has to do with the way I read the variable x.  But even if
I change the encoding for x to ASCII, I still cannot do the sub.
I get x by reading a pdf file with pdftotext so you will not be able to 
replicate my issue.

Thanks for any suggestions,
Adrian
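
A minimal sketch of why Encoding(x) returning "unknown" does not rule out non-ASCII content: "unknown" only means the string carries no encoding mark, and changing the mark does not change the stored bytes. The string below is a hypothetical stand-in for x; the 0xa0 byte (a Latin-1 no-break space) is an assumption, since the real pdftotext output is not shown in the thread.

```r
# Hypothetical stand-in for x, built from raw bytes so that it carries no
# encoding mark. The 0xa0 byte is an assumption, not taken from the thread.
x <- rawToChar(as.raw(c(0x4b, 0xa0, 0x2d, 0x4e)))   # "K", 0xa0, "-", "N"

mark_before <- Encoding(x)        # the mark before any change

# Re-marking the string changes only the label, not the stored bytes.
bytes_before <- charToRaw(x)
Encoding(x) <- "latin1"
bytes_after <- charToRaw(x)
same_bytes <- identical(bytes_before, bytes_after)

# A direct test for non-ASCII content: any byte value above 127.
has_non_ascii <- any(as.integer(bytes_after) > 127L)
```

If has_non_ascii is TRUE for the real x, a pattern typed as plain ASCII (such as " -") may simply not match the characters that are actually there.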

Duncan Murdoch

2009-Oct-14 17:38 UTC

[R] puzzle using gsub (and encodings maybe)

On 10/14/2009 1:30 PM, Adrian Dragulescu wrote:
> Hello,
> 
> Below is some output that shows my issue.
> 
> I have a variable x that I read from a file (more on this below)
> 
>> x
> [1] "NEW YORK NEW ENGLAND"
>> gsub(" -", "-", x)            # this does not work!
> [1] "NEW YORK NEW ENGLAND"
It looks as though it worked, presumably because something got lost in 
your email.

Could you post charToRaw(x) so we can see what's in x?
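
As a sketch of what that check might reveal, assuming (and this is only a guess) that the character before the hyphen is a no-break space, U+00A0, rather than an ordinary space. The value of x here is made up for illustration:

```r
# Hypothetical stand-in for the pdftotext output; the no-break space
# (U+00A0) before the hyphen is an assumption, not confirmed in the thread.
x <- "NEW YORK\u00a0-NEW ENGLAND"
y <- "NEW YORK -NEW ENGLAND"     # typed by hand, plain ASCII

# Compare the raw bytes; any byte above 7f marks a non-ASCII character.
charToRaw(x)
charToRaw(y)

# utf8ToInt() gives code points, which can be easier to read than bytes.
cp <- utf8ToInt(enc2utf8(x))
non_ascii <- cp[cp > 127L]       # code points that are not plain ASCII

# The ASCII-only pattern does not occur in x, but it does occur in y.
pattern_in_x <- grepl(" -", x, fixed = TRUE)
fixed_y <- gsub(" -", "-", y, fixed = TRUE)

# Matching the no-break space explicitly makes the substitution work.
fixed_x <- gsub("\u00a0-", "-", x, fixed = TRUE)
```

Once the offending code point is known, it can be written into the pattern with a \u escape, as in the last line.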

Duncan Murdoch
>> Encoding(x)                   # is x in a special encoding? no
> [1] "unknown"
>> y = "NEW YORK -NEW ENGLAND"   # I type in variable y
>> gsub(" -", "-", y)            # and gsub works as expected
> [1] "NEW YORK-NEW ENGLAND"
>>
> 
> I'm sure the problem has to do with the way I read the variable x.  But even if
> I change the encoding for x to ASCII, I still cannot do the sub.
> I get x by reading a pdf file with pdftotext so you will not be able to 
> replicate my issue.
> 
> Thanks for any suggestions,
> Adrian

David Young

2009-Oct-14 17:49 UTC

[R] Removing Embedded Null characters from text/html

Hi,

I'm trying to download some data from the web and am running into
problems with 'embedded null' characters.  These seem to indicate to R
that it should stop processing the page so I'd like to remove them.
I've been looking around and can't seem to identify exactly what the
character is and consequently how to remove it.

# THE CODE WORKS ON THIS PAGE
library(RCurl)
library(XML)
theurl <- "en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)

# BUT DOES NOT WORK HERE DUE TO EMBEDDED NULL CHARACTERS
theurl <- "screen.yahoo.com/b?pr=1/&s=nm&db=stocks&vw=0&b=21"
webpage <- getURL(theurl)

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  Failed writing body (1371 != 1461)
In addition: Warning messages:
1: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  truncating string with embedded nul: 'ttp://finance.  
  ## I DELETED SOME HERE FOR BREVITY##  al>\nData and  [... truncated]
2: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  only read 1371 of the 1461 input bytes/characters

# THIS CODE COPIES THE PROBLEMATIC PAGE TO MY COMPUTER
destfile <- "file:///C:/projects/stock data/data/test.htm"
download.file(theurl, destfile, quiet = TRUE)

# WHICH LEAVES ME WITH JUST IDENTIFYING WHAT CHARACTER IS CAUSING THE
# PROBLEM AND THEN GETTING RID OF IT.

I'd appreciate any advice.


-- 
Best regards,

David Young
Marketing and Statistical Consultant
Madrid, Spain
+34 913 540 381
linkedin.com/in/europedavidyoung
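
One possible approach, sketched under assumptions: read the saved page as raw bytes, drop the 0x00 bytes, and only then convert to a character string. The temporary file and its contents below are made up so the example runs on its own; in practice the path would be the downloaded file. RCurl's getBinaryURL(), which returns a raw vector, might serve as the source of the bytes instead, though that is not tested here.

```r
# Build a small test file containing embedded nul bytes. The file and its
# contents are invented for illustration; they stand in for the saved page.
path <- tempfile(fileext = ".htm")
writeBin(as.raw(c(0x3c, 0x70, 0x3e, 0x00,          # "<p>" then a nul
                  0x68, 0x69, 0x00,                # "hi" then a nul
                  0x3c, 0x2f, 0x70, 0x3e)), path)  # "</p>"

# Read the file as raw bytes, so nothing is truncated at a nul.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
n_nul <- sum(bytes == as.raw(0))   # how many nul bytes the file holds

# Drop the nul bytes, then convert what is left to a character string.
clean <- rawToChar(bytes[bytes != as.raw(0)])
unlink(path)
```

Counting the nul bytes first (n_nul) also confirms that 0x00 really is the character causing the warning before anything is removed.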
