thr3ads.net - R help - [R] grep triggering error on unicode character [Oct 2010]


Dennis Fisher

2010-Oct-11 19:36 UTC

[R] grep triggering error on unicode character

Colleagues,

[R 2.11; OS X]

I am processing a file on the fly that contains the following text:
	XXXáá
[email clients may display this differently -- the string is three X's
followed by two instances of the letter a with an acute accent]
I read the file with:
	X	<- readLines(FILENAME)
In this instance, the text of interest is on line 213.  When I examine line 213,
it reads:
	XXX\xe1\xe1
This makes sense because the Unicode mapping for á [a-acute] is U+00E1.

The problem arises when I attempt to manipulate the text in the file.  For
example:
	> grep("XXX", X[213])
	integer(0)
	Warning message:
	In grep("XXX", X[213]) : input string 1 is invalid in this locale
Worse, yet:
	> tolower(X[213]) 
	Error in tolower(X[213]) : invalid multibyte string 1 

I am focussing on resolving the first problem, i.e., identifying a line
containing XXX.  If I can do so, I can remove the offending lines before I
execute the tolower command.
However, I am stumped as to how to resolve either problem.

Any help would be appreciated.

Thanks.

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

Duncan Murdoch

2010-Oct-11 20:06 UTC

[R] grep triggering error on unicode character

On 11/10/2010 3:36 PM, Dennis Fisher wrote:
> Colleagues,
>
> [R 2.11; OS X]
>
> I am processing a file on the fly that contains the following text:
> 	XXXáá
> [email clients may display this differently -- the string is three X's
> followed by two instances of the letter a with an acute accent]
> I read the file with:
> 	X	<- readLines(FILENAME)
> In this instance, the text of interest is on line 213.  When I examine line
> 213, it reads:
> 	XXX\xe1\xe1
> This makes sense because the Unicode mapping for á [a-acute] is U+00E1.

That's not what it's saying: it's saying you have three X's followed by
two unrecognized bytes, each with hex code E1.  I imagine the original
file is encoded using Latin1, because that's how á is encoded there.

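The point about bytes versus code points can be checked directly. The following is only a minimal sketch, assuming the line really consists of the bytes shown in the `XXX\xe1\xe1` display; the variable names `x` and `y` are made up for illustration. It builds the string from raw bytes, so it does not depend on how the script itself is encoded.

```r
# "XXX" followed by two 0xE1 bytes, as in the printed value XXX\xe1\xe1.
x <- rawToChar(as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1)))

# One byte per accented letter: this is the Latin-1 layout.
charToRaw(x)
# [1] 58 58 58 e1 e1

# In UTF-8, the code point U+00E1 is stored as two bytes, not as a single e1.
charToRaw(intToUtf8(0xE1L))
# [1] c3 a1

# Converting from Latin-1 gives a valid UTF-8 string of five characters.
y <- iconv(x, from = "latin1", to = "UTF-8")
nchar(y, type = "chars")
# [1] 5
```

So U+00E1 is the code point, while a lone E1 byte is how Latin-1 stores it. That is why a UTF-8 locale rejects the string.
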
> The problem arises when I attempt to manipulate the text in the file.  For
> example:
> 	>  grep("XXX", X[213])
> 	integer(0)
> 	Warning message:
> 	In grep("XXX", X[213]) : input string 1 is invalid in this locale
> Worse, yet:
> 	>  tolower(X[213])
> 	Error in tolower(X[213]) : invalid multibyte string 1
>
> I am focussing on resolving the first problem, i.e., identifying a line
> containing XXX.  If I can do so, I can remove the offending lines before I
> execute the tolower command.
> However, I am stumped as to how to resolve either problem.
>
> Any help would be appreciated.
You need to declare the encoding of the file when you read it if it's 
not in the default encoding for your locale, or re-encode it.  See 
?readLines.
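
Both options might look roughly like this. It is a sketch under assumptions, not a tested recipe for the original file: it assumes the file is Latin-1 (a guess, as above), and it uses a temporary file as a hypothetical stand-in for FILENAME. The final `useBytes = TRUE` call is not part of the advice above; it is included only as one way to locate the affected lines without interpreting their encoding.

```r
# Hypothetical stand-in for FILENAME: a temporary file with one Latin-1 line.
FILENAME <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1, 0x0a)), FILENAME)

# Option 1: declare the encoding while reading. This only marks the strings
# as Latin-1; the bytes themselves are left unchanged.
X1 <- readLines(FILENAME, encoding = "latin1")
Encoding(X1)

# Option 2: read the bytes as they are, then re-encode them to UTF-8.
X2 <- iconv(readLines(FILENAME), from = "latin1", to = "UTF-8")

# Both calls from the question, now run on the re-encoded text.
hits <- grep("XXX", X2)
lowered <- tolower(X2)

# Not from the thread: a byte-wise match that ignores the encoding entirely.
raw_hits <- grep("XXX", readLines(FILENAME), useBytes = TRUE)

unlink(FILENAME)
```

If the file turns out to be in some other encoding, the `from` argument (or the `encoding` value) would need to change accordingly; `iconvlist()` shows the names available on a given system.
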

Duncan Murdoch
>
> Thanks.
>
> Dennis
>
> Dennis Fisher MD
> P<  (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
>
