thr3ads.net - R help - [R] Figuring out encodings of PDFs in R [Jun 2012]

Jonas Michaelis

2012-Jun-26 19:28 UTC

[R] Figuring out encodings of PDFs in R

Dear list,

I am currently scraping some text data from several PDFs using the
readPDF() function in the tm package. This all works very well and in most
cases the encoding seems to be "latin1" - in some, however, it is not. Is
there a good way in R to check character encodings? I found the functions
is.utf8() and is.local() in the tau package but that obviously only gets me
so far.

Thanks.

Duncan Murdoch

2012-Jun-27 00:07 UTC

On 12-06-26 3:28 PM, Jonas Michaelis wrote:
> Dear list,
>
> I am currently scraping some text data from several PDFs using the
> readPDF() function in the tm package. This all works very well and in most
> cases the encoding seems to be "latin1" - in some, however, it is not. Is
> there a good way in R to check character encodings? I found the functions
> is.utf8() and is.local() in the tau package but that obviously only gets me
> so far.
>
There are heuristics for guessing encodings, but I don't think they are 
built into R.  I think the way to do what you want is to read the PDF 
spec to find out how the strings are encoded in the source file, and 
believe that.

Duncan Murdoch
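
For readers who want a starting point, the kind of heuristic mentioned above can be sketched in base R. This is only a rough illustration, not something proposed in the thread: guess_encoding() is a hypothetical helper name, it assumes the only candidates are ASCII, UTF-8 and latin1, and it relies on validUTF8(), which was added to base R (3.3.0) after this thread was written. Because every byte sequence is a legal latin1 string, latin1 can only be assumed as a fallback, never confirmed.

```r
# Hypothetical helper (not from the thread): guess the encoding of one string.
# Assumption: the text is ASCII, UTF-8 or latin1 and nothing else.
guess_encoding <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  bytes <- as.integer(charToRaw(x))
  # Only 7-bit bytes: the string reads the same in all three encodings.
  if (all(bytes < 128L)) return("ASCII")
  # High bytes that form well-formed UTF-8 sequences are very likely UTF-8.
  if (validUTF8(x)) return("UTF-8")
  # Anything else: latin1 is a guess, since any byte sequence is valid latin1.
  "latin1 (assumed)"
}

# Example inputs built from raw bytes so they do not depend on the locale.
utf8_str   <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))  # "café" in UTF-8
latin1_str <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))        # "café" in latin1

guess_encoding("plain")
guess_encoding(utf8_str)
guess_encoding(latin1_str)

# Once a guess is accepted, iconv() can normalise the text to UTF-8.
converted <- iconv(latin1_str, from = "latin1", to = "UTF-8")
```

A check like this only narrows the possibilities. As the reply says, the encoding declared in the PDF itself remains the more reliable source when it is available.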