Wolfgang Grond
2022-Sep-06 09:39 UTC
[R] Reading PDF files with German umlauts using tabulizer
Dear all,
I am having trouble reading PDF files written in German.
I want to extract text and tables with the tabulizer package, and
everything goes well as long as I read English texts.
When I try the same code
text <- extract_text(file = "Pub_001.pdf")
with documents in German,
the German umlauts are not recognized.
They are either replaced by a combination of characters:
Instead of
"Entmischung und Kristallisation in Gläsern des Systems"
                                      -
I get
"Entmischung und Kristallisation in GHisern des Systems"
                                     --
or replaced by plain ASCII characters, like this:
Instead of
"In Gläsern des Systems"
      -
I get
"In Glasern des Systems"
      -
Opening the file with Adobe Reader tells me that the encoding is "Ansi".
Is there a way to read this file correctly?
Thanks in advance for any idea.
Regards
Hi!
The package "tabulizer" seems to have been removed from the package repositories, so it is a bit hard to test.
I found the documentation, and the syntax of "extract_text" is:
extract_text(file, pages = NULL, area = NULL, password = NULL, encoding = NULL, copy = FALSE)
So have you tried setting the "encoding" parameter?
HTH,
Kimmo
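For illustration, the suggestion to set the "encoding" parameter might look like the sketch below. It is untested against the original file and rests on several assumptions: that tabulizer is still installed locally, that "Pub_001.pdf" (the file name from the original post) is in the working directory, and that "UTF-8" or "latin1" are sensible values to try. The helper has_umlauts() is hypothetical and only exists to compare the results. If the umlauts are already wrong in the PDF's embedded text layer, changing the encoding may not help.

```r
# Sketch only, not a verified fix.

# Hypothetical helper: TRUE if any element of x contains a German umlaut or sharp s.
# Unicode escapes are used so the pattern does not depend on the script's own encoding.
has_umlauts <- function(x) {
  any(grepl("[\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df]", x))
}

pdf_file <- "Pub_001.pdf"  # file name taken from the original post

# Only run the extraction if the package and the file are actually available.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf_file)) {
  # Default call, as in the original post
  txt_default <- tabulizer::extract_text(file = pdf_file)

  # Same call with the encoding argument from the documented signature.
  # "UTF-8" and "latin1" are assumed candidate values, not known-good ones.
  txt_utf8   <- tabulizer::extract_text(file = pdf_file, encoding = "UTF-8")
  txt_latin1 <- tabulizer::extract_text(file = pdf_file, encoding = "latin1")

  # Report which variant, if any, preserves the umlauts
  print(c(default = has_umlauts(txt_default),
          utf8    = has_umlauts(txt_utf8),
          latin1  = has_umlauts(txt_latin1)))
}
```

If none of the variants reports TRUE, the problem probably lies in the PDF itself rather than in how R marks the encoding of the extracted string.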