Wolfgang Grond
2022-Sep-06 09:39 UTC
[R] Reading PDF files with German umlauts using tabulizer
Dear all,

I am having trouble reading German-language PDF files.

I want to extract text and tables with the tabulizer package, and everything goes well as long as I read English texts.

When I run the same code

text <- extract_text(file = "Pub_001.pdf")

on documents in German, the umlauts are not recognized. They are either replaced by a combination of characters, so that instead of

"Entmischung und Kristallisation in Gläsern des Systems"

I get

"Entmischung und Kristallisation in GHisern des Systems"

or replaced by a plain ASCII letter, so that instead of

"In Gläsern des Systems"

I get

"In Glasern des Systems"

Opening the file with Adobe Reader tells me that the encoding is "Ansi".

Is there a way to read this file correctly?

Thanks in advance for any ideas.

Regards
Hi!

The package "tabulizer" seems to have been removed from the package repositories, so it is a bit hard to test. I found the documentation, and the syntax of "extract_text" is:

extract_text(file, pages = NULL, area = NULL, password = NULL, encoding = NULL, copy = FALSE)

So have you tried setting the "encoding" parameter?

HTH,
Kimmo
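
For illustration, here is a minimal, untested sketch of what trying the "encoding" argument could look like. It assumes that tabulizer is installed from some other source and that "Pub_001.pdf" is in the working directory. The candidate values "UTF-8" and "latin1" are guesses, and has_umlaut() is a hypothetical helper, not part of tabulizer. If the PDF's fonts carry no usable character mapping, no encoding setting may help.

```r
# Sketch only: assumes tabulizer is installed and the PDF exists locally.
pdf_file <- "Pub_001.pdf"

# Candidate encodings to try; whether either one helps is an assumption.
encodings <- c("UTF-8", "latin1")

# Hypothetical helper (not part of tabulizer): TRUE if the text contains
# at least one German umlaut or sharp s. Unicode escapes keep the script
# itself ASCII-only.
has_umlaut <- function(x) {
  grepl("[\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df]", x)
}

# Only run the extraction when both the package and the file are available.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf_file)) {
  for (enc in encodings) {
    txt <- tabulizer::extract_text(file = pdf_file, encoding = enc)
    cat(enc, ": umlauts found =", any(has_umlaut(txt)), "\n")
  }
}
```

If neither setting reports any umlauts, the problem is probably in how the PDF stores its text and not in how R marks the resulting strings.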