similar to: Figuring out encodings of PDFs in R


2010 Jan 09
4
parsing pdf files
I have a pdf file that I would like to parse into R: http://www.williams.edu/Registrar/geninfo/faculty.pdf For now, I open the file in Acrobat by hand, then save it "as text" and then use readLines(). That works fine but a) I am concerned that some information may be lost and b) I may be doing this a lot, so I would rather have R grab the information from the pdf file directly. So: is
2009 Dec 22
2
Reading PDF files
Hi: I need to do text mining on PDF files. I understand there is a readPDF command in tm that can be used. Have read the 2008 posts on converting PDF files to text by Tony Breyal and others. Wondering if the procedure has been standardized in any tutorial or otherwise? Being new to R, I was able to follow only part of the discussion. Any way to get a set of step by step instructions
2008 Nov 13
1
readPDF() -- unsure how to install xpdf to make this work?
Dear R-Help, I need to convert a set of '.pdf' files into an equivalent set of '.txt' files. This is so that I can do some text mining on the content. In the latest R-News letter (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package 'tm' for text mining is mentioned. In that lovely package, there is a function called 'readPDF()'. In order to use
2009 Dec 22
0
Reading PDF files (using xpdf)
Greetings Zaki, You should really post this question on the R-help forum so that others might benefit from any responses. It's been a while since I've done this, but if memory serves, the basic process was to download xpdf and add it to the Windows path, thus making it accessible from within R. Two methods follow: Method One (easiest) - using the awesome ?system command: (1) Download
2017 Aug 01
3
special latin1 do not print as glyphs in current devel on windows
Upon further inspection, I think there are at least two problems. First the issue with printing latin1/cp1252 characters in the "80" to "9F" code range. x <- c("?", "?", "?") Encoding(x) print(x) I assume that these are Unicode escapes!? (Given that Encoding(x) shows "latin1" I'd rather expect latin1/cp1252 escapes here, but
2017 Sep 14
2
special latin1 do not print as glyphs in current devel on windows
This is a follow-up on my initial posts regarding character encodings on Windows (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and Patrick Perry's reply (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in particular (thank you for the links and the bug report!). My initial posts were quite chaotic (and partly wrong), so I am trying to clear things up a
2009 Oct 13
2
Sweave output encoding in R-2.10.0beta on Windows (Rgui <-> Rterm)
Dear developers, I have come across a (somewhat strange) change in the encoding of Sweave output from R-2.9.2pat to R-2.10.0beta (apparently specific to Rgui) on Windows installations. Of course, the NEWS file contains quite a few changes concerning encoding, but I was not able to locate an entry which explains the observed behaviour. I am not very familiar with encodings/locales/codepages,
2017 Aug 01
2
special latin1 do not print as glyphs in current devel on windows
Thank you! My apologies again for not including the console output in my message before. I sent another e-mail with the output in the meantime, so it should be a bit clearer now what I am seeing. In case I missed something, please let me know. Yes, I am using latin1 and cp1252 interchangeably here, mostly because Encoding() is reporting the encoding as "latin1". You presumed correctly
2014 Oct 19
1
Writing UTF8 on Windows
Recent functionality in jsonlite allows for streaming JSON to a user-supplied connection object, such as a file, pipe or socket. RFC 7159 prescribes that JSON must be encoded as Unicode; ISO-8859 (including latin1) is invalid. Hence I would like R to write strings as UTF-8, irrespective of the type of connection, platform or locale. Implementing this turns out to be unsurprisingly difficult on Windows.
2009 Sep 30
2
R 2.9.2 crashes when sorting latin1-encoded strings
Hi everyone! I think I stumbled over a bug in the latest R 2.9.2 patched for OS X: > R version 2.9.2 Patched (2009-09-24 r49861) > i386-apple-darwin9.8.0 When I try to sort latin1-encoded character vectors, R sometimes crashes with a segmentation fault. I'm running OS X 10.5.8 and have observed this behaviour both with the i386 and x86_64 builds, in the R.app GUI as well as on
2006 May 22
2
How to execute time consuming code
Hello all, I have a screen scraping application (go to lots of sites, extract 10k stuff, integrate the results, put them in a DB etc). Now I want to use a Rails application as a frontend to this: The user can push a button which triggers the screen scraping app and view the results (preferably asynchronously, but that does not really matter right now). Questions: - Should the screen scraping app
2018 Jan 24
0
Newbie - Scrape Data From PDFs?
Hi Scott, I have never done this myself but I read something recently on the r-help distribution that was related. I just did a quick search and found a few hits that might work for you. 1. https://medium.com/@CharlesBordet/how-to-extract-and-clean-data-from-pdf-files-in-r-da11964e252e 2. http://bxhorn.com/2016/extract-data-tables-from-pdf-files-in-r/ 3.
2007 Aug 31
1
locales and readLines
R-developers, I'm looking for some 'best practices', or perhaps an upstream solution (I have a deja vu about this, so sorry if it's already been asked). Problems occur when a file is encoded as latin1, but the user has a UTF-8 locale (or I guess more generally when the input locale does not match R's). Here are two examples from the Bioconductor help list:
2008 Sep 07
1
Request for advice on character set conversions (those damn Excel files, again ...)
Dear list, I have to read a not-so-small bunch of not-so-small Excel files, which seem to have traversed Windows 3.1, Windows 95 and Windows NT versions of the thing (with maybe a Mac or two thrown in for good measure...). The problem is that 1) I need to read strings, and 2) those strings may have various encodings. In the same sheet of the same file, some cells may be latin1, some
2020 Jun 27
1
Error in substring: invalid multibyte string
Thanks for the quick response Ivan. readLines with encoding='latin1' works for me (on Ubuntu). However I was more concerned with the inconsistency in results between substr and regexpr. I was expecting that if one of them errors because of an unknown encoding then the other should as well. Even better, if regexpr works, why shouldn't substr work as well? Incidentally the analogous
2018 Jan 24
2
Newbie - Scrape Data From PDFs?
Hello, I'm new to R and am using it with RStudio to learn the language. I'm doing so as I have quite a lot of traffic data I would like to explore. My problem is that all the data is located on a number of PDFs. Can someone point me to info on gathering data from other sources? I've been to the R FAQ and didn't see anything and would appreciate your thoughts. I am quite sure now that often,
2009 Jul 21
0
sampling randomly from general correlated multivariate PDFs
(apologies if this looks like a re-post, I just sent a similar message to the r-help mail list. This version is via Nabble.) My intended application is error propagation using the ISO GUM Supplement 1 approach (propagation of distributions using Monte Carlo strategies). To automate uncertainty analysis I typically have the following data: (1) a measurement function y(x1,x2,...xn) (2) 'n'
2005 Oct 31
2
Sweave (R?) font encoding problems
Dear R list, I'm having some problems with font encodings when using R+Sweave+Latex in my native language: Portuguese. My environment: Kubuntu 5.10 Linux $> uname -a Linux nassa 2.6.12-9-686 #1 Mon Oct 10 13:25:32 BST 2005 i686 GNU/Linux R> R.version _ platform i486-pc-linux-gnu arch i486 os linux-gnu system i486, linux-gnu
2020 Jun 03
2
Re: [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
On Wed, May 13, 2020 at 10:06 PM Richard W.M. Jones <rjones@redhat.com> wrote: > > On Sun, Apr 26, 2020 at 09:14:03PM +0300, Sam Eiderman wrote: > > The python3 bindings create PyUnicode objects from application strings > > on the guest (i.e. installed rpm, deb packages). > > It is documented that rpm package fields such as description should be > > utf8 encoded
2020 Jun 30
1
`basename` and `dirname` change the encoding to "UTF-8"
On 6/29/20 4:39 PM, Johannes Rauh wrote: > Dear R Developers, > > I noticed that `basename` and `dirname` always return "UTF-8" on Windows (tested with R-4.0.0 and R-3.6.3): > >> p <- "F??/B?r" >> Encoding(p) > [1] "latin1" >> Encoding(dirname(p)) > [1] "UTF-8" >> Encoding(basename(p)) > [1] "UTF-8"