Tony Breyal
2008-Nov-13 15:10 UTC
[R] readPDF() -- unsure how to install xpdf to make this work?
Dear R-Help,

I need to convert a set of '.pdf' files into an equivalent set of '.txt' files, so that I can do some text mining on the content.

The latest issue of R News (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf) mentions the package 'tm' for text mining. In that lovely package there is a function called 'readPDF()'. In order to use it, ?readPDF says:

    "Note that this PDF reader needs both the tools pdftotext and pdfinfo installed and accessable on your system."

These tools are available from http://www.foolabs.com/xpdf/download.html

I am able to download them and use them easily from a DOS window to convert a PDF file into a text file.

Question: how do I make these tools available to R, so that I can use the readPDF() function?

Thank you in advance for any help, and I hope the above made sense.

Tony Breyal

### OS = Windows Vista Ultimate

> sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] tm_0.3-1 XML_1.98-1 Snowball_0.0-3 RWeka_0.3-14 rJava_0.6-0 Matrix_0.999375-16 lattice_0.17-15 filehash_2.0

loaded via a namespace (and not attached):
[1] proxy_0.4-1
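The question above comes down to whether the R session can find pdftotext and pdfinfo on its search path. One way to check this, and to extend the path for the current session if needed, might look like the sketch below. It assumes a reasonably recent R in which Sys.which() is available, and the folder name `xpdf_dir` is purely hypothetical: replace it with wherever the xpdf binaries were actually unpacked. This is not taken from the tm documentation, only a generic way to make command-line tools visible to R.

```r
# Sketch: check whether this R session can see the xpdf command-line tools.
tools <- c("pdftotext", "pdfinfo")

# Sys.which() returns a named character vector: the full path of each tool,
# or an empty string when the tool cannot be found on the PATH.
found <- Sys.which(tools)
print(found)

# ASSUMPTION: hypothetical install folder; change it to the real location.
xpdf_dir <- "C:\\xpdf"

if (any(!nzchar(found)) && file.exists(xpdf_dir)) {
  # Prepend the folder to PATH for this R session only
  # (.Platform$path.sep is ";" on Windows and ":" elsewhere).
  Sys.setenv(PATH = paste(xpdf_dir, Sys.getenv("PATH"),
                          sep = .Platform$path.sep))
  print(Sys.which(tools))
}
```

A change made with Sys.setenv() lasts only until R is closed. Adding the folder to the Windows PATH environment variable and restarting R would make it permanent.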
clair.crossupton at googlemail.com
2008-Nov-15 18:14 UTC
[R] readPDF() -- unsure how to install xpdf to make this work?
Hello,

I was just wondering if you had found a solution? I am having the same difficulty converting PDFs into plain text documents in R. I originally thought I could use the readLines() function, but as you can see below, that did not work.

R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
R> download.file(url = my.url, destfile = my.destfile, mode = 'wb')
R> txt <- readLines(my.destfile)
R> txt
[1] "%PDF-1.4"
[2] "%????"
[3] "1 0 obj <<"
[4] "/Length 587 "
[5] "/Filter /FlateDecode"
[6] ">>"
[7] "stream"
[8] "x?mTM??@\020??+z\017&????\024tBL\020$???d4??*?.?\002\001<???_???f \017?W?_w???r??c;???`G?U?O?V?&??????\006????\027[v???6?W?7??T??vb \030??uYt/N?.??5??????=\025?S?<b???G??"

Warm Regards,
Clair
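The readLines() output above is what one would expect: a PDF is a binary format, and the "/Filter /FlateDecode" line shows that the page content is stored as a compressed stream, so reading the file as text returns raw bytes instead of words. Since pdftotext already works from the command line, one option is to call it from R and read the resulting text file. The following is a minimal sketch, not the method used by readPDF(). The helper name `pdf_to_text` is hypothetical, and it assumes pdftotext is on the PATH and that system2() is available (it is in current versions of R).

```r
# Sketch: convert one PDF to text by calling the pdftotext tool from R.
# 'pdf_to_text' is a hypothetical helper name, not part of any package.
pdf_to_text <- function(pdf_file,
                        txt_file = sub("\\.pdf$", ".txt", pdf_file,
                                       ignore.case = TRUE)) {
  # Fail early if R cannot find the tool.
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  # pdftotext takes the input PDF followed by the output text file.
  # shQuote() protects paths that contain spaces.
  status <- system2("pdftotext",
                    args = c(shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0) {
    stop("pdftotext failed with exit status ", status)
  }
  # Return the extracted text, one element per line.
  readLines(txt_file, warn = FALSE)
}

# Usage (uncomment and point at a real file):
# txt <- pdf_to_text("C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf")
# head(txt)
```

To convert a whole folder, the same helper could be applied to each file returned by list.files() with a "\\.pdf$" pattern.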