I need to do text mining on PDF files. I understand that the tm package has a readPDF function that can be used for this. I have read the 2008 posts by Tony Breyal and others on converting PDF files to text. Has the procedure since been standardized, in a tutorial or otherwise? Being new to R, I was able to follow only part of that discussion. Is there a set of step-by-step instructions appropriate for my level? I am an ageing academic who has worked mostly with SAS and MATLAB.
Hello:

Apart from readPDF in the tm package, you can use the PDF-to-text converter available on Linux, "pdftotext". Say "file.pdf" is your file; from R you'd use:

    system("pdftotext -layout file.pdf")

This invokes the pdftotext command from within R and creates a file called "file.txt" containing the converted text, which you then have to read into R. The -layout option keeps the text as close as possible to the original layout of the PDF file.

Regards,
jose loreto romero palma
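For the step of reading the converted file back into R, here is a minimal sketch. It assumes pdftotext is installed and on the PATH; the function name pdf_to_text and the file name "file.pdf" are only illustrative and do not come from any package.

```r
# Hypothetical helper: convert a PDF with the external pdftotext tool,
# then read the resulting text file into R as a character vector.
pdf_to_text <- function(pdf_path, layout = TRUE) {
  # Stop early if the pdftotext program cannot be found on the PATH
  if (Sys.which("pdftotext") == "") stop("pdftotext not found on PATH")
  if (!file.exists(pdf_path)) stop("No such file: ", pdf_path)

  # Name the output explicitly: same base name, .txt extension
  txt_path <- paste0(tools::file_path_sans_ext(pdf_path), ".txt")

  # Build the argument list; -layout is included only when requested
  args <- c(if (layout) "-layout", shQuote(pdf_path), shQuote(txt_path))
  status <- system2("pdftotext", args)
  if (status != 0) stop("pdftotext failed with exit status ", status)

  # One element per line of the converted text
  readLines(txt_path, warn = FALSE)
}
```

With a file called "file.pdf" in the working directory, txt <- pdf_to_text("file.pdf") should give a character vector with one element per line, and head(txt) shows the first few lines. system2 is used here instead of system so that file names containing spaces can be quoted safely.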