thr3ads.net - R help - [R] parsing pdf files [Jan 2010]


David Kane

2010-Jan-09 13:11 UTC

[R] parsing pdf files

I have a pdf file that I would like to parse into R:

http://www.williams.edu/Registrar/geninfo/faculty.pdf

For now, I open the file in Acrobat by hand, then save it "as text"
and then use readLines(). That works fine but a) I am concerned that
some information may be lost and b) I may be doing this a lot, so I
would rather have R grab the information from the pdf file directly.

So: is there something like readPDF() for R?

Thanks,

Dave Kane

PS. If you're curious, here is the sort of work that I want to do with
this data:
http://www.ephblog.com/2010/01/08/class-update-and-faculty-ages/

Barry Rowlingson

2010-Jan-09 13:47 UTC


[R] parsing pdf files

On Sat, Jan 9, 2010 at 1:11 PM, David Kane <dave at kanecap.com> wrote:
> I have a pdf file that I would like to parse into R:
>
> http://www.williams.edu/Registrar/geninfo/faculty.pdf
>
> For now, I open the file in Acrobat by hand, then save it "as text"
> and then use readLines(). That works fine but a) I am concerned that
> some information may be lost and b) I may be doing this a lot, so I
> would rather have R grab the information from the pdf file directly.
>
> So: is there something like readPDF() for R?
 What could it do that saving as text from Acrobat couldn't do? Here's
the problem - PDF is a page description format, it's not designed to
be read back. There's no guarantee that the letters on the page appear
in the PDF in the same order as they seem on the page. The page could
have all the letter 'a's, then the 'b's and so on, positioned in their
right places to make up words. To reconstruct the words you'd have to
spot where the letters were being placed, and then figure out the
breaks and make up the words. Good luck making the sentences.
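
To make the reconstruction problem concrete, here is a minimal sketch in R. It assumes a hypothetical table of glyphs (one row per character, with x/y positions and a fixed character width of 10 units) stored letter by letter rather than in reading order. Real PDFs do not hand you such a table directly; the data and the helper names rebuild_line() and rebuild_text() are invented for illustration.

```r
# Hypothetical glyph records: all the 'a's first, then the 'b's, then 'c'.
glyphs <- data.frame(
  ch = c("a", "a", "a", "b", "b", "b", "c"),
  x  = c(0, 30, 60, 10, 20, 70, 50),
  y  = rep(100, 7),
  stringsAsFactors = FALSE
)

# Rebuild one line: sort glyphs left to right and split into words
# wherever the horizontal jump is clearly wider than one character.
rebuild_line <- function(g, char_width = 10) {
  g <- g[order(g$x), ]
  gap <- c(0, diff(g$x))
  word_id <- cumsum(gap > char_width * 1.5)
  paste(tapply(g$ch, word_id, paste, collapse = ""), collapse = " ")
}

# Rebuild all lines: PDF y coordinates grow upwards, so splitting on -y
# puts the topmost line first.
rebuild_text <- function(g, char_width = 10) {
  lines <- split(g, -g$y)
  unname(vapply(lines, rebuild_line, character(1), char_width = char_width))
}

rebuild_text(glyphs)
```

Even this toy version has to guess a word-break threshold; proportional fonts, columns and rotated text make the guess much harder.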

 Most PDFs aren't that perverse, and you can often get sensible text
out of them. But then you run into font encodings and graphics and
column layouts and stuff. Any effort put into writing a readPDF()
would have to be redone every time someone tried to read a PDF :)

 On Linux/Unix there's a bunch of command line tools for trying to do
this kind of thing with PDF files - see pdftotext for example. You
could run that from R with system() and then read the text with
readLines. But there is absolutely no guarantee this will work.
Windows/Mac versions (did you say what your platform was?) of the
command line tools may be available.
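
A minimal sketch of that approach, assuming pdftotext (from Xpdf or Poppler) is installed and on the PATH. The wrapper name read_pdf_text() is hypothetical, and system2() is used here in place of system() only because it takes the arguments as a vector.

```r
# Sketch: convert a PDF to text with the external pdftotext tool and
# read the result into R. Stops with an error if the tool is missing
# or the conversion fails.
read_pdf_text <- function(pdf_file, layout = TRUE) {
  exe <- Sys.which("pdftotext")
  if (!nzchar(exe)) stop("pdftotext not found on the PATH")
  txt_file <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_file))
  # -layout asks pdftotext to keep the physical layout of the page
  args <- c(if (layout) "-layout", shQuote(pdf_file), shQuote(txt_file))
  status <- system2(exe, args)
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt_file, warn = FALSE)
}
```

With a local copy of the file, the call would be something like read_pdf_text("faculty.pdf"); whether the resulting lines are usable still depends on how the PDF was produced.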

 The real answer is to get the original data in a format with some
kind of semantics that R could read, for example a CSV or some nice
XML format.

Barry

-- 
blog: http://geospaced.blogspot.com/
web: http://www.maths.lancs.ac.uk/~rowlings
web: http://www.rowlingson.com/
twitter: http://twitter.com/geospacedman
pics: http://www.flickr.com/photos/spacedman

Laurent Rhelp

2010-Jan-09 14:42 UTC


[R] parsing pdf files

David Kane wrote:
>I have a pdf file that I would like to parse into R:
>
>http://www.williams.edu/Registrar/geninfo/faculty.pdf
>
>For now, I open the file in Acrobat by hand, then save it "as text"
>and then use readLines(). That works fine but a) I am concerned that
>some information may be lost and b) I may be doing this a lot, so I
>would rather have R grab the information from the pdf file directly.
>
>So: is there something like readPDF() for R?
>
>Thanks,
>
>Dave Kane
>
>PS. If you're curious, here is the sort of work that I want to do with
>this data:
>http://www.ephblog.com/2010/01/08/class-update-and-faculty-ages/
>
Do you know this site?

http://www.accesspdf.com/pdftk/

There may be a command-line option to convert the pdf file to XML
format; you could then read the XML file with R.

Mark Wardle

2010-Jan-10 11:11 UTC


[R] parsing pdf files

If you can use an R <-> Java interface, you could use iText to do this
as long as the PDF is fairly sane.

see http://itextpdf.com/

It is what pdftk uses.

b/w

Mark



-- 
Dr. Mark Wardle
Specialist registrar, Neurology
Cardiff, UK

John Maindonald

2010-Jan-10 14:51 UTC


[R] parsing pdf files

Oblivious to the problems that Barry notes, I have used pdftotext,
from Xpdf at http://www.foolabs.com/xpdf/download.html
without apparent problem; this under MacOS X.  For my purposes,
I need to retain the form feeds (Ctrl-L, "\014") that indicate page
throws.  Other converters that I have investigated seem not to retain
such information.  I use it with command line options thus:

pdftotext -layout -eol unix rnotes.pdf rnotes.txt
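
As a small illustration of why the form feeds matter, page numbers can be recovered from the text file by counting them. This is only a sketch: the lines in tx are made up, and it assumes the form feed sits on the first line of each new page.

```r
# Hypothetical lines, as readLines() might return them from pdftotext
# output; "\f" is the same character as "\014".
tx <- c("Title page", "\fChapter 1", "uses lm() and glm()", "\fChapter 2")

# Each form feed starts a new page, so a running count gives the page
# number of every line.
page <- cumsum(grepl("\f", tx, fixed = TRUE)) + 1

data.frame(page = page, text = sub("\f", "", tx, fixed = TRUE))
```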

Given a pdf for a manuscript,  I have an R function that can then
create an index of functions that appear with their opening and 
closing parentheses.  (It may pick up a few strays, which can be
weeded out.)

The only issue I've found has been for use with the listings package.
In the LaTeX source, I set the option 'columns' to 'fullflexible', in
order to generate a pdf file that does not have the unwanted hidden
spaces, which are then carried across to the text file.

{\lstset{language=R, xleftmargin=2pt,
         basicstyle=\ttfamily,
         columns=fullflexible,           % Omit for final manuscript
         showstringspaces=false}}
{}

The setting 'columns=fullflexible'  messes up the formatting
in places where I use tabbing, so that I need to change
'columns' back to the default for the final pdf.

Here is the function that does most of the work. Like the function
that follows, it could no doubt be greatly tidied up:

locatefun <-
function(txlines){
        idtxt <- "\\.?[a-zA-Z][a-zA-Z0-9]*(\\.[a-zA-Z]+)*\\("
        z <- regexpr("\014", txlines)
        z[z>0] <- 1
        z[z<=0] <- 0
        page <- cumsum(z)+1
        k <- 0
        findfun <- function(tx){
            mn <- t(sapply(tx, function(x)
                       {m <- regexpr(idtxt, x); 
                        c(m, attr(m, "match.length"))}))
            mn[,2] <- mn[,1]+mn[,2]
            rownames(mn) <- paste(1:dim(mn)[1])
            mn
        }
        for(i in 1:100){
            mn <- findfun(txlines)
            if(all(mn[,1]==-1))break
            here <- mn[,1]>0
            page <- page[here]
            txlines <- txlines[here]
            mn <- mn[here, , drop=FALSE]
            m1 <- regexpr("\\(", txlines)-1
            tx1 <- substring(txlines,mn[,1],m1)
            if(i==1)xy <- data.frame(nam=I(tx1), page=page) else
            xy <- rbind(xy, data.frame(nam=I(tx1), page=page))
            txlines <- substring(txlines,mn[,2])
            here2 <- nchar(txlines)>0
            txlines <- txlines[here2]
            page <- page[here2]
            if(length(txlines)==0)break
        }
        zz <- !xy[,1]%in%c("","al","Pr","T", "F", "n","P", "y", "A",
                            "transformation", "left","f","site.","a","b",
                            "II", "ARCH", "ARMA", "MA")
        xy <- xy[zz,]
        nam <- xy$nam
        ch <- substring(nam,1,1)
        nam[ch%in%c("="," ",",")] <-
            substring(nam[ch%in%c("="," ",",")],2)
        xy$nam <- nam
        ord <- order(xy[,2])
        xy[ord,]
    }
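
For comparison, the core idea (find names followed by an opening parenthesis and record the page they occur on) can be sketched more compactly with gregexpr() and regmatches(). This is an alternative illustration, not a drop-in replacement for locatefun: it reuses the same pattern but skips the stray-name filtering, and the input lines are invented.

```r
# Same pattern as in locatefun: an identifier followed by "(".
idtxt <- "\\.?[a-zA-Z][a-zA-Z0-9]*(\\.[a-zA-Z]+)*\\("

# Hypothetical text lines; "\f" marks the start of page 2.
tx <- c("Fit with lm() then call summary.lm(fit)",
        "no calls here",
        "\fUse read.table(file)")
page <- cumsum(grepl("\f", tx, fixed = TRUE)) + 1

# All matches per line; lines without a match give character(0).
hits <- regmatches(tx, gregexpr(idtxt, tx))

idx <- data.frame(
  nam  = sub("\\($", "", unlist(hits)),   # drop the trailing "("
  page = rep(page, lengths(hits)),        # repeat page once per match
  stringsAsFactors = FALSE
)
idx
```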

Here is the function that calls locatefun:

 makeFunIndex <-
function(sourceFile="rnotes.txt",
             frompath="~/_notes/rnotes/", fileout=NULL,
             availfun=funpack,
             offset=0){
        ## pdftotext -layout -eol unix rnotes.pdf rnotes.txt
        len <- nchar(sourceFile)
        lfrom <- nchar(frompath)
        if(substring(frompath, lfrom, lfrom)=="/")frompath <-
            substring(frompath, 1, lfrom-1)
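        ## Strip a three-letter extension (e.g. ".txt") from the file name
        if (substring(sourceFile,len - 3, len-3) == ".")
            fnam <- substring(sourceFile, 1, len - 4) else fnam <- sourceFile
        ## Set the file names outside the is.null() test, so that fdxfile
        ## and fndfile are also defined when fileout is supplied
        if(is.null(fileout)) fileout <- paste(fnam, ".fdx", sep = "")
        fdxfile <- fileout
        fndfile <- paste(fnam, ".fnd", sep = "")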
        sourceFile <- paste(frompath, sourceFile, sep="/")
        print(paste("Send output to", fndfile))
        tx <- readLines(sourceFile, warn=FALSE)
        entrymat <- locatefun(tx)
        backn <- regexpr("\\n",entrymat[,1],fixed=TRUE)
        entrymat <- entrymat[backn < 0,]
        entrymat[,2] <- entrymat[,2] - offset
        entrymat[,1] <- gsub("_","\\_",entrymat[,1], fixed=TRUE)
        nmatch <- match(entrymat[,1], availfun[,2], nomatch=0)
        use <- nmatch > 0
        print("Unmatched functions:")
        print(unique(entrymat[!use,1]))
        entrymat[use,1] <- paste(entrymat[use,1], " ({\\em ",
                                 availfun[nmatch,1], "})", sep="")
        funentries <- paste("\\indexentry  ", "{", entrymat[,1],"}{",
                            entrymat[,2], "}",sep="")
        write(funentries, fdxfile)
        system(paste("makeindex -o", fndfile, fdxfile))
    }
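
The last steps of the function build one \indexentry line per function name for makeindex. A stripped-down sketch of just that step, with a made-up two-row table standing in for the output of locatefun:

```r
# Hypothetical index data: function names and the pages they appear on.
entries <- data.frame(nam = c("read_csv", "lm"), page = c(3, 12),
                      stringsAsFactors = FALSE)

# Escape underscores for LaTeX; with fixed = TRUE the replacement is
# taken literally, giving a backslash followed by an underscore.
key <- gsub("_", "\\_", entries$nam, fixed = TRUE)

# One \indexentry{key}{page} line per row.
funentries <- paste0("\\indexentry{", key, "}{", entries$page, "}")
writeLines(funentries)
```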

John Maindonald             email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm

