Hi All,
I have been trying to do OCR within R (reading PDF data that is stored as a
scanned image). I have been reading about this at
http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
This is a very good post.
Effectively 3 steps:
convert pdf to ppm (an image format)
convert ppm to tif ready for tesseract (using ImageMagick for convert)
convert tif to text file
The code for the above 3 steps, as per the linked post:
lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i,
                       " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ",
                       i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i,
                       ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif"))
})
The first two steps work fine (although they take a good amount of time for
4 pages of a PDF; I will look into scalability later, once I know whether
this works at all).
While running the 3rd step, i.e.
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
I get this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Or
Tesseract is crashing.
Any workaround or root cause analysis would be appreciated.
Regards,
Anshuk Pal Chaudhuri
This code is using R like a command shell... there really is not much chance
that R is the problem, and this is not a "tesseract" support forum, so
this seems quite off-topic.
---------------------------------------------------------------------------
Jeff Newmiller
DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
On August 12, 2015 10:05:19 PM PDT, Anshuk Pal Chaudhuri <anshuk.p at motivitylabs.com> wrote:
> [...]
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
On 13/08/2015 1:29 AM, Jeff Newmiller wrote:
> This code is using R like a command shell... there really is not much
> chance that R is the problem, and this is not a "tesseract" support
> forum, so this seems quite off-topic.

I would have guessed the same, but the error message looks like an R
message. But I can't see anything very different in the 3rd step compared
to the first, so I don't know what would be going on.

The use of shQuote looks wrong: Anshuk probably doesn't want to quote the
whole command expression, just parts of it that may cause problems. And
the docs do recommend using system2() rather than shell(). But I don't
think either of those things should have caused that error.

Duncan Murdoch
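To illustrate the two suggestions above (quote only the parts that need it, and prefer system2() to shell()), here is a minimal sketch of the third step. It is not a tested fix for the reported error. The install path is copied from the original post, the helper names tesseract_args() and run_tesseract() are made up for this example, and "my scan.pdf" is a hypothetical file name chosen because it contains a space.

```r
# Install location copied from the original post; adjust for your machine.
tesseract_exe <- "F:/Tesseract-OCR/tesseract.exe"

# Hypothetical helper: build the argument vector for one input file.
# Only the file paths are passed through shQuote(); the flags are left alone.
tesseract_args <- function(pdf, lang = "eng") {
  c(shQuote(paste0(pdf, ".tif")),  # input image produced by the convert step
    shQuote(pdf),                  # output base name (tesseract appends .txt)
    "-l", lang)
}

# Hypothetical helper: run tesseract through system2() and check the result.
run_tesseract <- function(pdf, exe = tesseract_exe) {
  if (!file.exists(exe)) {
    message("tesseract not found at ", exe, "; skipping")
    return(invisible(NA_integer_))
  }
  # system2() returns the exit status of the command; 0 means success
  status <- system2(exe, args = tesseract_args(pdf))
  if (status != 0) warning("tesseract exited with status ", status)
  invisible(status)
}

# For contrast: quoting the whole command turns it into one single token,
# which is what the code in the original post does.
whole_command <- shQuote(paste0(tesseract_exe, " my scan.pdf.tif my scan.pdf -l eng"))

# Quoting per argument keeps four separate arguments.
example_args <- tesseract_args("my scan.pdf")
print(whole_command)
print(example_args)
```

The quote characters that shQuote() adds depend on the platform (its default type differs between Windows and Unix-alikes), so the printed strings will not look the same everywhere. Inside the lapply() loop from the original post, the third shell() call would become run_tesseract(i), and a non-zero status would then point at tesseract rather than at R.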