thr3ads.net - R help - [R] Extracting the first currency value from PDF files [May 2020]

Manish Mukherjee

2020-May-13 13:33 UTC

[R] Extracting the first currency value from PDF files

Hi All,

I need some help with the following code. I have a number of PDF files, and the
first page of each file gives a currency value $xxx,xxx,xxx. How can I extract
this value from a number of PDF files and put it in a data frame? I am able to
do it for a single file with the code below, where opinions is the text data
and 1 selects the first currency value:
```
d=str_nth_currency(opinions, 1)
df = subset(d, select = c(amount) )
df
```

I want this to loop over multiple PDF files.

I have tried something like this, but it is not working:
```
for (i in 1:length(files)){
  print(i)
  pdf_text(paste("filepath ", files[i],sep = ""))
  str_nth_currency(files[i], 1)
}
```

Please help.

Jeff Newmiller

2020-May-13 13:44 UTC

PDF files are actually "programs" that place graphic symbols on pages,
and the order in which those symbols are placed (the order in which most
pdf-to-text conversions return characters) may have nothing to do with how they
appear visually. There is not even a guarantee that those symbols are
represented as characters in the file... they could be part of embedded bitmaps.

In summary, you need to review what your "pdf_text" function is able
to extract from your files without filtering... it may or may not be consistent
enough to allow you to do what you want... and we certainly have no idea what it
is able to extract from your files.
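
A quick way to do that review is to print the unfiltered text of the first page and look at the lines that contain a dollar sign. The following is only a minimal sketch: it assumes pdf_text() is pdftools::pdf_text() (the thread does not say which package it comes from), and the path and the helper name dollar_lines are hypothetical.

```r
# Split one page of extracted text into lines and keep those containing "$".
dollar_lines <- function(page_text) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  trimws(lines[grepl("$", lines, fixed = TRUE)])
}

path <- "path/to/file.pdf"  # hypothetical path, replace with a real file
if (file.exists(path)) {
  # Assumption: pdf_text() is pdftools::pdf_text(), one string per page
  pages <- pdftools::pdf_text(path)
  cat(pages[1])                  # raw first page, exactly as extracted
  print(dollar_lines(pages[1]))  # candidate lines holding the value
}
```

If the amount does not show up in this output, or shows up in a different place from file to file, no later filtering step will find it reliably.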

John Kane

2020-May-13 14:04 UTC

It looks like you are using the str_nth_currency() function from the strex
package, but we have no idea what the PDF files are or how you are
importing them into R.
We need a lot more information on what you are doing "before" you use the
function.

Have a look at
http://adv-r.had.co.nz/Reproducibility.html
or
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
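
For reference, the loop in the original post throws away the result of pdf_text() and then passes the file name, not the extracted text, to str_nth_currency(); nothing is stored either. A minimal sketch of a corrected version follows. It assumes that pdf_text() is pdftools::pdf_text(), that str_nth_currency() is the strex function and returns a numeric amount column, and that folder is a placeholder directory. The name first_amounts and its two function arguments are illustrative, not part of either package.

```r
# Read page 1 of each PDF and collect the first currency amount per file.
# Assumptions: pdftools::pdf_text() returns one string per page, and
# strex::str_nth_currency() returns a data frame with a numeric `amount`.
first_amounts <- function(files,
                          read_pages = pdftools::pdf_text,
                          nth_currency = strex::str_nth_currency) {
  amounts <- vapply(files, function(f) {
    first_page <- read_pages(f)[1]         # text of the first page only
    nth_currency(first_page, 1)$amount[1]  # first currency amount on it
  }, numeric(1))
  data.frame(file = basename(files), amount = unname(amounts),
             stringsAsFactors = FALSE)
}

folder <- "path/to/pdfs"  # hypothetical folder
files <- list.files(folder, pattern = "\\.pdf$", full.names = TRUE)
if (length(files) > 0) print(first_amounts(files))
```

It is worth checking the result against a few files by hand, in particular how values with thousands separators such as $xxx,xxx,xxx come out.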



-- 
John Kane
Kingston ON Canada

Rasmus Liland

2020-May-13 14:17 UTC

On 2020-05-13 06:44 -0700, Jeff Newmiller wrote:
> On May 13, 2020 6:33:03 AM PDT, Manish Mukherjee wrote:
> >
> > How to extract this value from a number
> > of PDF files and put it in a data frame.
>
> they could be part of embedded bitmaps.

Dear Manish and Jeff,

I recently found the programs pdftoppm [1] 
and Google tesseract [2] to be really useful 
when reading text from pdfs formatted as "a 
single column of text of variable sizes", 
e.g. a receipt from a grocery store :)

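folder <- "path/to/pdfs"
pdfs <- list.files(folder, pattern = "\\.pdf$")
pdf <- pdfs[1]
# Render the pages to PNG at 500 dpi, then OCR the first page image
# ("-l nor": Norwegian language data; "--psm 4": single column of text)
cmd <-
  paste0("pdftoppm -png -r 500 ",
         shQuote(file.path(folder, pdf)), " /tmp/out && ",
         "tesseract /tmp/out-1.png - ",
         "-l nor --psm 4")
lines <- system(cmd, intern=TRUE)
# To process every file, build one such command per pdf (cmds), then:
# x <- lapply(cmds, system, intern=TRUE)
# names(x) <- pdfs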
# saveRDS(x, "texts.rds")
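
Once the OCR output is in lines, the value asked for in the original question can be pulled out with a base R regular expression. This is a sketch of an alternative to strex's str_nth_currency(), not the method used in the thread. The function name first_dollar_amount and the pattern are illustrative assumptions, and the pattern only covers dollar amounts with optional thousands separators and decimals.

```r
# Return the first "$1,234,567"-style amount found in a character vector of
# text lines as a number, or NA if no line contains one.
first_dollar_amount <- function(lines) {
  pattern <- "\\$[0-9]+(,[0-9]{3})*(\\.[0-9]+)?"
  hits <- regmatches(lines, regexpr(pattern, lines))  # first match per line
  if (length(hits) == 0) return(NA_real_)
  as.numeric(gsub("[$,]", "", hits[1]))  # drop "$" and "," before converting
}
```

With the OCR result above, the call would be first_dollar_amount(lines).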

In any other case with a sensibly formatted 
pdf, I would have used pdftotext [3] ...
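
A minimal sketch of that pdftotext route from R, assuming the poppler-utils pdftotext binary is on the PATH. The helper names pdftotext_args and first_page_text are hypothetical, and the folder is the same placeholder as in the snippet above.

```r
# Arguments for pdftotext: first page only (-f 1 -l 1), keep the physical
# layout (-layout), and write the text to stdout ("-").
pdftotext_args <- function(pdf) {
  c("-f", "1", "-l", "1", "-layout", shQuote(pdf), "-")
}

# Run pdftotext and return the first page as a character vector of lines.
first_page_text <- function(pdf) {
  system2("pdftotext", pdftotext_args(pdf), stdout = TRUE)
}

folder <- "path/to/pdfs"  # placeholder folder
pdfs <- list.files(folder, pattern = "\\.pdf$", full.names = TRUE)
if (length(pdfs) > 0 && nzchar(Sys.which("pdftotext"))) {
  texts <- lapply(pdfs, first_page_text)
  names(texts) <- basename(pdfs)
}
```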

Best,
Rasmus

[1] https://manpages.debian.org/buster/poppler-utils/pdftoppm.1.en.html
[2] https://manpages.debian.org/buster/tesseract-ocr/tesseract.1.en.html
[3] https://manpages.debian.org/buster/poppler-utils/pdftotext.1.en.html
