Cecília Carmo
2023-Jul-05 09:14 UTC
[R] textual analysis - transforming several pdf to txt - naming the files
I am taking my first steps in textual analysis with R.
I have pdf files consisting of company reports for several years (1 file corresponds to 1 company and 1 year).
My idea is to start by transforming all my pdf files into txt files for further treatment and analysis (this will allow me to group the files by company or by year, depending on the future analysis to be performed).
I do not have in-depth knowledge of programming in R. I just adapt code that I find to my needs. Here is my first question about some code I'm adapting:

My pdf files are in one directory named "pdfs". The names of my files are, for example, SONAE2020FS.pdf, EDP2021GS.pdf
I want to convert them to txt and give them the same names as the pdf files: SONAE2020FS.txt, EDP2021GS.txt
I'm running the following script, but the names of the txt files that I obtain are: pdftext1, pdftext2, pdftext3...
What do I need to change?
Thank you very much,

Cecília Carmo
Universidade de Aveiro - Portugal

    dirpath <- ("/Users/ceciliacarmo/documents/RTextualAnalysis/data/pdfs")

    library(pdftools)
    library(dplyr)

    convertpdf2txt <- function(dirpath){
      files <- list.files(dirpath, full.names = T)
      x <- sapply(files, function(x){
        x <- pdftools::pdf_text(x) %>%
          paste0(collapse = " ") %>%
          stringr::str_squish()
        return(x)
      })
    }

    # apply function
    txts <- convertpdf2txt(here::here("data", "pdf/"))

    # add names to txt files
    names(txts) <- paste0(here::here("data","pdftext"), 1:length(txts), sep = "")
Rui Barradas
2023-Jul-05 09:57 UTC
[R] textual analysis - transforming several pdf to txt - naming the files
Hello,

Try the following.
The corrected function convertpdf2txt assigns names based on the files variable. It uses tools::file_path_sans_ext to strip the extension from each file name and pastes the new extension, "txt", onto the result. Afterwards there is no need to call here::here again to build the names: the function already returns a named list.

    library(dplyr)  # provides the %>% pipe

    convertpdf2txt <- function(dirpath){
      # note: "Consoli.*\\.pdf$" only matches names containing "Consoli";
      # use "\\.pdf$" to match every pdf file
      files <- list.files(dirpath, pattern = "Consoli.*\\.pdf$", full.names = TRUE)
      # use forward slashes on every platform
      files <- chartr("\\", "/", files)
      x <- lapply(files, function(x){
        pdftools::pdf_text(x) %>%
          paste0(collapse = " ") %>%
          stringr::str_squish()
      })
      # replace the extension: path/name.pdf -> path/name.txt
      new_names <- tools::file_path_sans_ext(files)
      new_names <- paste(new_names, "txt", sep = ".")
      setNames(x, new_names)
    }

    # apply function
    # note that my test files are in "~/Temp"
    txts <- convertpdf2txt(here::here("~", "Temp"))
    names(txts)

Hope this helps,

Rui Barradas
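The function above returns the texts as a named list but does not write anything to disk. A minimal sketch of that last step follows, assuming the list names are file paths ending in .pdf or .txt. The helper name write_txts, its outdir argument, and the demo list are hypothetical and not part of the thread; only base R and the tools package are used.

```r
# Sketch: write each element of a named list of texts to "<name>.txt".
# `write_txts` and `outdir` are hypothetical names used for illustration.
write_txts <- function(txts, outdir) {
  # Create the output directory if it does not exist yet
  dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
  # Drop the directory and the extension from each name, then add ".txt"
  stems <- tools::file_path_sans_ext(basename(names(txts)))
  out <- file.path(outdir, paste0(stems, ".txt"))
  for (i in seq_along(txts)) {
    writeLines(txts[[i]], out[i])
  }
  # Return the written paths invisibly so they can be inspected
  invisible(out)
}

# Stand-in for the named list returned by convertpdf2txt()
demo <- setNames(
  list("first report text", "second report text"),
  c("/some/dir/SONAE2020FS.txt", "/some/dir/EDP2021GS.txt")
)
written <- write_txts(demo, file.path(tempdir(), "txts"))
basename(written)
```

With the real data, the call would be something like write_txts(txts, "data/txts"), where the output directory is again an assumption. Using basename keeps the txt files out of the pdf directory while preserving the company and year in each file name.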