search for: pdftotext

Displaying 20 results from an estimated 40 matches for "pdftotext".

2019 Dec 15
1
pdftotext latest version for CentOS 7
I have pdftotext 0.26.5, the current version for CentOS 7 and the Mate desktop as far as I can ascertain. The page https://www.xpdfreader.com/pdftotext-man.html seems to suggest that the latest version is 4.02 which seems a gigantic leap ahead. Since I have a Chinese text PDF which I am unable to extract any text...
2013 Feb 27
2
Reading a password-protected PDF
Hello respected developers, I was wondering if it is possible for xapian to read a password-protected PDF. Searches of the archives and Google yielded 0 results. I also tried looking at the source code but I could not find the part related to this issue. The characteristics of the set of PDFs are: 1. a set of password-protected PDF documents 2. all PDFs are set with the same password. 3.
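The question above concerns text extraction from PDFs that share one password. A minimal sketch follows, assuming the pdftotext command-line tool is installed and that its `-upw` (user password) option is used; the function names `build_pdftotext_cmd` and `extract_text` are hypothetical, and nothing here is part of xapian itself.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, user_password=None, out="-"):
    """Return the argv list for pdftotext (hypothetical helper).

    "-" as the output name asks pdftotext to write to stdout.
    """
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if user_password is not None:
        # -upw supplies the PDF user password
        cmd += ["-upw", user_password]
    cmd += [pdf_path, out]
    return cmd


def extract_text(pdf_path, user_password=None):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, user_password),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Note that a password passed on the command line may be visible to other users in the process list, so this approach is probably only suitable on a trusted machine.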
2013 Mar 04
2
Need Beginner Guide for Matcher Optimisations Project
Hi, While searching for a project which matches my interest and skill level, I found this project named Matcher Optimization. This project is really challenging and exciting from my viewpoint and I would like to be a part of this project. The optimization techniques mentioned in the reference links provided will take some time for me to understand well. But I am trying to get my
2009 Oct 15
1
"Complex?" import of pdf files (criminal records) into R table
Hi there, I'm trying to decide whether it would be possible to transform several more or less complex pdf files into an R table format or whether it has to be done manually. I think it would be impudent to expect a complete solution, but I would be grateful if anyone could give me advice on how the structure of such an R program could look, and whether it's possible in general. Here
2005 Oct 22
1
reading data from a pdf
> Hi, I'm trying to read data from a PDF file. Is it possible to do it > with R? Thanks, Marco If cut and paste to a text file fails, try this: pdftotext (from the xpdf project) or http://pdftohtml.sourceforge.net pdftohtml is a utility which converts PDF files into HTML and XML formats. In addition, pdftk, the command line pdf toolkit, may be useful http://www.accesspdf.com/pdftk/ -- Seek simplicity and mistrust it. Alfred Whitehead A witty sa...
2012 Dec 02
1
Reading PDF files
I need to do text mining on PDF files. I understand there is a readPDF command in tm that can be used. Have read the 2008 posts on converting PDF files to text by Tony Breyal and others. Wondering if the procedure has been standardized in any tutorial or otherwise? Being new to R, I was able to follow only part of the discussion. Any way to get a set of step by step instructions
2018 Apr 12
2
Windows PC PostScript printer driver -> CUPS data import fails
...rier >> Font >> Courier > It seems that .findfont can't find a font file that the PS file is > asking for. Is it possible that your Windows 10 is printing using some > new fonts that your CentOS doesn't have? > > I'd try: > 1. Use ps2ascii instead of ps2pdf+pdftotext. > > 2. Copy all font files from Windows 10 to your CentOS. Maybe put them in > ~/.fonts and see if that could make ps2pdf happy. > I'd recommend, to start, installing msttcorefonts, and see if that helps. mark
2018 Apr 12
2
Windows PC PostScript printer driver -> CUPS data import fails
...und online which allows me to easily import data from Windows programs. Hopefully others out there are using the system and have already found the answer to my problem. I have installed on my CentOS server a virtual CUPS printer which receives a PS file, and then runs 'ps2pdf' and 'pdftotext -layout' to end up with a text file. On the Windows PCs it's simply a case of installing a printer pointing to this server, and using the HP Colour Laser 2800 PS drivers. Now to my problem. We have finally moved onto Windows 10, and now when I try to install the printer that mode...
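The two-step conversion described above (PostScript to PDF, then PDF to layout-preserving text) could be sketched as below. This assumes `ps2pdf` and `pdftotext` are on the PATH; the helper names `conversion_commands` and `convert` are hypothetical and the poster's actual CUPS backend script is not shown in the excerpt.

```python
import subprocess
from pathlib import Path


def conversion_commands(ps_file):
    """Return the two commands for the PS -> PDF -> text pipeline."""
    ps = Path(ps_file)
    pdf = ps.with_suffix(".pdf")
    txt = ps.with_suffix(".txt")
    return [
        ["ps2pdf", str(ps), str(pdf)],
        # -layout keeps the physical page layout, as in the post
        ["pdftotext", "-layout", str(pdf), str(txt)],
    ]


def convert(ps_file):
    """Run both steps in order; stop with an exception if either fails."""
    for cmd in conversion_commands(ps_file):
        subprocess.run(cmd, check=True)
    return Path(ps_file).with_suffix(".txt")
```

Using `check=True` means a font or parsing failure in `ps2pdf` surfaces immediately instead of leaving `pdftotext` to run on a missing or broken PDF.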
2020 Sep 07
2
Indexer error after upgrade to 2.3.11.3
...to ${LOCALBASE} in post-patch:, so we cheat and set xpdf's path to /usr/lib. --- src/plugins/fts/decode2text.sh.orig 2017-10-28 12:21:20 UTC +++ src/plugins/fts/decode2text.sh @@ -79,16 +79,20 @@ wait_timeout() { LANG=en_US.UTF-8 export LANG if [ $fmt = "pdf" ]; then - /usr/bin/pdftotext $path - 2>/dev/null& + if [ -x /usr/lib/xpdf/pdftotext ]; then + /usr/lib/xpdf/pdftotext $path - 2>/dev/null& + else + /usr/local/bin/pdftotext $path - 2>/dev/null& + fi wait_timeout 2>/dev/null elif [ $fmt = "doc" ]; then - (/usr/bin/catdoc $path; tr...
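The patch above picks between two hard-coded pdftotext locations with a shell `-x` test. The same idea, written as a small lookup over a list of candidate paths, might look like this; `first_executable` is a hypothetical helper and the candidate list simply mirrors the paths in the patch.

```python
import os


def first_executable(candidates):
    """Return the first path that is an executable regular file, or None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


# Candidate locations taken from the patch above.
PDFTOTEXT_CANDIDATES = [
    "/usr/lib/xpdf/pdftotext",
    "/usr/local/bin/pdftotext",
]
```

Calling `first_executable(PDFTOTEXT_CANDIDATES)` returns None when neither path exists, which lets the caller report a clear error rather than run a missing binary.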
2008 Jul 30
3
Dealing with image PDF's
...could OCR when no text was returned while trying to process PDF's as a way of dealing with image only PDF's. Here's the bit in omindex.cc that deals with pdf's: } else if (mimetype == "application/pdf") { string safefile = shell_protect(file); string cmd = "pdftotext -enc UTF-8 " + safefile + " -"; try { dump = stdout_to_string(cmd); } catch (ReadError) { cout << "\"" << cmd << "\" failed - skipping\n"; return; } I wanted to change it so if nothing (or no strings...
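The idea in this post is to fall back to OCR when pdftotext returns nothing useful. One way to make that decision is sketched below; the function name `needs_ocr` and the `min_chars` threshold are assumptions for illustration, not anything taken from omindex.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a PDF is image-only from its pdftotext output.

    pdftotext typically emits whitespace and form feeds even for pages
    with no text layer, so whitespace is stripped before counting.
    """
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars
```

A caller would run pdftotext first, pass its output to `needs_ocr`, and only invoke the (much slower) OCR step when it returns True. The threshold is a judgment call and may need tuning for short documents.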
2010 Jan 09
4
parsing pdf files
I have a pdf file that I would like to parse into R: http://www.williams.edu/Registrar/geninfo/faculty.pdf For now, I open the file in Acrobat by hand, then save it "as text" and then use readLines(). That works fine but a) I am concerned that some information may be lost and b) I may be doing this a lot, so I would rather have R grab the information from the pdf file directly. So: is
2018 Apr 12
0
Windows PC PostScript printer driver -> CUPS data import fails
...that .findfont can't find a font file that the PS file is > > asking for. Is it possible that your Windows 10 is printing using some > > new fonts that your CentOS doesn't have? That would make sense > > > > I'd try: > > 1. Use ps2ascii instead of ps2pdf+pdftotext. I did first try ps2ascii as it was the most obvious choice. However, it gave nothing like the output I was expecting. ps2pdf + pdftotext -layout gives me almost exactly how the report originally looked, apart from the occasional alignment issue. > > > > 2. Copy all font files fro...
2014 May 12
0
message-decoder bug for attachments with charset=binary attribute in content-type?
Hello, I have configured dovecot with solr and I wanted to let solr index the content of attachments. For testing I used the biabam command line tool to generate emails with attachments. I found that dovecot with fts_decoder incorrectly decodes these attachments from biabam, and therefore pdftotext reported a corrupted PDF. The problem is that biabam generates a header with charset=binary and the dovecot message decoder tries to process it as UTF8 or non-UTF8 data. ============================================================ --biabam.ZxWVLybiabam.ZxWVLy Content-Type: application/pdf; charset=bi...
2009 Oct 14
2
puzzle using gsub (and encodings maybe)
..." -", "-", y) # and gsub works as expected [1] "NEW YORK-NEW ENGLAND" > I'm sure the problem has to do with the way I read the variable x. But even if I change the encoding for x to ASCII, I still cannot do the sub. I get x by reading a pdf file with pdftotext so you will not be able to replicate my issue. Thanks for any suggestions, Adrian
2008 Nov 13
1
readPDF() -- unsure how to install xpdf to make this work?
...latest R-News letter (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package 'tm' for text mining is mentioned. In that lovely package, there is a function called 'readPDF()'. In order to use this, ?readPDF says "Note that this PDF reader needs both the tools pdftotext and pdfinfo installed and accessable on your system." These tools are available from http://www.foolabs.com/xpdf/download.html I am able to download this and use it easily from a DOS window to convert a pdf file into a txt file. Question: how do I make these tools available to R, so that i...
2006 May 22
7
how to index the result of any instance method
Hi, One of the AAF features is to be able to index results of methods, but I haven't seen anywhere how to do this. I have a method that returns the full text of a file and I'd like for this to be indexed. Can anyone out there help me out on this one? Tom -- Posted via http://www.ruby-forum.com/.
2014 Jul 22
2
Help: Error in `colnames<-`(`*tmp*`, value = c(
...with the exception of the last "colnames", as can be seen in the following sequence: > pdf1<-"./PLAN de INSPECCIONES/05_seguridad_ciudadana.pdf" > pdf2<-"./PLAN de INSPECCIONES/2013_21SeguridadCiudadana.pdf" > exe<-"./xpdfbin-win-3.04/xpdfbin-win-3.04/bin32/pdftotext.exe" > system(paste("\"", exe, "\" \"", pdf1, "\"", sep = ""), wait = F) > system(paste("\"", exe, "\" \"", pdf2, "\"", sep = ""), wait = F) > txt1<-sub(".pd...
2006 Aug 11
3
Proposed changes to omindex
...This could be used as a secondary check for incremental indexing (i.e. if the file was touched but not changed don't replace it) and also to collapse duplicates (COLLAPSE=1). The md5 source code is from the GNU testutils-2.1 package. 4) For files that require command line utility processing (i.e. pdftotext) I have added a --copylocal option. This allows the file to be digested while being copied to the local drive and then the command line utility processes the local file, saving multiple reads across the network. If we want to expand this it could be used to build a local cache/backup/repository....
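The "digest while copying" idea in item 4 can be illustrated with a single pass over the file that feeds each chunk to both the MD5 hash and the local copy. This is only a sketch using Python's standard hashlib; `copy_with_md5` is a hypothetical name and the proposed omindex change itself is not shown in the excerpt.

```python
import hashlib


def copy_with_md5(src, dst, chunk_size=1 << 20):
    """Copy src to dst and return the MD5 hex digest of the data.

    The source is read exactly once, so a file on a network share
    is not fetched a second time just to compute the checksum.
    """
    digest = hashlib.md5()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            fout.write(chunk)
    return digest.hexdigest()
```

The returned digest can then be compared with a stored value to skip files that were touched but not changed, and the local copy handed to pdftotext.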
2009 Jan 26
2
Getting data from a PDF-file into R
Hello, I have around 200 PDF documents containing data I want organized in R as a dataframe. The PDF documents look like this: http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver.jpeg or like this: http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver%2B2.jpeg So I want to pull out the data in the coloured boxes so that it becomes organized like this (just in R instead of