thr3ads.net - similar to: "[GSoC] Questions about project Text-Extraction Libraries"

Displaying 20 results from an estimated 3000 matches similar to: "[GSoC] Questions about project Text-Extraction Libraries"

[GSoC] Questions about project Text-Extraction Libraries

2019 Mar 23

Thanks! That was really useful! I wanted to share my approach to this project in the hope that you can give me some feedback. I think that a design that anticipates the incorporation of new file formats is the most suitable way to solve the problem. In the attached sketch we can see: * Bug_Box: It is responsible for encapsulating and handling errors. * File_extrator: It presents an

Proposed changes to omindex

2006 Aug 11

Proposed changes to omindex Currently Available Items ========================= 1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during indexing. 2) Add the document's last modified time to the value table (ID 0). This would allow incremental indexing based on the timestamp and also sorting by date in omega (SORT=0) a. Currently I store the timestamp
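
The two proposals in this snippet can be sketched roughly as follows. This is only an illustration under stated assumptions, not omindex's actual code: the helper names `lookup_term` and `needs_reindex` are hypothetical, and the stored timestamp is passed in as a plain integer rather than read from a real Xapian value slot.

```python
import hashlib
import os
from typing import Optional


def lookup_term(path: str) -> bytes:
    """Hypothetical helper: the Q prefix followed by the 16-byte MD5
    digest of the full (absolute) file name."""
    full_name = os.path.abspath(path).encode("utf-8")
    return b"Q" + hashlib.md5(full_name).digest()


def needs_reindex(path: str, stored_mtime: Optional[int]) -> bool:
    """Hypothetical helper: decide whether a file must be indexed again.

    stored_mtime stands in for the last-modified time that the proposal
    would keep in value slot 0; None means the file was never indexed.
    """
    if stored_mtime is None:
        return True
    # Reindex only when the file on disk is newer than the stored timestamp.
    return int(os.path.getmtime(path)) > stored_mtime


if __name__ == "__main__":
    term = lookup_term("example.txt")
    print(len(term))  # 1 prefix byte + 16 digest bytes
```

Using the raw digest keeps the term at a fixed 17 bytes regardless of how long the file name is, which appears to be the motivation for hashing the name in the first place.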

readPDF() -- unsure how to install xpdf to make this work?

2008 Nov 13

Dear R-Help, I need to convert a set of '.pdf' files into an equivalent set of '.txt' files. This is so that I can do some text mining on the content. In the latest R-News letter (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package 'tm' for text mining is mentioned. In that lovely package, there is a function called 'readPDF()'. In order to use

Text-Extraction Libraries for Omindex

2019 Jun 14

This is a list of some libraries that I have been looking at. The idea is to discuss the advantages and disadvantages of adding some of these libraries to Xapian. If anyone knows another library that could be added to the list, that would be great! Libfreexl: * For Excel (.xls) * Last release: 2018-02 * Info: gaia-gis.it/fossil/freexl/index * License: MPL tri-license

[GSoC] Bug tracker access

2019 Mar 13

Hi! My name is Bruno Baruffaldi, I am a Computer Science student from Argentina. I am interested in working for Xapian for GSoC and I have been reading the developers guide. I tried to take a look at the bug tracker, but it seems that I need a username and a password. Is that correct? -- Regards, Bruno Baruffaldi

xapian-omega runfilter.cc patch

2008 Jul 29

Hi, The following patch for runfilter.cc is needed for building xapian-omega on FreeBSD:

--- runfilter.cc.orig 2008-07-03 21:16:54.000000000 +0200
+++ runfilter.cc 2008-07-03 21:18:48.000000000 +0200
@@ -25,6 +25,7 @@
 #include "safeerrno.h"
 #include <sys/types.h>
 #include <stdio.h>
+#include <signal.h>
 #include "safefcntl.h"
 #ifdef HAVE_SYS_TIME_H

Reading a password-protected PDF

2013 Feb 27

Hello respected developers, I was wondering if it is possible for xapian to read a password-protected PDF. Searches in the archives and Google yielded no results. I also tried looking at the source code but I could not find the part related to this issue. The characteristics of the set of PDFs are: 1. a set of password-protected PDF documents 2. all PDFs are set with the same password. 3.

pdftotext latest version for CentOS 7

2019 Dec 15

I have pdftotext 0.26.5, the current version for CentOS 7 and the Mate desktop as far as I can ascertain. The page https://www.xpdfreader.com/pdftotext-man.html seems to suggest that the latest version is 4.02 which seems a gigantic leap ahead. Since I have a Chinese text PDF which I am unable to extract any text from using pdftotext, instead I end up with a collection of garbage Latin

Need Beginner Guide for Matcher Optimisations Project

2013 Mar 04

Hi, While searching for a project which matches my interest and skill level, I found this project named Matcher Optimization. This project is really challenging and exciting from my viewpoint and I would like to be a part of it. The optimization techniques mentioned in the reference links provided will take some time for me to understand well. But I am trying to get my

"Complex?" import of pdf files (criminal records) into R table

2009 Oct 15

Hi there, I'm trying to decide whether it is possible to transform several more or less complex pdf files into an R table format or whether it has to be done manually. I think it would be impudent to expect a complete solution, but I would be grateful if anyone could give me some advice on how the structure of such an R program could look, and whether it's possible in general. Here

Reading PDF files

2012 Dec 02

I need to do text mining on PDF files. I understand there is a readPDF command in tm that can be used. Have read the 2008 posts on converting PDF files to text by Tony Breyal and others. Wondering if the procedure has been standardized in any tutorial or otherwise? Being new to R, I was able to follow only part of the discussion. Any way to get a set of step by step instructions

Indexer error after upgrade to 2.3.11.3

2020 Aug 19

Hi, after the upgrade to Dovecot 2.3.11.3, from 2.3.10.1, I see frequently these errors from different users: Aug 18 11:02:35 Panic: indexer-worker(info at domain.com) session=<g71KISOttvS5LNVj:O3ahCyuZO18cYAAAEPCW+w>: file http-client-request.c: line 1232 (http_client_request_send_more): assertion failed: (req->payload_input != NULL) Aug 18 11:02:35 Error: indexer-worker(info at

reading data from a pdf

2005 Oct 22

> Hi, I'm trying to read data from a PDF file. Is it possible to do it > with R? Thanks, Marco If cut and paste to a text file fails, try this: pdftotext (from the xpdf project) or http://pdftohtml.sourceforge.net pdftohtml is a utility which converts PDF files into HTML and XML formats. In addition, pdftk, the command line pdf toolkit, may be useful http://www.accesspdf.com/pdftk/

Windows PC PostScript printer driver -> CUPS data import fails

2018 Apr 12

Yan Li wrote: > On 04/12/2018 03:08 AM, Gary Stainburn wrote: >> The PDF contains: >> >> ERROR: invalidfileaccess >> OFFENDING COMMAND: .findfont >> OPERAND STACK: >> r >> /usr/share/X11/fonts/Type1/UTBI____.pfa >> --nostringval-- >> true >> NimbusMonL-Regu >> Courier >> --nostringval-- >> Courier >> 4544317

parsing pdf files

2010 Jan 09

I have a pdf file that I would like to parse into R: http://www.williams.edu/Registrar/geninfo/faculty.pdf For now, I open the file in Acrobat by hand, then save it "as text" and then use readLines(). That works fine but a) I am concerned that some information may be lost and b) I may be doing this a lot, so I would rather have R grab the information from the pdf file directly. So: is

Windows PC PostScript printer driver -> CUPS data import fails

2018 Apr 12

Hi all, For some years now I have been using a simple system I found online which allows me to easily import data from Windows programs. Hopefully others out there are using the system and have already found the answer to my problem. I have installed on my CentOS server a virtual CUPS printer which receives a PS file, and then runs 'ps2pdf' and 'pdftotext -layout' to end up

evince

2018 Mar 02

We have some small networks with connectivity to the Internet through firewall routers. The smallest has one Windows 7 system and three Linux systems including both CentOS 6 and CentOS 7 machines. The Windows 7 systems have full Adobe packages that are updated regularly and are trouble free. On the Linux systems, evince has been our go-to product for viewing and printing .pdf documents. This

GSoC 2016: Text-Extraction Libraries in Omega

2016 Mar 07

Hi, everyone. I'm a third-year student in Computer Science. I have a few projects (school-related) on Bitbucket <https://bitbucket.org/philipchung/philipchungtech>. I've been looking at the project-ideas list and I'm interested in making Omega use libraries instead of external programs. Right now I'm trying to get Olly's patch that was linked there to apply to the

Dealing with image PDF's

2008 Jul 30

Guys, I was just playing around and added a bit of code to omindex.cc so I could ocr tiff and tif with gocr which seems to work. Here's what it looks like: // Tiff: } else if (startswith(mimetype, "image/tif")) { // Inspired by http://mjr.towers.org.uk/comp/sxw2text string safefile = shell_protect(file); string cmd = "tifftopnm " + safefile + "

Dealing with image PDF's

2008 Jul 30
