Displaying 20 results from an estimated 3000 matches similar to: "[GSoC] Questions about project Text-Extraction Libraries"
2019 Mar 23
2
[GSoC] Questions about project Text-Extraction Libraries
Thanks!
That was really useful!
I wanted to share my approach to this project with the hope that you can
give me some feedback.
I think that applying a design that foresees the incorporation of new
file formats is the most suitable way to solve the problem.
In the attached sketch we can see:
* Bug_Box: It is responsible for encapsulating and handling errors.
* File_extrator: It presents an
2006 Aug 11
3
Proposed changes to omindex
Proposed changes to omindex
Currently Available Items
=========================
1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during
indexing.
2) Add the document's last modified time to the value table (ID 0). This would allow incremental
indexing based on the timestamp and also sorting by date in omega (SORT=0)
a. Currently I store the timestamp
2008 Nov 13
1
readPDF() -- unsure how to install xpdf to make this work?
Dear R-Help,
I need to convert a set of '.pdf' files into an equivalent set of
'.txt' files. This is so that I can do some text mining on the
content.
In the latest R-News letter
(http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package 'tm'
for text mining is mentioned. In
that lovely package, there is a function called 'readPDF()'. In order
to use
2019 Jun 14
2
Text-Extraction Libraries for Omindex
This is a list with some libraries that I have been looking at.
The idea is to discuss the advantages and disadvantages of adding some of
these libraries to Xapian.
If anyone knows another library that could be added to the list it would be
great!
Libfreexl:
* For Excel (.xls)
* Last release: 2018-02
* Info: gaia-gis.it/fossil/freexl/index
* License: MPL tri-license
2019 Mar 13
2
[GSoC] Bug tracker access
Hi!
My name is Bruno Baruffaldi, I am a Computer Science student from Argentina.
I am interested in working for Xapian for GSoC and I have been reading the
developers guide.
I tried to take a look at the bug tracker, but it seems that I need a
username and a password.
Is it correct?
--
Atte. Bruno Baruffaldi
2008 Jul 29
1
xapian-omega runfilter.cc patch
Hi,
The following patch for runfilter.cc is needed for building
xapian-omega on FreeBSD:
--- runfilter.cc.orig 2008-07-03 21:16:54.000000000 +0200
+++ runfilter.cc 2008-07-03 21:18:48.000000000 +0200
@@ -25,6 +25,7 @@
 #include "safeerrno.h"
 #include <sys/types.h>
 #include <stdio.h>
+#include <signal.h>
 #include "safefcntl.h"
 #ifdef HAVE_SYS_TIME_H
2013 Feb 27
2
Reading a password-protected PDF
Hello respected developers,
I was wondering if it is possible for xapian to read a password-protected
PDF. Searches in the archives and Google yielded 0 results. I also tried
looking at the source code but I could not find the specific part related to
this issue. The characteristics of the set of PDFs are as follows:
1. a set of password-protected PDF documents
2. all PDFs are set with the same password.
3.
2019 Dec 15
1
pdftotext latest version for CentOS 7
I have pdftotext 0.26.5, the current version for CentOS 7 and the Mate desktop as far as I can ascertain. The page https://www.xpdfreader.com/pdftotext-man.html seems to suggest that the latest version is 4.02 which seems a gigantic leap ahead.
Since I have a Chinese text PDF which I am unable to extract any text from using pdftotext, instead I end up with a collection of garbage Latin
2013 Mar 04
2
Need Beginner Guide for Matcher Optimisations Project
Hi,
While searching for a project which matches my interest and skill level, I
found this project named Matcher Optimization. This project is really
challenging and exciting from my view point and I would like to be a part of
this project.
Optimization techniques mentioned in the reference links provided will take
some time for me to have a good understanding about them. But I am trying
to get my
2009 Oct 15
1
"Complex?" import of pdf files (criminal records) into R table
Hi there,
I'm trying to decide whether it would be possible to transform several
more or less complex pdf files into an R table format or whether it has to
be done manually. I think it would be impudent to expect a complete
solution, but I would be grateful if anyone could give me advice on
what the structure of such an R program could look like, and whether it's
possible in general.
Here
2012 Dec 02
1
Reading PDF files
I need to do text mining on PDF files. I understand there is a readPDF
command in tm that can be used. Have read the 2008 posts on converting
PDF files to text by Tony Breyal and others.
Wondering if the procedure has been standardized in any tutorial or
otherwise? Being new to R, I was able to follow only part of the
discussion.
Any way to get a set of step by step instructions
2020 Aug 19
7
Indexer error after upgrade to 2.3.11.3
Hi,
after the upgrade to Dovecot 2.3.11.3, from 2.3.10.1, I frequently see
these errors from different users:
Aug 18 11:02:35 Panic: indexer-worker(info at domain.com)
session=<g71KISOttvS5LNVj:O3ahCyuZO18cYAAAEPCW+w>: file
http-client-request.c: line 1232 (http_client_request_send_more):
assertion failed: (req->payload_input != NULL)
Aug 18 11:02:35 Error: indexer-worker(info at
2005 Oct 22
1
reading data from a pdf
> Hi, I'm trying to read data from a PDF file. Is it possible to do it
> with R? Thanks, Marco
If cut and paste to a text file fails, try this:
pdftotext (from the xpdf project)
or
http://pdftohtml.sourceforge.net
pdftohtml is a utility which converts PDF files into HTML and
XML formats
In addition, pdftk, the command line pdf toolkit, may be useful:
http://www.accesspdf.com/pdftk/
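
A minimal sketch of the pdftotext route suggested above, assuming the
pdftotext binary is installed and on the PATH. It is written in Python
rather than R, and the helper names pdftotext_command and convert are
hypothetical, chosen only for this illustration:

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path=None, layout=False):
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    if txt_path is None:
        # Default output name: same name as the PDF, with a .txt suffix.
        txt_path = pdf_path.with_suffix(".txt")
    cmd = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the physical page layout.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert(pdf_path, txt_path=None, layout=False):
    """Run pdftotext and return the path of the text file it was asked to write."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    cmd = pdftotext_command(pdf_path, txt_path, layout)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling convert("report.pdf") would then leave the extracted text in
report.txt, which can be read from R with readLines().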
2018 Apr 12
2
Windows PC PostScript printer driver -> CUPS data import fails
Yan Li wrote:
> On 04/12/2018 03:08 AM, Gary Stainburn wrote:
>> The PDF contains:
>>
>> ERROR: invalidfileaccess
>> OFFENDING COMMAND: .findfont
>> OPERAND STACK:
>> r
>> /usr/share/X11/fonts/Type1/UTBI____.pfa
>> --nostringval--
>> true
>> NimbusMonL-Regu
>> Courier
>> --nostringval--
>> Courier
>> 4544317
2010 Jan 09
4
parsing pdf files
I have a pdf file that I would like to parse into R:
http://www.williams.edu/Registrar/geninfo/faculty.pdf
For now, I open the file in Acrobat by hand, then save it "as text"
and then use readLines(). That works fine but a) I am concerned that
some information may be lost and b) I may be doing this a lot, so I
would rather have R grab the information from the pdf file directly.
So: is
2018 Apr 12
2
Windows PC PostScript printer driver -> CUPS data import fails
Hi all,
For some years now I have been using a simple system I found online which
allows me to easily import data from Windows Programs.
Hopefully others out there are using the system and already have found the
answer to my problem.
I have installed on my Centos server a virtual CUPS printer which receives a
PS file, and then runs 'ps2pdf' and 'pdftotext -layout' to end up
2018 Mar 02
5
evince
We have some small networks with connectivity to the Internet
through firewall routers. The smallest has one Windows 7
system and three Linux systems including both CentOS 6 and
CentOS 7 machines. The Windows 7 systems have full Adobe
packages that are updated regularly and are trouble free.
On the Linux systems, evince has been our go-to product for
viewing and printing .pdf documents. This
2016 Mar 07
2
GSoC 2016: Text-Extraction Libraries in Omega
Hi, everyone. I'm a third-year student in Computer Science. I have a few
projects (school-related) on Bitbucket
<https://bitbucket.org/philipchung/philipchungtech>.
I've been looking at the project-ideas list and I'm interested in making
Omega use libraries instead of external programs.
Right now I'm trying to get Olly's patch that was linked there to apply
to the
2008 Jul 30
3
Dealing with image PDF's
Guys,
I was just playing around and added a bit of code to omindex.cc so I
could ocr tiff and tif with gocr which seems to work. Here's what it
looks like:
// Tiff:
} else if (startswith(mimetype, "image/tif"))
{
// Inspired by http://mjr.towers.org.uk/comp/sxw2text
string safefile = shell_protect(file);
string cmd = "tifftopnm " + safefile + "