Displaying 9 results from an estimated 9 matches for "antiword".
2009 Apr 29
1
antiword
Hi guys,
I've been noticing more and more that antiword has trouble with many
Word documents.
It may look like it's converted a document but leaves out headings and
bits of text.
I've been looking into getting openoffice to do it in headless mode but
still have a way to go before it's stable.
I was wondering if anyone else has had any luck on...
2009 Apr 06
2
omindex => Unknown extension
...metimes fails to recognize the extension of
some files (.doc, .pdf) and skips them. In the same run, omindex is
otherwise perfectly able to index other files with same extensions. The
reason is not clear but it should occur before it selects a content
converter since for example, if I manually run antiword on a .doc file
that failed, it works...
Running omindex:
Unknown extension: "/srv/xapian/targets/dir/subdir/file name.doc" - skipping
Manual conversion:
host:/srv # antiword "/srv/xapian/targets/dir/subdir/file name.doc"
<..plain text content of the file...>
host:/srv #...
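
The symptom above (a converter that works when run by hand, but a file skipped for its extension) suggests the lookup by extension happens before any converter is chosen. A minimal sketch of such a lookup, under stated assumptions: the `CONVERTERS` table and the `converter_command` helper are hypothetical illustrations, not omindex's actual code or configuration.

```python
import os

# Hypothetical extension-to-converter table (an assumption for illustration,
# not omindex's real mapping). Only antiword is taken from the post above.
CONVERTERS = {
    ".doc": ["antiword"],
}


def converter_command(path):
    """Return the argv list to convert `path`, or None for an unknown extension."""
    # splitext only looks at the final dot, so spaces in the file name are harmless.
    ext = os.path.splitext(path)[1].lower()
    tool = CONVERTERS.get(ext)
    if tool is None:
        return None
    # Keep the path as a single argv element so no shell quoting is needed.
    return tool + [path]
```

Passing the returned list to `subprocess.run(cmd, capture_output=True)` would run the converter without going through a shell, so a path such as "file name.doc" needs no extra quoting.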
2007 Apr 01
10
indexing mostly-binary documents (.ppt)
Here's an interesting problem: In my app, we are indexing various
types of documents, including Microsoft PowerPoint. PowerPoint
documents are mostly binary, but have a bunch of text (all of the
text in the document?) as well.
My thinking is that the binary will never get searched for, and the
proper text will be indexed and queried as expected, so the indexed
binary will never
2008 Oct 15
3
Extract text from Microsoft PowerPoint files
Hello CentOS people,
I'm wondering if there are command tools like antiword and docx2txt for
Microsoft PowerPoint files (.ppt and .pptx). The idea is to extract
text from PowerPoint files. Sorry this isn't exactly about CentOS, but
I'd really like it if Yum has something. I tried xlhtml, but it hasn't
been updated in a while and isn't exactly wanting...
2016 Sep 27
1
omega issues/notes
All,
I've run into a couple of things using omega/omindex under cygwin. I don't
think I'd attribute them to xapian, omega or omindex, but wanted to get
them out to the list so that if anyone else should run into these things
down the road, hopefully someone will remember and be able to help.
1) after compiling and building omega, and doing make install, I get a set
violation when
2009 Aug 05
2
reading and frequency analysis of Spanish text
For an historical paper I'm working on, I have some Spanish plaintext,
presently in the form of a Word .doc
file,
http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc
and also some ciphered text from the same original source. The ultimate
goal is to use some
frequency analysis of letters and word lengths in the plaintext to help
decode the
2018 Jun 15
0
CRAN Check warnings with GCC 8.1
...N packages
>
> Amelia C50 Cubist Cyclops DetSel GENLIB IRISSeismic KSgeneral
> MigClim MonetDBLite Numero OpenMx PBSmapping PSPManalysis
> PropClust RArcInfo RandomFields RandomFieldsUtils RcppMsgPack
> RcppParallel RcppRedis RecordLinkage Rmalschains RnavGraph Rvcg
> RxODE SiMRiv antiword bigrquery bsamGP catnet coxme dbarts
> dggridR divest dpglasso earth epanet2toolkit fs gap geojsonsf
> gglasso graphql hashmap haven hier.part imager iptools jiebaR
> kernlab lpridge lvec mlvocab mongolite nandb ore phreeqc polyclip
> qtbase rbamtools rebmix rexpokit rgdal rioja rlas rp...
2009 Aug 17
2
reading in MS Word files
I am familiar with packages that read and write Excel files on both Windows
and Linux platforms.
Do any packages provide similar functionality for MS Word files? I have a
lot of text processing to do and the text is embedded in ~200 different Word
files (.doc format Office 2003). All I need to do is read, not write.
Thanks,
Mark
2009 Aug 28
2
OT: .doc,.xls,.pdf,.ppt (etc.) string parser/indexers
...anyone have experience with Linux tools to parse the text from
common non-text file formats for searching? I'm trying to use the
kinosearch add-on for twiki which is fine as far as the search goes, but
it takes forever to generate the index. It uses xpdf to extract strings
from PDFs, antiword for .doc, and since it is Perl, the
Spreadsheet::ParseExcel module for .xls. Some documents parse/index
quickly, some extremely slowly, and in the .xls case some seem to hang
forever. I think the real issue is when the parsers (correctly or
incorrectly) detect a wide character set and the ind...