thr3ads.net - search: "antiword"

Displaying 9 results from an estimated 9 matches for "antiword".

2009 Apr 29

antiword

Hi guys, I've been noticing more and more that antiword has trouble with many Word documents. It may look as if it has converted a document, but it leaves out headings and bits of text. I've been looking into getting OpenOffice to do the conversion in headless mode, but still have a way to go before it's stable. I was wondering if anyone else had any luck on...
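
One way to combine antiword with a second converter is a simple fallback chain. The sketch below is only an illustration under assumptions: the function name `first_usable_text` and the converter table are hypothetical, and the only real command relied on is `antiword FILE`, which prints plain text to stdout. It catches outright failures and empty output; it cannot detect headings that a converter drops silently, which would need a comparison between two converters' output.

```python
import subprocess


def first_usable_text(path, converters, min_chars=1):
    """Try each converter in order; return (name, text) from the first one
    that exits cleanly and produces at least min_chars of text."""
    for name, make_cmd in converters:
        try:
            result = subprocess.run(
                make_cmd(path),
                capture_output=True,
                encoding="utf-8",
                errors="replace",  # do not crash on unexpected output bytes
                timeout=60,
            )
        except (OSError, subprocess.TimeoutExpired):
            continue  # tool not installed, or it hung: try the next one
        text = result.stdout.strip()
        if result.returncode == 0 and len(text) >= min_chars:
            return name, text
    return None, ""


# Hypothetical converter table: each entry maps a label to a function that
# builds the command line. Further converters (for example a headless office
# suite) would be appended here by the caller.
CONVERTERS = [
    ("antiword", lambda p: ["antiword", p]),
]
```

Calling `first_usable_text("report.doc", CONVERTERS)` would return the label of the converter that succeeded along with its text, or `(None, "")` if none did.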

2009 Apr 06

omindex => Unknown extension

...metimes fails to recognize the extension of some files (.doc, .pdf) and skips them. In the same run, omindex is otherwise perfectly able to index other files with the same extensions. The reason is not clear, but the failure should occur before it selects a content converter since, for example, if I manually run antiword on a .doc file that failed, it works...
Running omindex:
Unknown extension: "/srv/xapian/targets/dir/subdir/file name.doc" - skipping
Manual conversion:
host:/srv # antiword "/srv/xapian/targets/dir/subdir/file name.doc"
<..plain text content of the file...>
host:/srv #...

2007 Apr 01

indexing mostly-binary documents (.ppt)

Here's an interesting problem: In my app, we are indexing various types of documents, including Microsoft PowerPoint. PowerPoint documents are mostly binary, but have a bunch of text (all of the text in the document?) as well. My thinking is that the binary will never get searched for, and the proper text will be indexed and queried as expected, so the indexed binary will never
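
An alternative to indexing the raw binary is to keep only the readable fragments, much as the Unix `strings` tool does. The sketch below is an illustration, not the poster's method. It assumes that the text of interest appears as runs of printable ASCII or as the same characters in UTF-16LE; the post does not say how the text is stored. The name `printable_runs` and the minimum run length of 4 are arbitrary choices.

```python
import re

# Runs of 4 or more printable ASCII bytes, and the same characters stored
# as UTF-16LE (each ASCII byte followed by a zero byte).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def printable_runs(data):
    """Return the readable text fragments found in a bytes object."""
    runs = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    runs += [m.group().decode("utf-16-le") for m in UTF16_RUN.finditer(data)]
    return runs
```

Feeding only these fragments to the indexer keeps binary noise out of the index, at the cost of missing any text stored in another encoding.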

2008 Oct 15

Extract text from Microsoft PowerPoint files

Hello CentOS people, I'm wondering if there are command tools like antiword and docx2txt for Microsoft PowerPoint files (.ppt and .pptx). The idea is to extract text from PowerPoint files. Sorry this isn't exactly about CentOS, but I'd really like it if Yum has something. I tried xlhtml, but it hasn't been updated in a while and isn't exactly wanting...

2016 Sep 27

omega issues/notes

All, I've run into a couple of things using omega/omindex under cygwin. I don't think I'd attribute them to xapian, omega or omindex, but wanted to get them out to the list so that if anyone else should run into these things down the road, hopefully someone will remember and be able to help. 1) after compiling and building omega, and doing make install, I get a set violation when

2009 Aug 05

reading and frequency analysis of Spanish text

For an historical paper I'm working on, I have some Spanish plaintext, presently in the form of a Word .doc file, http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc and also some ciphered text from the same original source. The ultimate goal is to use some frequency analysis of letters and word lengths in the plaintext to help decode the
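
The letter and word-length counts described here can be sketched as follows. This is a minimal illustration in Python, not the poster's approach. It assumes the text has already been extracted from the .doc file into a string, and the function name `letter_and_length_counts` is hypothetical.

```python
import re
import unicodedata
from collections import Counter


def letter_and_length_counts(text):
    """Count letters and word lengths in a text.

    Accented vowels and ñ are counted as letters in their own right.
    """
    # Compose characters so that "n" plus a combining tilde becomes "ñ".
    text = unicodedata.normalize("NFC", text).lower()
    # A word is a run of Unicode letters (no digits or underscores).
    words = re.findall(r"[^\W\d_]+", text)
    letters = Counter(ch for word in words for ch in word)
    lengths = Counter(len(word) for word in words)
    return letters, lengths
```

`letters.most_common(10)` then lists the ten most frequent letters, which can be compared with the same counts taken from the ciphered text.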

2018 Jun 15

CRAN Check warnings with GCC 8.1

...N packages Amelia C50 Cubist Cyclops DetSel GENLIB IRISSeismic KSgeneral MigClim MonetDBLite Numero OpenMx PBSmapping PSPManalysis PropClust RArcInfo RandomFields RandomFieldsUtils RcppMsgPack RcppParallel RcppRedis RecordLinkage Rmalschains RnavGraph Rvcg RxODE SiMRiv antiword bigrquery bsamGP catnet coxme dbarts dggridR divest dpglasso earth epanet2toolkit fs gap geojsonsf gglasso graphql hashmap haven hier.part imager iptools jiebaR kernlab lpridge lvec mlvocab mongolite nandb ore phreeqc polyclip qtbase rbamtools rebmix rexpokit rgdal rioja rlas rp...

2009 Aug 17

reading in MS Word files

I am familiar with packages that read and write Excel files on both Windows and Linux platforms. Do any packages provide similar functionality for MS Word files? I have a lot of text processing to do and the text is embedded in ~200 different Word files (.doc format, Office 2003). All I need to do is read, not write. Thanks, Mark

2009 Aug 28

OT: .doc,.xls,.pdf,.ppt (etc.) string parser/indexers

...anyone have experience with Linux tools to parse the text from common non-text file formats for searching? I'm trying to use the KinoSearch add-on for TWiki, which is fine as far as the search goes, but it takes forever to generate the index. It uses xpdf to extract strings from PDFs, antiword for .doc, and, since it is Perl, the Spreadsheet::ParseExcel module for .xls. Some documents parse/index quickly, some extremely slowly, and in the .xls case some seem to hang forever. I think the real issue is when the parsers (correctly or incorrectly) detect a wide character set and the ind...
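
A per-file timeout is one way to keep a single slow or hanging parser from stalling the whole index run. The sketch below is an assumption-laden illustration in Python rather than the Perl the add-on uses, and it is not how KinoSearch or TWiki work internally. The names `EXTRACTORS`, `extractor_command`, and `run_with_timeout` are hypothetical; the two commands shown (`antiword FILE` and xpdf's `pdftotext FILE -`) both write plain text to stdout.

```python
import os
import subprocess

# Extension -> function building the extractor command line.
EXTRACTORS = {
    ".doc": lambda p: ["antiword", p],
    ".pdf": lambda p: ["pdftotext", p, "-"],
}


def extractor_command(path):
    """Return the command for this file type, or None if it is unsupported."""
    ext = os.path.splitext(path)[1].lower()
    make_cmd = EXTRACTORS.get(ext)
    return make_cmd(path) if make_cmd else None


def run_with_timeout(cmd, seconds=30):
    """Return the command's stdout as bytes, or None if the command is
    missing, exits with an error, or runs longer than the time limit."""
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=seconds)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return done.stdout if done.returncode == 0 else None
```

Files for which `run_with_timeout` returns `None` can be logged and skipped, so the slow cases are identified without blocking the rest of the run.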
