Displaying 9 results from an estimated 9 matches for "antiword".

2009 Apr 29
1
antiword
Hi guys, I've been noticing more and more that antiword has trouble with many Word documents. It may look like it has converted a document, but it leaves out headings and bits of text. I've been looking into getting OpenOffice to do it in headless mode but still have a way to go before it's stable. I was wondering if anyone else had any luck on...
2009 Apr 06
2
omindex => Unknown extension
...metimes fails to recognize the extension of some files (.doc, .pdf) and skips them. In the same run, omindex is otherwise perfectly able to index other files with the same extensions. The reason is not clear, but it should occur before it selects a content converter since, for example, if I manually run antiword on a .doc file that failed, it works... Running omindex: Unknown extension: "/srv/xapian/targets/dir/subdir/file name.doc" - skipping Manual conversion: host:/srv # antiword "/srv/xapian/targets/dir/subdir/file name.doc" <..plain text content of the file...> host:/srv #...
2007 Apr 01
10
indexing mostly-binary documents (.ppt)
Here's an interesting problem: In my app, we are indexing various types of documents, including Microsoft PowerPoint. PowerPoint documents are mostly binary, but have a bunch of text (all of the text in the document?) as well. My thinking is that the binary will never get searched for, and the proper text will be indexed and queried as expected, so the indexed binary will never
2008 Oct 15
3
Extract text from Microsoft PowerPoint files
Hello CentOS people, I'm wondering if there are command tools like antiword and docx2txt for Microsoft PowerPoint files (.ppt and .pptx). The idea is to extract text from PowerPoint files. Sorry this isn't exactly about CentOS, but I'd really like it if Yum has something. I tried xlhtml, but it hasn't been updated in a while and isn't exactly wanting...
2016 Sep 27
1
omega issues/notes
All, I've run into a couple of things using omega/omindex under cygwin. I don't think I'd attribute them to xapian, omega or omindex, but wanted to get them out to the list so that if anyone else should run into these things down the road, hopefully someone will remember and be able to help. 1) after compiling and building omega, and doing make install, I get a set violation when
2009 Aug 05
2
reading and frequency analysis of Spanish text
For an historical paper I'm working on, I have some Spanish plaintext, presently in the form of a Word .doc file, http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc and also some ciphered text from the same original source. The ultimate goal is to use some frequency analysis of letters and word lengths in the plaintext to help decode the
2018 Jun 15
0
CRAN Check warnings with GCC 8.1
...N packages > > Amelia C50 Cubist Cyclops DetSel GENLIB IRISSeismic KSgeneral > MigClim MonetDBLite Numero OpenMx PBSmapping PSPManalysis > PropClust RArcInfo RandomFields RandomFieldsUtils RcppMsgPack > RcppParallel RcppRedis RecordLinkage Rmalschains RnavGraph Rvcg > RxODE SiMRiv antiword bigrquery bsamGP catnet coxme dbarts > dggridR divest dpglasso earth epanet2toolkit fs gap geojsonsf > gglasso graphql hashmap haven hier.part imager iptools jiebaR > kernlab lpridge lvec mlvocab mongolite nandb ore phreeqc polyclip > qtbase rbamtools rebmix rexpokit rgdal rioja rlas rp...
2009 Aug 17
2
reading in MS Word files
I am familiar with packages that read and write Excel files on both Windows and Linux platforms. Do any packages provide similar functionality for MS Word files? I have a lot of text processing to do and the text is embedded in ~200 different Word files (.doc format Office 2003). All I need to do is read, not write. Thanks, Mark
2009 Aug 28
2
OT: .doc,.xls,.pdf,.ppt (etc.) string parser/indexers
...anyone have experience with Linux tools to parse the text from common non-text file formats for searching? I'm trying to use the KinoSearch add-on for TWiki, which is fine as far as the search goes, but it takes forever to generate the index. It uses xpdf to extract strings from PDFs, antiword for .doc, and, since it is Perl, the Spreadsheet::ParseExcel module for .xls. Some documents parse/index quickly, some extremely slowly, and in the .xls case some seem to hang forever. I think the real issue is when the parsers (correctly or incorrectly) detect a wide character set and the ind...