Here's an interesting problem:

In my app, we are indexing various types of documents, including Microsoft PowerPoint. PowerPoint documents are mostly binary, but have a bunch of text (all of the text in the document?) as well. My thinking is that the binary will never get searched for, and the proper text will be indexed and queried as expected, so the indexed binary will never affect results. Is this correct?

Then my colleague mentioned that maybe the indexed garbage would affect the weighting of certain searches. I figure that weighting is only per-search so, same situation as above, only the proper terms will be calculated.

What do you folks think?

John
On Apr 1, 2007, at 3:09 AM, John Bachir wrote:

> Here's an interesting problem: In my app, we are indexing various
> types of documents, including Microsoft PowerPoint. PowerPoint
> documents are mostly binary, but have a bunch of text (all of the
> text in the document?) as well.

Are you serious? You're adding raw, unprocessed PPT files to your index?

Now this is just wrong. PPT files may contain all sorts of binary data, such as images and videos. I just had a look at the sample presentation that came with my Office installation. This file is 3.5MB in size with a (plain text) payload of less than 1KB.

I'm sure there's some tool available which converts PPT to plain text and I strongly recommend you go out and find it.

Cheers,
Andy
John Joseph Bachir
2007-Apr-01 16:11 UTC
[Ferret-talk] indexing mostly-binary documents (.ppt)
On Apr 1, 2007, at 5:37 AM, Andreas Korth wrote:

> Are you serious? You're adding raw, unprocessed PPT files to your
> index?
>
> Now this is just wrong. PPT files may contain all sorts of binary
> data, such as images and videos. I just had a look at the sample
> presentation that came with my Office installation. This file is
> 3.5MB in size with a (plain text) payload of less than 1KB.

As I stated in my previous email, I am conjecturing that indexing these documents will not affect search performance. Do you disagree?

> I'm sure there's some tool available which converts PPT to plain text
> and I strongly recommend you go out and find it.

I've searched far and wide and have found none.

john
Florian Gilcher
2007-Apr-01 16:47 UTC
[Ferret-talk] indexing mostly-binary documents (.ppt)
Hi,

there are some: Catdoc and Antiword, for example. Both are simple shell commands to extract text from files.

http://www.45.free.net/~vitus/software/catdoc/
http://www.winfield.demon.nl/

Antiword has better Windows support, but as far as I know doesn't support .ppt as well as catdoc. I'm no expert though, I just used it once or twice at the university. If you use them, I would be interested in feedback on how well they work.

Thanks in advance and good luck,
Florian

P.S.: There is an article about even more of them at http://www.linux.com/article.pl?sid=06/02/22/201247 .

John Joseph Bachir wrote:
> On Apr 1, 2007, at 5:37 AM, Andreas Korth wrote:
>> I'm sure there's some tool available which converts PPT to plain text
>> and I strongly recommend you go out and find it.
>
> I've searched far and wide and have found none.
> John Joseph Bachir wrote:
>> On Apr 1, 2007, at 5:37 AM, Andreas Korth wrote:
>>> I'm sure there's some tool available which converts PPT to plain
>>> text and I strongly recommend you go out and find it.
>>
>> I've searched far and wide and have found none.

On Apr 1, 2007, at 12:47 PM, Florian Gilcher wrote:

> Catdoc and Antiword for example. Simple shell commands to extract text
> from files.
>
> http://www.45.free.net/~vitus/software/catdoc/
> http://www.winfield.demon.nl/
>
> Antiword has better windows support, but as far as I know doesn't
> support .ppt as well as catdoc. I'm no expert though, just used it
> once or twice at the university. If you use them, I would be
> interested in feedback on how well it works.

Wow, I had never come across catdoc and its siblings, and believe me, I've searched far and wide. THANK YOU.

By the way, I'm pretty sure antiword does not work on PowerPoint:

    $ antiword powerpoint.ppt
    This OLE file does not contain a Word document

John
On Apr 1, 2007, at 12:47 PM, Florian Gilcher wrote:

> Catdoc and Antiword for example. Simple shell commands to extract text
> from files.
>
> http://www.45.free.net/~vitus/software/catdoc/
> http://www.winfield.demon.nl/
>
> If you use them, I would be interested in
> feedback on how well it works.

I am now using catdoc, catppt, and xls2csv to index all of my documents, and it is working well.

The content out of catppt seems to be rather incomplete, but is Good Enough for our purposes.

John
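The setup John describes (one extractor per document type, run before indexing) could be sketched roughly as follows. This is a minimal Python illustration, not John's actual code: the command names `catdoc`, `catppt`, and `xls2csv` come from the thread, while the extension mapping, the function names `extractor_for` and `extract_text`, and the UTF-8 decoding are assumptions made for the example.

```python
import shutil
import subprocess
from pathlib import Path

# Extension -> extractor command. The commands are the ones named in
# this thread; the mapping itself is an assumption for illustration.
EXTRACTORS = {
    ".doc": "catdoc",
    ".ppt": "catppt",
    ".xls": "xls2csv",
}


def extractor_for(path):
    """Return the extractor command for a file, or None if unsupported."""
    return EXTRACTORS.get(Path(path).suffix.lower())


def extract_text(path):
    """Run the matching extractor and return its stdout as text.

    Returns None when no extractor is registered or installed, so the
    caller can skip the file instead of indexing raw binary.
    """
    command = extractor_for(path)
    if command is None or shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, str(path)],
        capture_output=True,
        check=True,  # raise if the extractor exits with an error
    )
    # Decode leniently: the output encoding depends on how the
    # extractor is configured (assumed UTF-8 here).
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string is what would be handed to the indexer in place of the raw file. Returning None for unknown types keeps unprocessed binary out of the index by default.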
On Apr 1, 2007, at 6:11 PM, John Joseph Bachir wrote:

>> Now this is just wrong. PPT files may contain all sorts of binary
>> data, such as images and videos. I just had a look at the sample
>> presentation that came with my Office installation. This file is
>> 3.5MB in size with a (plain text) payload of less than 1KB.
>
> As I stated in my previous email, I am conjecturing that indexing
> these documents will not affect search performance. Do you disagree?

I couldn't disagree more. The question is to what extent it affects performance.

>> I'm sure there's some tool available which converts PPT to plain text
>> and I strongly recommend you go out and find it.
>
> I've searched far and wide and have found none.

Seems like you found one now :)

Good Luck!

-- Andy
John Bachir wrote:

> On Apr 1, 2007, at 12:47 PM, Florian Gilcher wrote:
>> Catdoc and Antiword for example. Simple shell commands to extract text
>> from files.
>>
>> http://www.45.free.net/~vitus/software/catdoc/
>> http://www.winfield.demon.nl/
>
>> If you use them, I would be interested in
>> feedback on how well it works.
>
> I am now using catdoc, catppt, and xls2csv to index all of my
> documents, and it is working well.
>
> The content out of catppt seems to be rather incomplete, but is Good
> Enough for our purposes.

If you were going to be happy with the plain contents being indexed, I'd suggest just running the PowerPoint document through strings before indexing it. I don't know if catppt does more or less than that, but it'd be useful to compare.

-- Alex
On Apr 2, 2007, at 7:02 AM, Alex Young wrote:

> If you were going to be happy with the plain contents being
> indexed, I'd suggest just running the powerpoint document through
> strings before indexing it. I don't know if catppt does more or less
> than that, but it'd be useful to compare.

I tried that; the result is a LOT of binary garbage along with the plain text.
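Why strings lets garbage through can be seen from a rough Python approximation of what it does: it keeps every run of printable bytes above a minimum length, whether or not the run is real text. The sketch below is not the actual implementation of strings(1). The 4-character minimum matches the usual default, the sample bytes are invented for the example, and the real tool also treats a few whitespace characters such as tab as printable.

```python
import re

# Runs of at least 4 printable ASCII bytes (space through tilde).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_runs(data):
    """Return every printable run found in a bytes object, in order."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]


# Invented sample: one real phrase, an image marker, and a chunk of
# binary data that happens to consist of printable bytes.
sample = (
    b"\x00\x01" + b"Quarterly results" + b"\x00\xff\x10"
    + b"JFIF" + b"\x00" + b"x9$Kq" + b"\x02\x03" + b"ab" + b"\x00"
)

print(printable_runs(sample))
# prints ['Quarterly results', 'JFIF', 'x9$Kq']
```

Only the first item is text a user would search for. The other two pass the filter purely because their bytes fall in the printable range, which is the kind of noise a format-aware extractor such as catppt is meant to avoid.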
On 4/2/07, Andreas Korth <andreas.korth at gmx.net> wrote:

> On Apr 1, 2007, at 6:11 PM, John Joseph Bachir wrote:
> >> Now this is just wrong. PPT files may contain all sorts of binary
> >> data, such as images and videos. I just had a look at the sample
> >> presentation that came with my Office installation. This file is
> >> 3.5MB in size with a (plain text) payload of less than 1KB.
> >
> > As I stated in my previous email, I am conjecturing that indexing
> > these documents will not affect search performance. Do you disagree?
>
> I couldn't disagree more. Question is to what extent does it affect
> performance.

Andy is right. Indexing binary data like this can really blow out the size of an index. When you index natural language, you get a lot of common terms, so even in an index with millions of documents you may have only tens of thousands of terms. This has a natural compression effect on the index, so it will be a lot smaller than the collection of data that is being indexed. This doesn't work with binary data, so the size of your index will be much larger and you'll have far more search terms in the index. So it will definitely have an effect on search performance, but perhaps not as much as you'd expect. Nevertheless, you'd be much better off extracting the text, as others have already said.

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
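Dave's point about the number of distinct terms can be illustrated with a small, deliberately contrived Python experiment: tokenize a repetitive piece of natural-language text and the same number of random bytes, then compare vocabulary sizes. The tokenizer (lowercased runs of ASCII letters and digits), the sample sentence, and the random data are all assumptions for the example. This is not Ferret's analyzer, and real binary files are not uniformly random.

```python
import random
import re

# Assumed tokenizer: lowercase runs of ASCII letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


def unique_terms(text):
    """Return the set of distinct terms an index would have to store."""
    return set(TOKEN.findall(text.lower()))


# Natural language repeats itself: many words, few distinct terms.
sentence = "the quarterly results show that the sales team met the sales target "
natural = sentence * 500

# Same number of bytes, but random, standing in for binary content.
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(len(natural))).decode("latin-1")

natural_terms = unique_terms(natural)
noise_terms = unique_terms(noise)

print(len(natural_terms), "distinct terms in the natural text")
print(len(noise_terms), "distinct terms in the random bytes")
```

The natural text contributes only a handful of distinct terms no matter how often the sentence repeats, while nearly every accidental token in the random data is new. Each such term needs its own entry in the term dictionary, which is the index growth Dave describes.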
John Joseph Bachir
2007-Apr-09 04:12 UTC
[Ferret-talk] indexing mostly-binary documents (.ppt)
On Apr 6, 2007, at 4:02 AM, David Balmain wrote:

> Andy is right. Indexing binary data like this can really blow out the
> size of an index. Indexing natural language you get a lot of common
> terms so even in an index with millions of documents, you may have
> only tens of thousands of terms. This has a natural compression effect
> on the index so it will be a lot smaller than the collection of data
> that is being indexed. This doesn't work with binary data so the size
> of your index will be much larger and you'll have far more search
> terms in the index. So it will definitely have an effect on search
> performance but perhaps not as much as you'd expect.

For the record, by performance I meant the quality of the search (i.e., the results of a search query), and not the speed. I now realize that there is no way for anyone to have known that :)

Thanks again for all the ideas, I'm happy as a clam with catdoc/catppt.

John
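On the quality question, one way garbage could affect ranking is through field-length normalization. Lucene's classic similarity scales a field's score contribution by 1/sqrt(number of terms in the field). Ferret began as a port of Lucene, but whether a given Ferret setup uses exactly this formula is an assumption here, and the term counts below are hypothetical. A minimal Python sketch under those assumptions:

```python
import math


def length_norm(num_terms):
    """Length normalization factor, assuming the 1/sqrt(n) form."""
    return 1.0 / math.sqrt(num_terms)


# Hypothetical slide deck: 150 real words, versus the same deck indexed
# raw with 50,000 extra garbage tokens in the same field.
clean = length_norm(150)
padded = length_norm(150 + 50_000)

print(f"clean field norm:  {clean:.4f}")
print(f"padded field norm: {padded:.4f}")
print(f"ratio:             {clean / padded:.1f}x")
```

Under these assumptions, a match on a real term in the raw-indexed deck would be weighted far lower than the same match in the extracted text. So garbage terms could influence ranking even if nobody ever searches for them.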