Hi (and hello to the French speakers too),

I have installed Xapian 0.9.9 on Ubuntu 6.06 LTS (Dapper Drake) using the Synaptic Package Manager.

I have a repository of about 200,000 files (mainly PDF, doc, xls), around 20 GB of data, and I want to index all of them for full-text search.

My first question is:
- Can we do that, or can we only index HTML and JPEG?

I have succeeded in using "omindex", but...
- How can we use a PHP file to query the Quartz database?

Thanks for your help.

Best regards,
Ix
On Friday 16 Mar 2007, iX Gamerz wrote:
> I have a repository with about 200000 files (principaly PDF, doc, xls, ).
> represent 20Go of data, and I want to index all my files to have a full
> text search.
>
> My first question is :
> - Can we do that or we can index only html and jpeg?

Personally, I would recommend looking at a converter such as pdftotext. There are various options for converting Word and other formats, depending on whether you mind using commercial software (we use one here, but I'm not going to throw around recommendations).

I'm using pdftotext here (the PDFs are produced by a conversion process from a variety of formats, including Word).

-- 
http://www.lost.eu/175db
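
For anyone writing their own indexer rather than relying on omindex, the conversion step recommended above could look roughly like the following Python sketch. Only pdftotext comes from this thread; antiword and xls2csv are assumed stand-ins for the Word and Excel converters, and the names FILTERS, filter_command and extract_text are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed mapping from file extension to a command that writes plain text
# to stdout. "{}" marks where the input path goes. Only pdftotext is
# mentioned in the thread; the other two tools are illustrative choices.
FILTERS = {
    ".pdf": ["pdftotext", "{}", "-"],  # "-" tells pdftotext to write to stdout
    ".doc": ["antiword", "{}"],
    ".xls": ["xls2csv", "{}"],
}


def filter_command(path):
    """Return the argv for the filter that handles `path`, or None."""
    template = FILTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [str(path) if arg == "{}" else arg for arg in template]


def extract_text(path):
    """Run the matching filter and return its output as text.

    Returns None if there is no filter for this extension, the filter
    program is not installed, or the filter exits with an error.
    """
    cmd = filter_command(path)
    if cmd is None or shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, check=False)
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") returns the text of the PDF if pdftotext is installed, and None otherwise, so a crawler over the 200,000 files can simply skip anything it cannot convert.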
On Fri, Mar 16, 2007 at 05:25:26PM +0000, iX Gamerz wrote:
> My first question is :
> - Can we do that or we can index only html and jpeg?

"Indexing jpeg" isn't very easy. There's an optional text comment metadata field which can be easily extracted, but if you have a jpeg image of some text you'll need to run OCR software over it.

If you use "omindex" from Omega for indexing, you can index PDF, doc, and xls (and many other formats) provided you have the appropriate filter programs installed. See the Omega documentation for details.

There aren't any such filters in the Xapian library itself, since good quality filters for most common formats already exist.

> I succeed in using "omindex" function but...
> - How can we use php file to query the quartz database?

I see from your later mail that you've found the PHP examples. Take a look at the "simplesearch" example.

Note that you can use "omindex" to index and your own PHP5 code for searching easily enough. Another approach is to use Omega's "xml" template and just parse the XML search results output in PHP.

Cheers,
Olly
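
As a rough illustration of the "index with omindex, search with your own code" approach, here is a minimal sketch in the spirit of the "simplesearch" example. It uses the Python bindings rather than PHP; the PHP5 bindings expose the same classes with a Xapian prefix (XapianDatabase, XapianEnquire, and so on). Several things here are assumptions rather than facts from this thread: that the xapian Python module is installed, that English stemming suits the data, and that omindex stores each document's data as "field=value" lines (url, sample, and so on). The function names are hypothetical.

```python
def parse_omega_data(data):
    """Split document data into a dict, assuming one "field=value" per line."""
    fields = {}
    for line in data.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without "="
            fields[key] = value
    return fields


def search(db_path, query_string, limit=10):
    """Return up to `limit` (percent, fields) pairs for `query_string`."""
    # Imported here so the helper above stays usable without the bindings.
    import xapian

    db = xapian.Database(db_path)
    enquire = xapian.Enquire(db)

    parser = xapian.QueryParser()
    parser.set_database(db)
    # Assumption: the database was indexed with English stemming.
    parser.set_stemmer(xapian.Stem("english"))
    parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    enquire.set_query(parser.parse_query(query_string))

    hits = []
    for match in enquire.get_mset(0, limit):
        data = match.document.get_data()
        if isinstance(data, bytes):  # newer bindings return bytes
            data = data.decode("utf-8", errors="replace")
        hits.append((match.percent, parse_omega_data(data)))
    return hits
```

A call such as search("/path/to/db", "annual report") would then give a list of relevance percentages paired with the stored fields, which is enough to print a result page with links taken from the url field.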