Hello dear list,

I'm trying to index various types of files with Xapian, used in a Python program. Text and HTML work fine via index_text(), but I can't find any explanation of how to index other types of files.

Is it the case that _everything_ has to be converted to text prior to indexing it? I didn't find a definitive answer to that anywhere on the WWW, on mailing lists, or in the Xapian documentation. (I only found references to e.g. pdf2text and the like.)

I was thinking, from reading Xapian's features page, that it can natively index a vast number of different file types. If I do need to convert everything to text first, that would mean Xapian can - in reality - only work with plain text, which would make it rather useless for my purpose.

Thanks in advance for sharing any insights,
Florian

Florian Beer wrote:
> Hello dear list,
>
> I'm trying to index various types of files with Xapian, used in a
> Python program.
> Text and HTML work fine via index_text() but I can't find any
> explanations for indexing other types of files.
>
> Is it the case that _everything_ has to be converted to text prior to
> indexing it?
> I didn't find a definitive answer to that anywhere on the WWW, some
> mailing lists and the Xapian documentation.
> (I only found references to e.g. pdf2text and the like)

Yes. However, you can do this using the provided application Omega, in particular its omindex program. You can find it on the Xapian website.

Charlie

> I was thinking, from reading Xapian's features page, that it can
> natively index a vast amount of different file types. If I do need to
> convert everything to text first, that would mean Xapian can - in
> reality - only work with plain text, which would make it rather
> useless for my purpose.
>
> Thanks in advance for sharing any insights,
> Florian

I guess this has mostly been covered already, but I think it's worth explicitly addressing the high-level "WHY"...

On Wed, Nov 05, 2008 at 11:44:00AM +0100, Florian Beer wrote:
> I was thinking, from reading Xapian's features page, that it can
> natively index a vast amount of different file types.

The Xapian API doesn't natively support extracting text from any file types. There are already good-quality open source converters for most common formats, so it's not a productive use of our time to duplicate that work.

In general, we resist adding features that aren't "search", particularly if they can already be done using other existing projects. This keeps down the amount of code we need to write and maintain, and allows us to focus our efforts on making Xapian as good as we can at what it does do.

Sometimes there's a good argument for including something - e.g. Unicode support is required inside the QueryParser and TermGenerator classes, and we make this available via the API since we have to maintain the code anyway, and you really want to be using consistent character classifications, etc., which can change slightly between Unicode versions.

> If I do need to convert everything to text first, that would mean
> Xapian can - in reality - only work with plain text, which would make
> it rather useless for my purpose.

To index text from non-plaintext formats, just use a conversion library or utility to extract text, and Xapian for the indexing/searching part.

Cheers,
Olly
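
For the archive, the "convert first, then index" split described above might look roughly like the sketch below. It is only an illustration under some assumptions: the CONVERTERS table and the extract_text() and index_file() helpers are hypothetical names invented for this example, the PDF converter is assumed to be pdftotext from poppler-utils (any tool that writes text to stdout would do), and English stemming is an arbitrary choice.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to an external converter command.
# "{path}" is replaced by the file name; each command must write text to stdout.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # assumes poppler-utils is installed
}


def extract_text(path, converters=CONVERTERS):
    """Return the plain text of `path`, running a converter if one is registered."""
    path = Path(path)
    command = converters.get(path.suffix.lower())
    if command is None:
        # No converter registered: treat the file as plain text.
        return path.read_text(encoding="utf-8", errors="replace")
    argv = [arg.replace("{path}", str(path)) for arg in command]
    result = subprocess.run(argv, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def index_file(db, path):
    """Extract text from `path` and add it to the Xapian database `db`."""
    import xapian  # imported here so extract_text() works without the bindings

    doc = xapian.Document()
    doc.set_data(str(path))  # store the path so search results can show it

    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))  # assumption: English text
    termgen.set_document(doc)
    termgen.index_text(extract_text(path))

    db.add_document(doc)
```

With the Python bindings installed, this would be used along the lines of opening a database with xapian.WritableDatabase("index.db", xapian.DB_CREATE_OR_OPEN) and calling index_file(db, "report.pdf") for each file. Xapian only ever sees the extracted text; supporting another format means adding another entry to the converter table.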