thr3ads.net - Xapian discuss - [Xapian-discuss] Indexing PDF, DOC etc. [Nov 2008]

Florian Beer

2008-Nov-05 10:44 UTC

[Xapian-discuss] Indexing PDF, DOC etc.

Hello dear list,

I'm trying to index various types of files with Xapian, used in a  
Python program.
Text and HTML work fine via index_text() but I can't find any  
explanations for indexing other types of files.

Is it the case that _everything_ has to be converted to text prior to
indexing it?
I couldn't find a definitive answer to that anywhere on the web, on
mailing lists, or in the Xapian documentation.
(I only found references to e.g. pdf2text and the like.)

I was thinking, from reading Xapian's features page, that it can
natively index a vast number of different file types. If I do need to
convert everything to text first, that would mean Xapian can - in
reality - only work with plain text, which would make it rather
useless for my purpose.

Thanks in advance for sharing any insights,
Florian

Charlie Hull

2008-Nov-05 10:52 UTC

[Xapian-discuss] Indexing PDF, DOC etc.

Florian Beer wrote:
> Hello dear list,
> 
> I'm trying to index various types of files with Xapian, used in a  
> Python program.
> Text and HTML work fine via index_text() but I can't find any  
> explanations for indexing other types of files.
> 
> Is it the case that _everything_ has to be converted to text prior to
> indexing it?
> I didn't find a definitive answer to that anywhere on the WWW, some  
> mailing lists and the Xapian documentation.
> (I only found references to e.g. pdf2text and the like)

Yes. However, you can do this using Omega, the application provided
alongside Xapian, and in particular its omindex program. You can find
this on the Xapian website.

Charlie

>
> I was thinking, from reading Xapian's features page, that it can
> natively index a vast amount of different file types. If I do need to  
> convert everything to text first, that would mean Xapian can - in  
> reality - only work with plain text, which would make it rather  
> useless for my purpose.
> 
> Thanks in advance for sharing any insights,
> Florian

Olly Betts

2008-Nov-09 14:50 UTC

[Xapian-discuss] Indexing PDF, DOC etc.

I guess this has mostly been covered already, but I think it's worth
explicitly addressing the high-level "WHY"...

On Wed, Nov 05, 2008 at 11:44:00AM +0100, Florian Beer wrote:
> I was thinking, from reading Xapian's features page, that it can
> natively index a vast amount of different file types.

The Xapian API doesn't natively support extracting text from any
filetypes.  There are already good quality open source converters for
most common formats, so it's not a productive use of our time to
duplicate that work.

In general, we resist adding features that aren't "search", particularly
if they can already be done using other existing projects.  This keeps
down the amount of code we need to write and maintain, and allows us to
focus our efforts on making Xapian as good as we can at what it does do.

Sometimes there's a good argument for including something.  For example,
Unicode support is required inside the QueryParser and TermGenerator
classes, and we make it available via the API since we have to maintain
the code anyway.  You really want to be using consistent character
classifications and the like, and these can change slightly between
Unicode versions.

> If I do need to convert everything to text first, that would mean
> Xapian can - in reality - only work with plain text, which would make
> it rather useless for my purpose.

To index text from non-plaintext formats, just use a conversion library
or utility to extract text, and Xapian for the indexing/searching part.

Cheers,
    Olly
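
The approach Olly describes (an external converter for text extraction,
Xapian for indexing) might be sketched in Python roughly as follows. This
is only an illustration under stated assumptions, not how omindex itself
is implemented: it assumes the pdftotext and antiword command-line tools
are installed, the helper names converter_command, extract_text and
index_file are made up for this example, and any extension without a
listed converter is simply read as plain text.

```python
import subprocess
from pathlib import Path

# Assumed external converters; each one writes the extracted text to stdout.
# Swap in whatever tools are actually installed on your system.
CONVERTERS = {
    ".pdf": lambda p: ["pdftotext", "-enc", "UTF-8", p, "-"],
    ".doc": lambda p: ["antiword", "-m", "UTF-8.txt", p],
}


def converter_command(path):
    """Return the converter argv for `path`, or None if no converter is listed."""
    build = CONVERTERS.get(Path(path).suffix.lower())
    return build(str(path)) if build else None


def extract_text(path):
    """Return the text of `path`, running an external converter if one is listed."""
    command = converter_command(path)
    if command is None:
        # Assumption: anything without a converter is treated as plain text.
        return Path(path).read_text(encoding="utf-8", errors="replace")
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def index_file(db, path):
    """Extract text from `path` and add it to a Xapian WritableDatabase."""
    # Imported here so the helpers above can be used without the bindings.
    import xapian

    doc = xapian.Document()
    doc.set_data(str(path))  # keep the path so search results can point back to it

    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))
    termgen.set_document(doc)
    termgen.index_text(extract_text(path))

    db.add_document(doc)
```

With the Xapian Python bindings installed, this could be used along the
lines of db = xapian.WritableDatabase("index.db", xapian.DB_CREATE_OR_OPEN)
followed by index_file(db, "report.pdf"). Xapian itself only ever sees the
extracted text, which is the division of labour described above.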
