On Thu, Sep 18, 2008 at 7:00 PM, James Aylett
<james-xapian at tartarus.org> wrote:
> On Thu, Sep 18, 2008 at 04:11:15AM +0100, Olly Betts wrote:
>
>> > How about XML for the output so we can incorporate any additional
>> > meta-data.
>>
>> That's essentially why Recoll's filters convert to HTML. The main
>> issue is that it adds the overhead of the external script converting
>> to XML and then omindex parsing the XML to get back to the plain
>> text.
>
> I'm -1 on XML as an intermediate format, and -2 on HTML. I'm currently
> tending towards the idea that we should initially just implement text,
> since that will solve a lot of problems at a reasonable level (and
> people can still use scriptindex), and then we can think about more
> complex things later. (*Possibly* we could have the filter mechanism
> use any of the internal parsers, meaning if you really wanted to
> convert to HTML and parse that for extra metadata, you could.)
>
I chose a similar approach for Pinot.
There's no mandatory intermediary conversion. Depending on its format,
the output an external filter generates goes either through one of the
built-in filters directly (for basic types plain text, HTML and XML),
or through another filter that can handle that specific format and
return a basic type.
See the configuration file for some examples:
http://svn.berlios.de/wsvn/dijon/trunk/filters/external-filters.xml
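To make the chaining concrete, here is a rough Python sketch of the dispatch described above. It is not Pinot's actual code: the `resolve` function, the `max_hops` limit, the MIME type strings and the demo filters are all made up for illustration. The idea is simply to keep applying external filters until the output is one of the basic types, which a built-in filter can then handle.

```python
from typing import Callable, Dict, List, Tuple

# Basic types handled by built-in filters (MIME strings assumed for illustration).
BASIC_TYPES = {"text/plain", "text/html", "text/xml"}

# Hypothetical table entry: (conversion function, MIME type of its output).
ExternalFilter = Tuple[Callable[[str], str], str]


def resolve(data: str, mime_type: str,
            external_filters: Dict[str, ExternalFilter],
            max_hops: int = 8) -> Tuple[str, str, List[str]]:
    """Apply external filters until the data is in a basic type.

    Returns the converted data, its final type, and the types traversed.
    max_hops is an assumed guard against filters that feed each other forever.
    """
    hops: List[str] = []
    while mime_type not in BASIC_TYPES:
        if mime_type not in external_filters or len(hops) >= max_hops:
            raise ValueError("no filter chain from %s to a basic type" % mime_type)
        convert, output_type = external_filters[mime_type]
        data = convert(data)
        hops.append(mime_type)
        mime_type = output_type
    return data, mime_type, hops


# Demo filters standing in for external programs (purely illustrative).
demo_filters: Dict[str, ExternalFilter] = {
    # First hop: produces a format that still needs another filter.
    "application/x-demo-archive": (lambda d: d.upper(), "application/x-demo-doc"),
    # Second hop: produces HTML, which a built-in filter can parse.
    "application/x-demo-doc": (lambda d: "<html>" + d + "</html>", "text/html"),
}

if __name__ == "__main__":
    print(resolve("abc", "application/x-demo-archive", demo_filters))
```

Input that is already a basic type goes through unchanged with an empty hop list, which corresponds to the "built-in filter directly" case.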
The drawback is that document metadata is not extracted, except for
filters like pdftotext that return some fields as HTML meta tags.
Fabrice