thr3ads.net - Xapian discuss - [Xapian-discuss] omindex character sets [Feb 2008]

Homer

2008-Feb-06 09:52 UTC

[Xapian-discuss] omindex character sets

Hi there folks :)

I compiled xapian / omega on a windows box.
the omindex did not work for me, because the indexing seemed to hang, while
indexing text / html. 
the workaround was to use the cygwin omindex ports.

Now the problem:
i was indexing some documents with german special characters inside the document
and inside the path to the document.
when i use the omega search cgi, some special characters in the path to the
document are screwed up.
i think the path is encoded using iso-8859-1 and the main content is encoded
using utf-8.

is this true, or am i just making some beginner's mistakes?
would be nice if someone can tell me how to fix this.

thanks in advance

greets
Homer

Olly Betts

2008-Feb-07 02:09 UTC

[Xapian-discuss] omindex character sets

On Wed, Feb 06, 2008 at 09:52:19AM +0000, Homer wrote:
> I compiled xapian / omega on a windows box.
> the omindex did not work for me, because the indexing seemed to hang, while
> indexing text / html.

Did you build with mingw or MSVC?

Can you find out where it hangs by attaching a debugger, or running
omindex under a debugger?

> Now the problem:
> i was indexing some documents with german special characters inside the
> document and inside the path to the document.
> when i use the omega search cgi, some special characters in the path to the
> document are screwed up.
> i think the path is encoded using iso-8859-1 and the main content is
> encoded using utf-8.
>
> is this true, or am i just making some beginner's mistakes?
> would be nice if someone can tell me how to fix this.

It's a bug, though it's not totally clear to me how best to fix this.

So the URL we use as a link just wants to have top-bit-set characters
"% encoded" (as I believe they already are at display time).

We could just display the URL to the user the same way.  That's a bit
ugly, but it is actually the URL that is being used so it's "honest"
at least.  It would perhaps be nicer to show the URL with these
characters "decoded" though.  It certainly would when the document
doesn't have a title and we use the URL for the title.
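
To make the two forms concrete, here is a minimal Python sketch of the
"% encoded" link form and the "decoded" display form. It is not Omega's
code: the function name and the sample path are made up for illustration,
and it only touches top-bit-set bytes, so it is not a full URL encoder.

```python
from urllib.parse import unquote_to_bytes


def percent_encode_high_bytes(raw: bytes) -> str:
    """Percent-encode every byte >= 0x80 and leave ASCII bytes untouched.

    Hypothetical helper for illustration; reserved ASCII characters such
    as spaces are deliberately not handled here.
    """
    return "".join(
        "%{:02X}".format(b) if b >= 0x80 else chr(b)
        for b in raw
    )


# The same (made-up) path stored in two different filename encodings.
utf8_path = "/docs/Müller.html".encode("utf-8")
latin1_path = "/docs/Müller.html".encode("iso-8859-1")

link_utf8 = percent_encode_high_bytes(utf8_path)      # /docs/M%C3%BCller.html
link_latin1 = percent_encode_high_bytes(latin1_path)  # /docs/M%FCller.html

# Undoing the percent-encoding gives back the original bytes, which still
# have to be interpreted in the right character set before display.
display_utf8 = unquote_to_bytes(link_utf8).decode("utf-8")
```

Both links are valid and point at the right file; the difference only
shows up when the bytes are turned back into text for the user.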

I don't know how MS Windows or Cygwin handles the encoding of filenames.
On Linux you can use the locale as a hint, but that may be incorrect.
In fact different files in the same directory can have different
encodings.

We can tell ISO-8859-1 and UTF-8 apart fairly reliably, by assuming
UTF-8 unless the filename isn't valid UTF-8 in which case we assume
ISO-8859-1.  That works well in practice as the ISO-8859-1 strings it
misinterprets aren't those you'd usually actually encounter.
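
The heuristic above can be sketched in a few lines of Python. This is an
illustration only, not what omindex or Omega actually does, and the
function name is an assumption.

```python
def decode_filename(raw: bytes) -> str:
    """Guess a filename's encoding: UTF-8 if valid, else ISO-8859-1.

    Hypothetical helper illustrating the heuristic; ISO-8859-1 maps every
    byte to a character, so the fallback decode cannot fail.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# A lone 0xFC byte is not valid UTF-8, so it is read as ISO-8859-1.
from_latin1 = decode_filename(b"M\xfcller.html")

# The UTF-8 encoding of the same name decodes directly.
from_utf8 = decode_filename(b"M\xc3\xbcller.html")

# The misinterpreted case: an ISO-8859-1 name that happens to be valid
# UTF-8. The bytes C3 BC would be "Ã¼" in ISO-8859-1 but come out as "ü".
ambiguous = decode_filename(b"\xc3\xbc")
```

The ambiguous case needs a byte sequence like "Ã" followed by a symbol
such as "¼", which is rarely what a real ISO-8859-1 filename contains.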

So perhaps the best answer is to have an OmegaScript command which
performs this transformation.

Cheers,
    Olly
