I tried running omindex on the following file, which is a UTF-8 web page with mixed English and Japanese text:

http://www.mail-archive.com/axis-user-ja@ws.apache.org/msg00058.html

An English query with Omega mostly worked. The only problem was that the summaries in the results were displayed as gibberish - it looked like UTF-8 data rendered against a Latin-1 character set. I suspect this issue is easily fixed by tacking a UTF-8 META tag onto the search interface.

More seriously, Japanese searches didn't seem to work at all. Cutting and pasting a few words into the browser yielded no results. Additionally, the UTF-8 query was escaped into character entity references; e.g. a query for 皆様 got me a blank result page with the query listed as &#30342;&#27096;.

Any comments? I was really surprised, since Omega did so well in an earlier test against a similar UTF-8 document written in Danish. Is this a matter of polish, or are there deeper barriers, like a lack of word splitting capability for languages like Chinese/Japanese/Korean?
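The gibberish described above is the classic symptom of UTF-8 bytes being rendered under a Latin-1 (in practice usually Windows-1252) assumption. A minimal illustration in Python, using the same query text as an example (illustration only, not Omega code):

```python
# Illustration: the UTF-8 bytes for the Japanese query, reinterpreted as
# if they were Latin-1/Windows-1252, come out as mojibake.
text = "\u7686\u69d8"                      # the query 皆様
utf8_bytes = text.encode("utf-8")          # b'\xe7\x9a\x86\xe6\xa7\x98'
print(utf8_bytes.decode("cp1252"))         # -> 'çš†æ§˜' (gibberish)

# The blank result page showed the query as character entity references:
print("".join("&#%d;" % ord(c) for c in text))   # -> '&#30342;&#27096;'
```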
On Wed, Aug 09, 2006 at 11:43:34PM -0700, Jeff Breidenbach wrote:

> I tried running omindex on the following file, which is a
> UTF-8 web page with mixed English and Japanese text.
[...]
> Any comments? I was really surprised, since Omega did so well
> in an earlier test against a similar UTF-8 document written in Danish.
> Is this a matter of polish or are there deeper barriers, like a lack of
> word splitting capability for languages like Chinese/Japanese/Korean?

omindex (and the QueryParser) has somewhat primitive, European-centric word splitting. The tricky bit is actually the query parser: you could either require the user to specify the language they're searching in and set splitting and stemming appropriately (or auto-detect the language), or parse the query in all possible ways (based on which languages exist in your database) and merge the results somehow.

Ultimately it would be nice to support this kind of thing. The first step is UTF-8 support, which Olly has been working on. On top of that we'd need a word splitting algorithm for CJK (and anything else that we can't throw English-like rules at). My understanding is that there isn't a good stemming strategy for CJK, so we'd just disable it there. It's a lot of work to make this sort of thing work automatically.

If anyone knows about word splitting for CJK, that'd be a huge help.

James

--
James Aylett  xapian.org  james@tartarus.org  uncertaintydivision.org
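One dictionary-free approach to the CJK word splitting mentioned above is to treat overlapping character bigrams as "words", so any two-character substring of a query can match what was indexed. A rough sketch of such a splitter (not Xapian code, just an illustration; the character ranges are simplified):

```python
import re

# Rough sketch: split runs of CJK characters into overlapping bigrams.
# The ranges cover hiragana/katakana, common CJK ideographs and hangul;
# a real implementation would handle more Unicode blocks.
CJK_RUN = re.compile(
    r'[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7a3]+'
)

def cjk_bigrams(text):
    terms = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            terms.append(run)
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms

print(cjk_bigrams("皆様こんにちは"))
# -> ['皆様', '様こ', 'こん', 'んに', 'にち', 'ちは']
```

For this to work, the query parser would have to apply the same splitting at search time so that query terms line up with the indexed bigrams.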
On 8/11/06, "Jeff Breidenbach" <breidenbach@gmail.com> wrote:

> > And what about automatic language detection?
> > That would help me also tremendously as I have about 60% English, 20%
> > German, 5% French, 5% Korean, 5% Japanese, and 5% Italian.
>
> Not sure why this is useful, except perhaps for stemming. Even then
> you will be in trouble with mixed-language documents. Seems a little
> outside the scope of Xapian, at least from my newbie perspective.
> Anyway, there are a couple of n-gram based language detectors in open
> source land which work fairly well, but the error rate is noticeable.

I am using libtextcat (http://software.wise-guys.nl/libtextcat/) for Pinot. It's pretty accurate, at least with the few European languages I tried. Korean and Japanese are supported too, apparently.

Fabrice
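For context, an n-gram detector of the kind libtextcat implements ranks the most frequent character n-grams of a text and compares that ranking against per-language profiles (the Cavnar & Trenkle "out-of-place" measure). A toy sketch of the idea, with made-up, far-too-small training samples standing in for real corpus-derived profiles:

```python
from collections import Counter

# Toy sketch of n-gram language guessing. The SAMPLES below are
# hypothetical miniature training texts; real profiles are built from
# large corpora.
def profile(text, n=3, top=400):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt über den faulen hund",
}
PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    query = profile(text)
    def out_of_place(lang):
        ref = PROFILES[lang]
        return sum(abs(rank - ref.get(g, len(ref))) for g, rank in query.items())
    return min(PROFILES, key=out_of_place)

print(guess_language("über den faulen hund"))  # -> 'german'
```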