I tried running omindex on the following file, which is a UTF-8 web page with mixed English and Japanese text:

http://www.mail-archive.com/axis-user-ja@ws.apache.org/msg00058.html

An English query with Omega mostly worked. The only problem was that the summaries in the results were displayed as gibberish - it looked like UTF-8 data rendered against a Latin-1 character set. I suspect this issue is easily fixed by tacking a UTF-8 META tag onto the search interface.

More seriously, Japanese searches didn't seem to work at all. Cutting and pasting a few words into the browser yielded no results. Additionally, the UTF-8 query was escaped into character entity references; e.g. a query for 皆様 got me a blank result page with the query listed as &#30342;&#27096;.

Any comments? I was really surprised, since Omega did so well in an earlier test against a similar UTF-8 document written in Danish. Is this a matter of polish, or are there deeper barriers, like a lack of word splitting capability for languages like Chinese/Japanese/Korean?
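The gibberish described above is the classic symptom of UTF-8 bytes being rendered under a Latin-1 (in practice usually Windows-1252) assumption. A minimal illustration in Python, using the same query text as an example (illustration only, not Omega code):

```python
# Illustration: the UTF-8 bytes for the Japanese query, reinterpreted as
# if they were Latin-1/Windows-1252, come out as mojibake.
text = "\u7686\u69d8"                      # the query 皆様
utf8_bytes = text.encode("utf-8")          # b'\xe7\x9a\x86\xe6\xa7\x98'
print(utf8_bytes.decode("cp1252"))         # -> 'çš†æ§˜' (gibberish)

# The blank result page showed the query as character entity references:
print("".join("&#%d;" % ord(c) for c in text))   # -> '&#30342;&#27096;'
```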
On Wed, Aug 09, 2006 at 11:43:34PM -0700, Jeff Breidenbach wrote:

> I tried running omindex on the following file, which is a
> UTF-8 web page with mixed English and Japanese text.
[...]
> Any comments? I was really surprised, since Omega did so well
> in an earlier test against a similar UTF-8 document written in Danish.
> Is this a matter of polish or are there deeper barriers, like a lack of
> word splitting capability for languages like Chinese/Japanese/Korean?

omindex (and the QueryParser) has somewhat primitive, European-centric word splitting. The tricky bit is actually the query parser: you could either require the user to specify the language they're searching in and set splitting and stemming appropriately (or auto-detect the language), or parse the query in all possible ways (based on which languages exist in your database) and merge the results somehow.

Ultimately it would be nice to support this kind of thing. The first step is UTF-8 support, which Olly has been working on. On top of that we'd need a word splitting algorithm for CJK (and anything else that we can't throw English-like rules at). My understanding is that there isn't a good stemming strategy for CJK, so we'd just disable it there. It's a lot of work to make this sort of thing work automatically.

If anyone knows about word splitting for CJK, that'd be a huge help.

James

--
James Aylett  xapian.org  james@tartarus.org  uncertaintydivision.org
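One dictionary-free approach to the CJK word splitting mentioned above is to treat overlapping character bigrams as "words", so any two-character substring of a query can match what was indexed. A rough sketch of such a splitter (not Xapian code, just an illustration; the character ranges are simplified):

```python
import re

# Rough sketch: split runs of CJK characters into overlapping bigrams.
# The ranges cover hiragana/katakana, common CJK ideographs and hangul;
# a real implementation would handle more Unicode blocks.
CJK_RUN = re.compile(
    r'[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7a3]+'
)

def cjk_bigrams(text):
    terms = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            terms.append(run)
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms

print(cjk_bigrams("皆様こんにちは"))
# -> ['皆様', '様こ', 'こん', 'んに', 'にち', 'ちは']
```

For this to work, the query parser would have to apply the same splitting at search time so that query terms line up with the indexed bigrams.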
On 8/11/06, "Jeff Breidenbach" <breidenbach@gmail.com> wrote:

> > And what about automatic language detection?
> > That would help me also tremendously as I have about 60% English, 20%
> > German, 5% French, 5% Korean, 5% Japanese, and 5% Italian.
>
> Not sure why this is useful, except perhaps for stemming. Even then
> you will be in trouble with mixed-language documents. Seems a little
> outside the scope of Xapian, at least from my newbie perspective.
> Anyway, there are a couple of n-gram based language detectors in open
> source land which work fairly well, but the error rate is noticeable.

I am using libtextcat (http://software.wise-guys.nl/libtextcat/) for Pinot. It's pretty accurate, at least with the few European languages I tried. Korean and Japanese are supported too, apparently.

Fabrice
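For context, an n-gram detector of the kind libtextcat implements ranks the most frequent character n-grams of a text and compares that ranking against per-language profiles (the Cavnar & Trenkle "out-of-place" measure). A toy sketch of the idea, with made-up, far-too-small training samples standing in for real corpus-derived profiles:

```python
from collections import Counter

# Toy sketch of n-gram language guessing. The SAMPLES below are
# hypothetical miniature training texts; real profiles are built from
# large corpora.
def profile(text, n=3, top=400):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt über den faulen hund",
}
PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    query = profile(text)
    def out_of_place(lang):
        ref = PROFILES[lang]
        return sum(abs(rank - ref.get(g, len(ref))) for g, rank in query.items())
    return min(PROFILES, key=out_of_place)

print(guess_language("über den faulen hund"))  # -> 'german'
```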