On Wed, Mar 28, 2012 at 09:51:50PM -0400, Julia Wilson wrote:
> I know Japanese - I'm not a native speaker by any means, but I'm pretty
> good - and I'm really interested in the specifics of how dealing with
> Japanese text differs from dealing with English and other languages that
> use roman scripts. Currently I'm doing some research on improving Japanese
> to English translation algorithms using linguistic data (topicalization
> and pronoun resolution, specifically).
It sounds like your knowledge of Japanese wouldn't be an issue.
> The suggested project description mentions switching to a language-specific
> segmentation algorithm; since a particular algorithm isn't specified I'm
> guessing that the evaluation and selection of an algorithm would
> necessarily be part of the project? I've worked with Japanese text mostly
> in Python and so I've used TinySegmenter in Python
> <https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py>
Wow, that certainly lives up to its name. How effective is it?
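(For anyone following along, my rough understanding of how it's used - a
sketch assuming the Python port from the link above is importable as
"tinysegmenter" and exposes tokenize(); I haven't tried it myself:)

    # A minimal sketch, assuming the tinysegmenter module from the link
    # above; the exact method name may vary between ports.
    from tinysegmenter import TinySegmenter

    segmenter = TinySegmenter()
    # Split a Japanese sentence into word-like tokens.
    tokens = segmenter.tokenize(u'私の名前は中野です')
    print(' | '.join(tokens))
    # -> 私 | の | 名前 | は | 中野 | です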
> when I needed segmentation; I've run across a couple of other options
> that were available in C++ but I haven't really had occasion to use them.
> So basically, I've dealt with Japanese segmenters before, but am
> interested to know if you had anything specific in mind.
We don't have anything particularly in mind, though mecab has been
mentioned both last year and this:
http://code.google.com/p/mecab/
Evaluation and selection could certainly be part of the project.
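For comparison, a rough sketch of what mecab looks like from Python
(assuming the mecab-python bindings and a Japanese dictionary are
installed - untested here):

    # Sketch using the MeCab Python bindings (mecab-python), assuming a
    # dictionary such as IPADIC is installed.
    import MeCab

    # "-Owakati" requests wakati-gaki output: the input split into words
    # separated by spaces, without part-of-speech annotations.
    tagger = MeCab.Tagger('-Owakati')
    print(tagger.parse(u'私の名前は中野です').strip())
    # -> 私 の 名前 は 中野 です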
There's a plan to try to get to the point where we can relicense
Xapian (probably as MIT/X), so segmentation libraries with a liberal
licence (such as MIT/X or new BSD) or perhaps LGPL would be better.
> I've been looking at the code in xapian-core/languages and
> xapian-core/queryparser to get an idea of how other languages are
> implemented and used. Am I correct in assuming that the ultimate goal is
> to deprecate the n-gram based tokenizer in favor of individual ways of
> handling Japanese and Korean (and presumably Chinese)? That is, is the
> idea to make language support entirely language-specific, or would there
> still be kind of a generic "CJK Stuff" class as well? Also, is there
> anything else beyond what I mentioned that I should make a point of
> looking at in the code base to better understand what would need to be
> done?
It probably would make sense to deprecate the n-gram approach once we
have support for segmenters for all the languages which need it.
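To make the contrast concrete, the n-gram approach indexes overlapping
character n-grams instead of words, roughly like this (a simplified
sketch of the general idea, not Xapian's actual code):

    # Simplified sketch: index each character plus each overlapping pair,
    # so searches can match without knowing real word boundaries.
    def cjk_ngrams(text):
        grams = []
        for i, ch in enumerate(text):
            grams.append(ch)                    # unigram
            if i + 1 < len(text):
                grams.append(ch + text[i + 1])  # overlapping bigram
        return grams

    print(cjk_ngrams(u'日本語'))
    # -> ['日', '日本', '本', '本語', '語']

A proper segmenter would instead produce the actual words (here just
['日本語']), giving fewer and more meaningful terms in the index.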
There was a project implementing a Chinese segmentation algorithm in
GSoC last year, which isn't merged yet, and also a patch to support a
third party segmenter, which is why there's no corresponding "Chinese"
idea this year - we need to consolidate what we already have there
first, really.
I understand segmentation is also relevant to old-style Vietnamese,
but modern Vietnamese uses a Latin alphabet:
http://en.wikipedia.org/wiki/Vietnamese_language
I don't know if there are any other languages where segmentation is
relevant.
Cheers,
Olly