Hello, I have finished reading the papers, and I think it is time to design my project. The first step will be to determine whether the input characters are Chinese. I saw in a past post that cjk-tokenizer deals only with UTF-8 and Unicode, but I see some other encoding systems such as GBK and Big5. I am wondering: should I just deal with UTF-8 and Unicode?
Hi,

Big5 was designed only for zh_TW, while GBK was designed only for zh_CN. It is better to convert everything to Unicode for segmentation.

For converting from Big5/GBK to UTF-8, iconv can serve the purpose. If you use Perl, you may consider Encode::HanConvert: http://search.cpan.org/dist/Encode-HanConvert/ If you code in C++, you may consider cjk-tokenizer: http://code.google.com/p/cjk-tokenizer/

Language detection at the character level is fairly easy for Chinese: you just need to check the range of the characters. Detection for Japanese is slightly more complicated because Japanese text mixes Kanji, Hiragana, and Katakana, but if you add some predefined rules, it is not so complicated.

Best,
Yung-chung Lin

_______________________________________________
Xapian-discuss mailing list
Xapian-discuss at lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
OK, I understand that now, thanks.

On 2011-04-21 at 5:59 PM, Yung-chung Lin <henearkrxern at gmail.com> wrote: