Hello, I have finished reading the papers, and I think it is time to design my project. The first step will be to determine whether the input characters are Chinese. I saw in a past post that cjk-tokenizer deals only with UTF-8 and Unicode, but I see some other encoding systems such as GBK and Big5. I am wondering: should I just deal with UTF-8 and Unicode?
Hi,

Big5 was designed only for zh_TW, while GBK was designed only for zh_CN. It is better to convert everything to Unicode for segmentation.

For converting from Big5/GBK to UTF-8, iconv can serve the purpose. If you use Perl, you may consider Encode::HanConvert: http://search.cpan.org/dist/Encode-HanConvert/ If you code in C++, you may consider cjk-tokenizer: http://code.google.com/p/cjk-tokenizer/

Language detection at the character level is fairly easy for Chinese: you just need to check the range of the characters. Detection for Japanese is slightly more complicated because Japanese text mixes Kanji, Hiragana, and Katakana, but if you add some predefined rules, it is not so complicated.

Best,
Yung-chung Lin

_______________________________________________
Xapian-discuss mailing list
Xapian-discuss at lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
OK, I understand that now, thanks.

On 2011-04-21 at 5:59 PM, Yung-chung Lin <henearkrxern at gmail.com> wrote: