Hi,

I am looking for a Chinese, Japanese and Korean tokenizer that can be
used to tokenize terms for CJK languages. I am not very familiar with
these languages, but I understand that a single symbol can contain one
or more words, which makes it more difficult to tokenize text into
searchable terms.

Lucene has a CJK Tokenizer ... and I am looking around to see if there
is some open source that we could use with Xapian.

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html

Cheers
-Kevin Duraj
On Tue, Jun 05, 2007 at 02:37:27PM -0700, Kevin Duraj wrote:
> I am looking for a Chinese, Japanese and Korean tokenizer that can be
> used to tokenize terms for CJK languages. I am not very familiar with
> these languages, but I understand that a single symbol can contain one
> or more words, which makes it more difficult to tokenize text into
> searchable terms.

I've not investigated Japanese much or Korean at all, but I know a
little about Chinese. Chinese "characters" are themselves words, but
many words are formed from multiple characters. For example, the
Chinese capital Beijing is formed from two characters (which literally
mean something like "North Capital").

The difficulty is that Chinese text is usually written without any
indication of how the symbols group, so you need an algorithm to
identify the groups if you want to index them as terms. I understand
that's quite a hard problem.

However, perhaps you don't need to do that. You could just index each
symbol as a word and use phrase searching, or something like it.

> Lucene has a CJK Tokenizer ... and I am looking around to see if there
> is some open source that we could use with Xapian.
>
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html

That doesn't provide much information, but if you can find the source
code, you could analyse the algorithm used and, if it's any good,
implement it for use with Xapian.

Cheers,
    Olly
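To make "index each symbol as a word and use phrase searching" concrete,
here is a minimal sketch against the Xapian C++ API; the database path,
the sample text and the single-block CJK test are illustrative
assumptions rather than anything prescribed above:

// Minimal sketch: index each CJK character as a term, then phrase-search.
// Assumptions: Xapian is installed; "cjkdb", the sample text and the
// single-block CJK test below are placeholders.
#include <xapian.h>
#include <cstddef>
#include <string>
#include <vector>

// Deliberately simplistic: only the CJK Unified Ideographs block.
static bool is_cjk(unsigned ch) { return ch >= 0x4E00 && ch <= 0x9FFF; }

// Byte length of a UTF-8 sequence, from its lead byte (no validation).
static size_t utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

// Decode the UTF-8 sequence of length len starting at s[i] (no validation).
static unsigned utf8_decode(const std::string &s, size_t i, size_t len) {
    static const unsigned mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    unsigned ch = (unsigned char)s[i] & mask[len];
    for (size_t k = 1; k < len; ++k)
        ch = (ch << 6) | ((unsigned char)s[i + k] & 0x3F);
    return ch;
}

int main() {
    Xapian::WritableDatabase db("cjkdb", Xapian::DB_CREATE_OR_OPEN);

    // Indexing: one term per CJK character, with positions so that
    // phrase searches can require the characters to be adjacent.
    std::string text = "..."; // UTF-8 document text goes here
    Xapian::Document doc;
    Xapian::termpos pos = 0;
    for (size_t i = 0; i < text.size(); ) {
        size_t len = utf8_len((unsigned char)text[i]);
        unsigned ch = utf8_decode(text, i, len);
        if (is_cjk(ch))
            doc.add_posting(text.substr(i, len), pos++);
        i += len;
    }
    db.add_document(doc);

    // Searching: a multi-character word becomes a phrase of its
    // characters, e.g. the two characters of "Beijing".
    std::vector<std::string> chars;
    chars.push_back("\xe5\x8c\x97"); // 北
    chars.push_back("\xe4\xba\xac"); // 京
    Xapian::Query query(Xapian::Query::OP_PHRASE, chars.begin(), chars.end());
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    Xapian::MSet matches = enquire.get_mset(0, 10);
    return 0;
}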
☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. "Kaspar" or "xern")
2007-Jun-29 03:15 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
A ready-to-use bigram CJK tokenizer is attached to this mail. Enjoy it.
Thanks.

Best,
Yung-chung Lin

On 6/6/07, Kevin Duraj <kevin.softdev@gmail.com> wrote:
> Hi,
>
> I am looking for a Chinese, Japanese and Korean tokenizer that can be
> used to tokenize terms for CJK languages. [...]
>
> Cheers
> -Kevin Duraj

-------------- next part --------------
#ifndef __TOKENIZER_H__
#define __TOKENIZER_H__

#include <cstddef>   // size_t
#include <string>
#include <vector>
#include <unicode.h> // libunicode: unicode_char_t

namespace cjk {

// TOKENIZER_DEFAULT is the bigram tokenization described above;
// TOKENIZER_UNIGRAM emits one token per character.
enum tokenizer_type {
    TOKENIZER_DEFAULT,
    TOKENIZER_UNIGRAM
};

class tokenizer {
private:
    enum tokenizer_type _type;
    inline void _convert_unicode_to_char(unicode_char_t &uchar,
                                         unsigned char *p);

public:
    tokenizer();
    tokenizer(enum tokenizer_type type);
    ~tokenizer();

    // Tokenize str/buf into terms, appending them to token_list.
    void tokenize(std::string &str, std::vector<std::string> &token_list);
    void tokenize(char *buf, size_t buf_len,
                  std::vector<std::string> &token_list);

    // Split str/buf into pieces, appending them to token_list.
    void split(std::string &str, std::vector<std::string> &token_list);
    void split(char *buf, size_t buf_len,
               std::vector<std::string> &token_list);
};

} // namespace cjk

#endif
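As a minimal sketch of how this interface could feed a Xapian index
(the database path and sample text are placeholders; only the
tokenize() signature declared above is relied on):

// Sketch: feed the attached tokenizer's output into a Xapian document.
// Assumptions: "tokenizer.h" is the attached header built alongside this
// file; "cjkdb" and the sample text are placeholders.
#include <xapian.h>
#include <string>
#include <vector>
#include "tokenizer.h"

int main() {
    Xapian::WritableDatabase db("cjkdb", Xapian::DB_CREATE_OR_OPEN);

    std::string text = "..."; // UTF-8 CJK text to index
    std::vector<std::string> tokens;
    cjk::tokenizer tknzr;     // default mode: bigram tokens
    tknzr.tokenize(text, tokens);

    // Add each token with a position so phrase/NEAR queries keep working.
    Xapian::Document doc;
    Xapian::termpos pos = 0;
    for (std::vector<std::string>::const_iterator it = tokens.begin();
         it != tokens.end(); ++it) {
        doc.add_posting(*it, pos++);
    }
    db.add_document(doc);
    return 0;
}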
☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. "Kaspar" or "xern")
2007-Jun-29 03:20 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
Ah, forgot one point. The tokenizer is dependent on libunicode.

Best,
Yung-chung Lin

On 6/29/07, ☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. Kaspar or xern)
<henearkrxern@gmail.com> wrote:
> A ready-to-use bigram CJK tokenizer is attached to this mail. Enjoy it.
> Thanks.
>
> Best,
> Yung-chung Lin
[...]
☼ 林永忠 ☼ (Yung-chung Lin)
2007-Jul-05 03:30 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
Hi,

I have altered the source code so that the tokenizer can deal with
n-gram CJK tokenization now. Please go to
http://code.google.com/p/cjk-tokenizer/

Thank you.

Best,
Yung-chung Lin

On 6/6/07, Kevin Duraj <kevin.softdev@gmail.com> wrote:
> I am looking for a Chinese, Japanese and Korean tokenizer that can be
> used to tokenize terms for CJK languages. [...]
>
> Lucene has a CJK Tokenizer ... and I am looking around to see if there
> is some open source that we could use with Xapian.
[...]
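For anyone wondering what n-gram tokenization produces, here is a small
self-contained sketch of the general technique (not the cjk-tokenizer
source itself), building overlapping n-grams from already-split
characters:

// Illustration of overlapping character n-grams; a sketch of the general
// technique only, not the cjk-tokenizer implementation.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Concatenate every run of n consecutive characters into one token.
static std::vector<std::string> ngrams(const std::vector<std::string> &chars,
                                       size_t n) {
    std::vector<std::string> out;
    for (size_t i = 0; i + n <= chars.size(); ++i) {
        std::string tok;
        for (size_t k = 0; k < n; ++k) tok += chars[i + k];
        out.push_back(tok);
    }
    return out;
}

int main() {
    // The four characters of 北京大学 ("Peking University").
    std::vector<std::string> chars;
    chars.push_back("北"); chars.push_back("京");
    chars.push_back("大"); chars.push_back("学");

    // Bigrams: 北京, 京大, 大学.
    std::vector<std::string> bi = ngrams(chars, 2);
    for (size_t i = 0; i < bi.size(); ++i) std::cout << bi[i] << "\n";
    return 0;
}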
On Thu, Jul 05, 2007 at 10:30:10AM +0800, ☼ 林永忠 ☼ (Yung-chung Lin) wrote:
> I have altered the source code so that the tokenizer can deal with
> n-gram CJK tokenization now.
> Please go to http://code.google.com/p/cjk-tokenizer/

I have a question - if I read the code correctly, it treats Unicode
code points 0x4000 to 0x9fff as CJK characters, but that seems to omit
quite a lot of CJK characters - 0x2E80-0x3fff (with a few exceptions),
and 0xf900-0xfaff:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Basic_Multilingual_Plane

Are the omitted characters not relevant here, or is this an oversight?

Also the Supplementary Ideographic Plane is ignored, but those are
described as seldom used, so I can understand why.

Cheers,
    Olly
☼ 林永忠 ☼ (Yung-chung Lin)
2007-Jul-05 04:03 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
You are correct. I forgot to include all CJK characters. I will do it
in the next revision. The macros were used in a previous project that
was thrown together in a short time. I will fix this.

Best,
Yung-chung Lin

On 7/5/07, Olly Betts <olly@survex.com> wrote:
> I have a question - if I read the code correctly, it treats Unicode
> code points 0x4000 to 0x9fff as CJK characters, but that seems to omit
> quite a lot of CJK characters [...]
>
> Are the omitted characters not relevant here, or is this an oversight?
[...]
☼ 林永忠 ☼ (Yung-chung Lin)
2007-Jul-05 10:29 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
The code has been updated. Now cjk-tokenizer treats the following
characters as CJK characters. I will consider using some flags to
ignore unwanted character blocks during tokenization.

// 2E80..2EFF; CJK Radicals Supplement
// 3000..303F; CJK Symbols and Punctuation
// 3040..309F; Hiragana
// 30A0..30FF; Katakana
// 3100..312F; Bopomofo
// 3130..318F; Hangul Compatibility Jamo
// 3190..319F; Kanbun
// 31A0..31BF; Bopomofo Extended
// 31C0..31EF; CJK Strokes
// 31F0..31FF; Katakana Phonetic Extensions
// 3200..32FF; Enclosed CJK Letters and Months
// 3300..33FF; CJK Compatibility
// 3400..4DBF; CJK Unified Ideographs Extension A
// 4DC0..4DFF; Yijing Hexagram Symbols
// 4E00..9FFF; CJK Unified Ideographs
// A700..A71F; Modifier Tone Letters
// AC00..D7AF; Hangul Syllables
// F900..FAFF; CJK Compatibility Ideographs
// FE30..FE4F; CJK Compatibility Forms
// FF00..FFEF; Halfwidth and Fullwidth Forms
// 20000..2A6DF; CJK Unified Ideographs Extension B
// 2F800..2FA1F; CJK Compatibility Ideographs Supplement

Best,
Yung-chung Lin

On 7/5/07, ☼ 林永忠 ☼ (Yung-chung Lin) <henearkrxern@gmail.com> wrote:
> You are correct. I forgot to include all CJK characters. I will do it
> in the next revision. The macros were used in a previous project that
> was thrown together in a short time. I will fix this.
[...]
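As a rough sketch, that block list translates into a code-point range
check along these lines (illustrative only; the cjk-tokenizer source is
the reference, and this assumes a decoded Unicode code point rather than
raw UTF-8 bytes or UTF-16 units):

// Sketch of a predicate mirroring the block list above; the actual
// cjk-tokenizer source is authoritative.
#include <iostream>

static bool is_cjk_codepoint(unsigned ch) {
    return (ch >= 0x2E80 && ch <= 0x2EFF)    // CJK Radicals Supplement
        || (ch >= 0x3000 && ch <= 0x9FFF)    // the 3000..9FFF blocks above
                                             // are contiguous (punctuation,
                                             // kana, bopomofo, jamo, ...,
                                             // CJK Unified Ideographs)
        || (ch >= 0xA700 && ch <= 0xA71F)    // Modifier Tone Letters
        || (ch >= 0xAC00 && ch <= 0xD7AF)    // Hangul Syllables
        || (ch >= 0xF900 && ch <= 0xFAFF)    // CJK Compatibility Ideographs
        || (ch >= 0xFE30 && ch <= 0xFE4F)    // CJK Compatibility Forms
        || (ch >= 0xFF00 && ch <= 0xFFEF)    // Halfwidth and Fullwidth Forms
        || (ch >= 0x20000 && ch <= 0x2A6DF)  // CJK Unified Ideographs Ext. B
        || (ch >= 0x2F800 && ch <= 0x2FA1F); // CJK Compat. Ideographs Suppl.
}

int main() {
    std::cout << is_cjk_codepoint(0x5317) << "\n"; // 1: 北 (U+5317)
    std::cout << is_cjk_codepoint(0x0041) << "\n"; // 0: Latin 'A'
    return 0;
}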