Hi,

I am looking for a Chinese, Japanese and Korean tokenizer that can be
used to tokenize terms for CJK languages. I am not very familiar with
these languages, but I understand that a single symbol can contain one
or more words, which makes it more difficult to tokenize text into
searchable terms.

Lucene has a CJK Tokenizer ... and I am looking around to see if there
is some open source that we could use with Xapian.

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html

Cheers
-Kevin Duraj
On Tue, Jun 05, 2007 at 02:37:27PM -0700, Kevin Duraj wrote:
> I am looking for a Chinese, Japanese and Korean tokenizer that can be
> used to tokenize terms for CJK languages. I am not very familiar with
> these languages, but I understand that a single symbol can contain one
> or more words, which makes it more difficult to tokenize text into
> searchable terms.

I've not investigated Japanese much or Korean at all, but I know a
little about Chinese. Chinese "characters" are themselves words, but
many words are formed from multiple characters. For example, the
Chinese capital Beijing is formed from two characters (which literally
mean something like "North Capital").

The difficulty is that Chinese text is usually written without any
indication of how the symbols group, so you need an algorithm to
identify the groups if you want to index them as terms. I understand
that's quite a hard problem.

However, perhaps you don't need to do that. You could just index each
symbol as a word and use phrase searching, or something like it.

> Lucene has a CJK Tokenizer ... and I am looking around to see if there
> is some open source that we could use with Xapian.
>
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html

That doesn't provide much information, but if you can find the source
code, you could analyse the algorithm used and, if it's any good,
implement it for use with Xapian.

Cheers,
    Olly
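To make "index each symbol as a word and use phrase searching" concrete,
here is a minimal sketch against the Xapian C++ API; the database path,
the sample text and the single-block CJK test are illustrative
assumptions rather than anything prescribed above:

// Minimal sketch: index each CJK character as a term, then phrase-search.
// Assumptions: Xapian is installed; "cjkdb", the sample text and the
// single-block CJK test below are placeholders.
#include <xapian.h>
#include <cstddef>
#include <string>
#include <vector>

// Deliberately simplistic: only the CJK Unified Ideographs block.
static bool is_cjk(unsigned ch) { return ch >= 0x4E00 && ch <= 0x9FFF; }

// Byte length of a UTF-8 sequence, from its lead byte (no validation).
static size_t utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

// Decode the UTF-8 sequence of length len starting at s[i] (no validation).
static unsigned utf8_decode(const std::string &s, size_t i, size_t len) {
    static const unsigned mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    unsigned ch = (unsigned char)s[i] & mask[len];
    for (size_t k = 1; k < len; ++k)
        ch = (ch << 6) | ((unsigned char)s[i + k] & 0x3F);
    return ch;
}

int main() {
    Xapian::WritableDatabase db("cjkdb", Xapian::DB_CREATE_OR_OPEN);

    // Indexing: one term per CJK character, with positions so that
    // phrase searches can require the characters to be adjacent.
    std::string text = "..."; // UTF-8 document text goes here
    Xapian::Document doc;
    Xapian::termpos pos = 0;
    for (size_t i = 0; i < text.size(); ) {
        size_t len = utf8_len((unsigned char)text[i]);
        unsigned ch = utf8_decode(text, i, len);
        if (is_cjk(ch))
            doc.add_posting(text.substr(i, len), pos++);
        i += len;
    }
    db.add_document(doc);

    // Searching: a multi-character word becomes a phrase of its
    // characters, e.g. the two characters of "Beijing".
    std::vector<std::string> chars;
    chars.push_back("\xe5\x8c\x97"); // 北
    chars.push_back("\xe4\xba\xac"); // 京
    Xapian::Query query(Xapian::Query::OP_PHRASE, chars.begin(), chars.end());
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    Xapian::MSet matches = enquire.get_mset(0, 10);
    return 0;
}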
☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. "Kaspar" or "xern")
2007-Jun-29 03:15 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
A ready-to-use bigram CJK tokenizer is attached to this mail. Enjoy it.
Thanks.

Best,
Yung-chung Lin

On 6/6/07, Kevin Duraj <kevin.softdev@gmail.com> wrote:
> Hi,
>
> I am looking for a Chinese, Japanese and Korean tokenizer that can be
> used to tokenize terms for CJK languages. [...]
>
> Cheers
> -Kevin Duraj

-------------- next part --------------
#ifndef __TOKENIZER_H__
#define __TOKENIZER_H__

#include <cstddef>   // size_t
#include <string>
#include <vector>
#include <unicode.h> // libunicode: unicode_char_t

namespace cjk {

// TOKENIZER_DEFAULT is the bigram tokenization described above;
// TOKENIZER_UNIGRAM emits one token per character.
enum tokenizer_type {
    TOKENIZER_DEFAULT,
    TOKENIZER_UNIGRAM
};

class tokenizer {
private:
    enum tokenizer_type _type;
    inline void _convert_unicode_to_char(unicode_char_t &uchar,
                                         unsigned char *p);

public:
    tokenizer();
    tokenizer(enum tokenizer_type type);
    ~tokenizer();

    // Tokenize str/buf into terms, appending them to token_list.
    void tokenize(std::string &str, std::vector<std::string> &token_list);
    void tokenize(char *buf, size_t buf_len,
                  std::vector<std::string> &token_list);

    // Split str/buf into pieces, appending them to token_list.
    void split(std::string &str, std::vector<std::string> &token_list);
    void split(char *buf, size_t buf_len,
               std::vector<std::string> &token_list);
};

} // namespace cjk

#endif
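As a minimal sketch of how this interface could feed a Xapian index
(the database path and sample text are placeholders; only the
tokenize() signature declared above is relied on):

// Sketch: feed the attached tokenizer's output into a Xapian document.
// Assumptions: "tokenizer.h" is the attached header built alongside this
// file; "cjkdb" and the sample text are placeholders.
#include <xapian.h>
#include <string>
#include <vector>
#include "tokenizer.h"

int main() {
    Xapian::WritableDatabase db("cjkdb", Xapian::DB_CREATE_OR_OPEN);

    std::string text = "..."; // UTF-8 CJK text to index
    std::vector<std::string> tokens;
    cjk::tokenizer tknzr;     // default mode: bigram tokens
    tknzr.tokenize(text, tokens);

    // Add each token with a position so phrase/NEAR queries keep working.
    Xapian::Document doc;
    Xapian::termpos pos = 0;
    for (std::vector<std::string>::const_iterator it = tokens.begin();
         it != tokens.end(); ++it) {
        doc.add_posting(*it, pos++);
    }
    db.add_document(doc);
    return 0;
}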
☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. "Kaspar" or "xern")
2007-Jun-29 03:20 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
Ah, forgot one point. The tokenizer is dependent on libunicode.

Best,
Yung-chung Lin

On 6/29/07, ☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. Kaspar or xern)
<henearkrxern@gmail.com> wrote:
> A ready-to-use bigram CJK tokenizer is attached to this mail. Enjoy it.
> Thanks.
>
> Best,
> Yung-chung Lin
[...]
☼ 林永忠 ☼ (Yung-chung Lin)
2007-Jul-05 03:30 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
Hi,

I have altered the source code so that the tokenizer can deal with
n-gram CJK tokenization now. Please go to
http://code.google.com/p/cjk-tokenizer/

Thank you.

Best,
Yung-chung Lin

On 6/6/07, Kevin Duraj <kevin.softdev@gmail.com> wrote:
> I am looking for a Chinese, Japanese and Korean tokenizer that can be
> used to tokenize terms for CJK languages. [...]
>
> Lucene has a CJK Tokenizer ... and I am looking around to see if there
> is some open source that we could use with Xapian.
[...]
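For anyone wondering what n-gram tokenization produces, here is a small
self-contained sketch of the general technique (not the cjk-tokenizer
source itself), building overlapping n-grams from already-split
characters:

// Illustration of overlapping character n-grams; a sketch of the general
// technique only, not the cjk-tokenizer implementation.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Concatenate every run of n consecutive characters into one token.
static std::vector<std::string> ngrams(const std::vector<std::string> &chars,
                                       size_t n) {
    std::vector<std::string> out;
    for (size_t i = 0; i + n <= chars.size(); ++i) {
        std::string tok;
        for (size_t k = 0; k < n; ++k) tok += chars[i + k];
        out.push_back(tok);
    }
    return out;
}

int main() {
    // The four characters of 北京大学 ("Peking University").
    std::vector<std::string> chars;
    chars.push_back("北"); chars.push_back("京");
    chars.push_back("大"); chars.push_back("学");

    // Bigrams: 北京, 京大, 大学.
    std::vector<std::string> bi = ngrams(chars, 2);
    for (size_t i = 0; i < bi.size(); ++i) std::cout << bi[i] << "\n";
    return 0;
}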
On Thu, Jul 05, 2007 at 10:30:10AM +0800, ☼ 林永忠 ☼ (Yung-chung Lin) wrote:
> I have altered the source code so that the tokenizer can deal with
> n-gram CJK tokenization now.
> Please go to http://code.google.com/p/cjk-tokenizer/

I have a question - if I read the code correctly, it treats Unicode
code points 0x4000 to 0x9fff as CJK characters, but that seems to omit
quite a lot of CJK characters - 0x2E80-0x3fff (with a few exceptions),
and 0xf900-0xfaff:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Basic_Multilingual_Plane

Are the omitted characters not relevant here, or is this an oversight?

Also the Supplementary Ideographic Plane is ignored, but those are
described as seldom used, so I can understand why.

Cheers,
    Olly
☼ 林永忠 ☼ (Yung-chung Lin)
2007-Jul-05 04:03 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
You are correct. I forgot to include all CJK characters. I will do it
in the next revision. The macros were used in a previous project that
was thrown together in a short time. I will fix this.

Best,
Yung-chung Lin

On 7/5/07, Olly Betts <olly@survex.com> wrote:
> I have a question - if I read the code correctly, it treats Unicode
> code points 0x4000 to 0x9fff as CJK characters, but that seems to omit
> quite a lot of CJK characters [...]
>
> Are the omitted characters not relevant here, or is this an oversight?
[...]
☼ 林永忠 ☼ (Yung-chung Lin)
2007-Jul-05 10:29 UTC
[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
The code has been updated. Now cjk-tokenizer treats the following
characters as CJK characters. I will consider using some flags to
ignore unwanted character blocks during tokenization.

// 2E80..2EFF; CJK Radicals Supplement
// 3000..303F; CJK Symbols and Punctuation
// 3040..309F; Hiragana
// 30A0..30FF; Katakana
// 3100..312F; Bopomofo
// 3130..318F; Hangul Compatibility Jamo
// 3190..319F; Kanbun
// 31A0..31BF; Bopomofo Extended
// 31C0..31EF; CJK Strokes
// 31F0..31FF; Katakana Phonetic Extensions
// 3200..32FF; Enclosed CJK Letters and Months
// 3300..33FF; CJK Compatibility
// 3400..4DBF; CJK Unified Ideographs Extension A
// 4DC0..4DFF; Yijing Hexagram Symbols
// 4E00..9FFF; CJK Unified Ideographs
// A700..A71F; Modifier Tone Letters
// AC00..D7AF; Hangul Syllables
// F900..FAFF; CJK Compatibility Ideographs
// FE30..FE4F; CJK Compatibility Forms
// FF00..FFEF; Halfwidth and Fullwidth Forms
// 20000..2A6DF; CJK Unified Ideographs Extension B
// 2F800..2FA1F; CJK Compatibility Ideographs Supplement

Best,
Yung-chung Lin

On 7/5/07, ☼ 林永忠 ☼ (Yung-chung Lin) <henearkrxern@gmail.com> wrote:
> You are correct. I forgot to include all CJK characters. I will do it
> in the next revision. The macros were used in a previous project that
> was thrown together in a short time. I will fix this.
[...]
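As a rough sketch, that block list translates into a code-point range
check along these lines (illustrative only; the cjk-tokenizer source is
the reference, and this assumes a decoded Unicode code point rather than
raw UTF-8 bytes or UTF-16 units):

// Sketch of a predicate mirroring the block list above; the actual
// cjk-tokenizer source is authoritative.
#include <iostream>

static bool is_cjk_codepoint(unsigned ch) {
    return (ch >= 0x2E80 && ch <= 0x2EFF)    // CJK Radicals Supplement
        || (ch >= 0x3000 && ch <= 0x9FFF)    // the 3000..9FFF blocks above
                                             // are contiguous (punctuation,
                                             // kana, bopomofo, jamo, ...,
                                             // CJK Unified Ideographs)
        || (ch >= 0xA700 && ch <= 0xA71F)    // Modifier Tone Letters
        || (ch >= 0xAC00 && ch <= 0xD7AF)    // Hangul Syllables
        || (ch >= 0xF900 && ch <= 0xFAFF)    // CJK Compatibility Ideographs
        || (ch >= 0xFE30 && ch <= 0xFE4F)    // CJK Compatibility Forms
        || (ch >= 0xFF00 && ch <= 0xFFEF)    // Halfwidth and Fullwidth Forms
        || (ch >= 0x20000 && ch <= 0x2A6DF)  // CJK Unified Ideographs Ext. B
        || (ch >= 0x2F800 && ch <= 0x2FA1F); // CJK Compat. Ideographs Suppl.
}

int main() {
    std::cout << is_cjk_codepoint(0x5317) << "\n"; // 1: 北 (U+5317)
    std::cout << is_cjk_codepoint(0x0041) << "\n"; // 0: Latin 'A'
    return 0;
}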