On Tue, Jun 27, 2006 at 01:17:46PM -0400, Alex Deucher wrote:
> Has anyone ever indexed documents of Chinese characters? What's the
> best way to break down the text for indexing. I know context is
> important.
I understand it's possible to algorithmically split a string of Chinese
characters into words to some extent, but that it's somewhat complex and
error-prone.
> My current plan is to index each character and then do
> phrase queries on combinations of characters. Is there a better
> approach?
That could be slow, though it depends on the data. I don't know
enough about Chinese to say if it's likely to be OK or not.
You could try using an n-gram approach - just index adjacent pairs (or
triples, etc) of characters as terms, and perform the same process on
the query. Essentially the same idea as yours really, except indexing
the combinations of characters as terms.
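For concreteness, here's a minimal sketch of the n-gram idea in Python
(the function name and example string are just illustrative):

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams from text as index terms."""
    # Drop whitespace so grams don't span obvious token boundaries.
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

# For a four-character string you get three overlapping bigrams:
print(char_ngrams("中文分词"))  # ['中文', '文分', '分词']
```

You'd run the same function over both documents and queries, then match
the resulting bigram terms (e.g. as a phrase) instead of matching single
characters.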
Cheers,
Olly