thr3ads.net - Xapian devel - Ask for advice on exact requirements to fix #699 mixed CJK numbers [Mar 2019]

outdream

2019-Mar-09 03:41 UTC

Ask for advice on exact requirements to fix #699 mixed CJK numbers

Thanks for your patience.
I'm still confused about what I should do next.

If it's not worth changing anything here because it's a rare case,
I apologise for opening the PR on GitHub before getting a reply;
you may want to close it there.

Otherwise, should I optimize the current code by
replacing the set with a static array?
Or roll back the current modification to cjk-tokenizer and
try to do some work with the stemming instead?


Cheers,
outdream

Olly Betts

2019-Mar-11 01:09 UTC

On Sat, Mar 09, 2019 at 11:41:08AM +0800, outdream wrote:
> Thanks for your patience.
> I'm still confused about what I should do next.
>
> If it's not worth changing anything here as it's a rare case,
> [...]

I'm not sure I'm really able to judge this part, as I know very little
Chinese...

> Or roll back the current modification to cjk-tokenizer and
> try to do some work with the stemming instead?

...but a way to normalise numbers when indexing and searching seems like
it would address the situation noted in the ticket, and also address the
wider problem of searching for numbers across languages, as well as
within a language which has multiple ways of writing a number.  So this
seems like a better solution.

While this normalising of numbers is analogous to stemming of words, I
don't think the number normalising wants to be done in the stemmers as
it's not directly connected to stemming words in the language.
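
To make the idea concrete, here is a minimal sketch of number normalisation as a standalone step, kept separate from any stemmer. It is not Xapian code: normalise_number_term() and its helpers are hypothetical names, the input is assumed to be already decoded to code points (std::u32string), and only simple cases are handled (digits, 十/百/千, and 万/亿 in descending order, with no overflow checks).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: value 0-9 of an ASCII or CJK digit, or -1 otherwise.
static int digit_value(char32_t ch) {
    if (ch >= U'0' && ch <= U'9') return static_cast<int>(ch - U'0');
    switch (ch) {
        case U'〇': case U'零': return 0;
        case U'一': return 1;
        case U'二': return 2;
        case U'三': return 3;
        case U'四': return 4;
        case U'五': return 5;
        case U'六': return 6;
        case U'七': return 7;
        case U'八': return 8;
        case U'九': return 9;
    }
    return -1;
}

// Hypothetical helper: multiplier for 十/百/千, or 0 if ch is not one of them.
static std::uint64_t small_unit(char32_t ch) {
    switch (ch) {
        case U'十': return 10;
        case U'百': return 100;
        case U'千': return 1000;
    }
    return 0;
}

// Hypothetical helper: multiplier for 万/亿 (and traditional forms), or 0.
static std::uint64_t large_unit(char32_t ch) {
    switch (ch) {
        case U'万': case U'萬': return 10000;
        case U'亿': case U'億': return 100000000;
    }
    return 0;
}

// Returns the decimal form of a number term such as "2千3百" or "二千三百",
// or an empty string if the term contains a character that is not part of
// a number.
std::string normalise_number_term(const std::u32string& term) {
    if (term.empty()) return std::string();
    std::uint64_t total = 0;    // sections already closed by 万/亿
    std::uint64_t section = 0;  // value built from 十/百/千 in this section
    std::uint64_t pending = 0;  // digits not yet followed by a unit
    bool have_pending = false;
    for (char32_t ch : term) {
        const int d = digit_value(ch);
        const std::uint64_t small = small_unit(ch);
        const std::uint64_t large = large_unit(ch);
        if (d >= 0) {
            pending = pending * 10 + static_cast<std::uint64_t>(d);
            have_pending = true;
        } else if (small != 0) {
            // A bare unit such as the 十 in "十五" counts as 1 x unit.
            section += (have_pending ? pending : 1) * small;
            pending = 0;
            have_pending = false;
        } else if (large != 0) {
            total += (section + pending) * large;
            section = 0;
            pending = 0;
            have_pending = false;
        } else {
            return std::string();  // not a number term
        }
    }
    return std::to_string(total + section + pending);
}
```

If both the indexer and the query parser passed number terms through a function like this, "2千3百", "二千三百" and "2300" would all end up as the same term.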

I'd suggest, at least to start with, just hard-coding the special
handling of numbers (there's already some special handling, such as
check_infix() vs check_infix_digit()).

It may make sense to abstract out the number normalisation somehow (some
sort of separate "number stemmer" maybe?), but if we try to abstract it
out to start with it'll take longer to get something working, and it's
quite likely we'll find we got the abstraction wrong and have to rework
it anyway.

Cheers,
    Olly

outdream

2019-Mar-12 07:06 UTC

Thanks for your considerate suggestion.
I think it may be the most suitable approach for the current case.

I plan to fix the issue by adding cases to check_infix() and
check_infix_digit().
Mixed numbers such as '2千3百', which
start with an Arabic digit such as 2, would then be tokenized as a single token.
And since the compiler can optimize a 'switch' statement,
I think the efficiency should also be sufficient.
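
A rough sketch of that plan, under stated assumptions: is_cjk_number_char() and scan_number_token() are hypothetical names used only for illustration, not the real check_infix_digit() or any other Xapian function, and the text is assumed to be already decoded to code points. A 'switch' over constant code points is something compilers commonly turn into a jump table or a short chain of comparisons, so no set lookup is needed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for CJK characters that may continue a number
// which started with an ASCII digit.
inline bool is_cjk_number_char(char32_t ch) {
    switch (ch) {
        case U'〇': case U'零': case U'一': case U'二': case U'三':
        case U'四': case U'五': case U'六': case U'七': case U'八':
        case U'九': case U'十': case U'百': case U'千': case U'万':
        case U'萬': case U'亿': case U'億':
            return true;
    }
    return false;
}

inline bool is_ascii_digit(char32_t ch) {
    return ch >= U'0' && ch <= U'9';
}

// Returns the index one past the end of the number token starting at
// `start`, or `start` itself if text[start] is not an ASCII digit.
std::size_t scan_number_token(const std::u32string& text, std::size_t start) {
    if (start >= text.size() || !is_ascii_digit(text[start])) return start;
    std::size_t end = start + 1;
    while (end < text.size() &&
           (is_ascii_digit(text[end]) || is_cjk_number_char(text[end]))) {
        ++end;
    }
    return end;
}
```

With the text "共2千3百元", scanning from index 1 stops at index 5, so "2千3百" comes out as one token and "元" is left for the normal CJK handling.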

If you have more tips for the implementation,
or if I have misunderstood anything, please tell me.

Cheers,
outdream

