outdream
2019-Mar-09 03:41 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
Thanks for your patience. I'm still unsure what I should do next.

If this isn't worth changing because it's a rare case, I apologise for opening the PR on GitHub before getting a reply; it may need to be closed there.

Otherwise, should I optimise the current code by replacing the set with a static array? Or should I roll back the current modification to cjk-tokenizer and try doing the work in the stemming instead?

Cheers,
outdream
Olly Betts
2019-Mar-11 01:09 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
On Sat, Mar 09, 2019 at 11:41:08AM +0800, outdream wrote:
> Thanks for your patience.
> I'm still confused of what I should do next.
>
> If it's not worth changing anything here as it's a rare case,
[...]

I'm not sure I'm really able to judge this part, as I know very little Chinese...

> Or rollback current modification to cjk-tokenizer and
> try to do some work with the stemming?

...but a way to normalise numbers when indexing and searching seems like it would address the situation noted in the ticket, and would also address the wider problem of searching for numbers across languages as well as within a language which has multiple ways of writing a number. So this seems like a better solution.

While this normalising of numbers is analogous to stemming of words, I don't think the number normalising wants to be done in the stemmers, as it's not directly connected to stemming words in the language.

I'd suggest, at least to start with, just hard-coding the special handling of numbers (there's already some special handling, such as check_infix() vs check_infix_digit()).

It may make sense to abstract out the number normalisation somehow (some sort of separate "number stemmer" maybe?), but if we try to abstract it out to start with it'll take longer to get something working, and it's quite likely we'll find we got the abstraction wrong and have to rework it anyway.

Cheers,
Olly
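To make the idea of hard-coded number normalisation concrete, here is a minimal sketch in C++. It is not Xapian code: the function name normalise_cjk_number() and its interface are hypothetical, and it assumes the input is already decoded to UTF-32. It only covers ASCII digits, 〇 and 一 to 九, and the units 十, 百, 千 and 万, so larger units and traditional or financial numeral forms are left out.

```cpp
#include <string>

// Hypothetical helper, not part of Xapian: converts a run of Arabic digits
// and/or Chinese numerals to its integer value so that different spellings
// of the same number could be indexed under one term.
// Returns false for empty input or for any unsupported character.
bool normalise_cjk_number(const std::u32string& s, unsigned long long& out) {
    if (s.empty()) return false;
    unsigned long long total = 0;    // value already closed off by a 万
    unsigned long long section = 0;  // value built from 十/百/千 so far
    unsigned long long current = 0;  // digits still waiting for a unit
    bool have_digit = false;         // was a digit seen since the last unit?
    for (char32_t ch : s) {
        int digit = -1;
        unsigned long long unit = 0;
        switch (ch) {
            case 0x3007: digit = 0; break;  // 〇
            case 0x4E00: digit = 1; break;  // 一
            case 0x4E8C: digit = 2; break;  // 二
            case 0x4E09: digit = 3; break;  // 三
            case 0x56DB: digit = 4; break;  // 四
            case 0x4E94: digit = 5; break;  // 五
            case 0x516D: digit = 6; break;  // 六
            case 0x4E03: digit = 7; break;  // 七
            case 0x516B: digit = 8; break;  // 八
            case 0x4E5D: digit = 9; break;  // 九
            case 0x5341: unit = 10; break;     // 十
            case 0x767E: unit = 100; break;    // 百
            case 0x5343: unit = 1000; break;   // 千
            case 0x4E07: unit = 10000; break;  // 万
            default:
                if (ch >= U'0' && ch <= U'9') digit = static_cast<int>(ch - U'0');
                else return false;
        }
        if (digit >= 0) {
            // Consecutive digits are read positionally, e.g. "23" or 二三.
            current = current * 10 + static_cast<unsigned long long>(digit);
            have_digit = true;
        } else if (unit == 10000) {
            unsigned long long value = section + current;
            if (value == 0 && !have_digit) value = 1;  // bare 万 means 10000
            total += value * unit;
            section = current = 0;
            have_digit = false;
        } else {
            // A unit with no preceding digit has an implicit 1 (十五 is 15).
            section += (have_digit ? current : 1) * unit;
            current = 0;
            have_digit = false;
        }
    }
    out = total + section + current;
    return true;
}
```

With this sketch, '2千3百', '二千三百' and '2300' all normalise to 2300, so a search for any one of the spellings could match documents using the others. Whether the normalised form should replace the original term or be indexed alongside it is a separate design decision not settled in this thread.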
outdream
2019-Mar-12 07:06 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
Thanks for your considerate suggestion. I think it may be the most suitable approach for the current case.

I plan to fix the issue by adding cases to check_infix() and check_infix_digit(). A mixed number such as '2千3百', which starts with an Arabic digit such as 2, would then be tokenized as one token. With the compiler optimising the 'switch', I think the efficiency should also be sufficient.

If you have more tips for the implementation, or if I have misunderstood anything, please tell me.

Cheers,
outdream
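The plan of keeping a mixed number such as '2千3百' together as one token can be sketched roughly as follows. This is an illustration only and does not reproduce Xapian's check_infix() or check_infix_digit(): the names is_cjk_numeral_char() and mixed_number_length() are hypothetical, the input is assumed to be UTF-32, and the set of numeral characters is a small assumed subset.

```cpp
#include <cstddef>
#include <string>

// Hypothetical classifier: true for the Chinese numeral characters this
// sketch treats as part of a number. A switch over constants lets the
// compiler pick a jump table or comparison tree.
bool is_cjk_numeral_char(char32_t ch) {
    switch (ch) {
        case 0x3007:  // 〇
        case 0x4E00:  // 一
        case 0x4E8C:  // 二
        case 0x4E09:  // 三
        case 0x56DB:  // 四
        case 0x4E94:  // 五
        case 0x516D:  // 六
        case 0x4E03:  // 七
        case 0x516B:  // 八
        case 0x4E5D:  // 九
        case 0x5341:  // 十
        case 0x767E:  // 百
        case 0x5343:  // 千
        case 0x4E07:  // 万
            return true;
        default:
            return false;
    }
}

// Length (in code points) of the number token beginning at `start`.
// Returns 0 unless s[start] is an ASCII digit; after that first digit the
// token extends across further ASCII digits and Chinese numeral characters.
std::size_t mixed_number_length(const std::u32string& s, std::size_t start) {
    if (start >= s.size() || s[start] < U'0' || s[start] > U'9') return 0;
    std::size_t end = start + 1;
    while (end < s.size() &&
           ((s[end] >= U'0' && s[end] <= U'9') || is_cjk_numeral_char(s[end]))) {
        ++end;
    }
    return end - start;
}
```

For the text '2千3百元', calling mixed_number_length() at position 0 gives 4, so '2千3百' forms one token and '元' is left for the normal CJK handling. Numbers that begin with a Chinese numeral are deliberately not matched here, mirroring the case described in the message.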