outdream
2019-Mar-07 16:55 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
I am working on "#699 Better tokenisation of mixed CJK numbers" and have implemented a partial patch covering Chinese. The current code works with my special test cases, and all existing tests in xapian-core still pass.

However, I'm unclear about the exact requirements of the ticket: how much performance we can afford to spend on supporting more cases, and whether there are better ways to do this.

Below are details of the current implementation, the potential requirements I have thought of, and my guesses about Google's approach based on its search results.

Current Implementation
==

As I am still unclear about the exact requirements, I haven't opened a pull request against the main repository yet. I have only pushed the code to my own fork:

> https://github.com/outdream/xapian/tree/defect699-mixed_Chinese_num

I have also attached the 'git diff' output as an alternative. (If it's impolite to add attachments on the mailing list, please tell me.)

(Sorry about the misaligned code: I was confused about the tab size and only found the answer in the documentation after pushing to GitHub. Writing this email has used up my time, so I will fix the alignment in the next commit.)

If it would be better to create a pull request, please tell me.

(Below is an explanation of the code, in case it is not clear to read.)

The current code only supports mixed Chinese numbers that are embedded in a run of CJK characters sent to CJKNgramIterator. It extracts the whole number as one token instead of splitting it into 1-grams. The code was added to operator++ of CJKNgramIterator in cjk-tokenizer.cc, to minimise changes to existing code and avoid harming modularity.

The current implementation passes the test cases below:

> { "", "有2千3百", "2千3百:1 有[1]"},
> { "", "我有2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

The conditions for this handling to apply are:
- the number must start with a Latin digit;
- there must be a CJK character before the first Latin digit, so that the digit is sent to CJKNgramIterator.

The Unicode code points of the Chinese digits look like this:

> Chinese 1-5: 一二三四五
> in Unicode: \u4E00\u4E8C\u4E09\u56DB\u4E94

I can't find any pattern in the code points of the Chinese digits, and almost believe that the people who assigned them didn't consider this :(. So I check whether a character is a Chinese digit using a static set that stores them. This affects performance, so a mixed number is only checked for when it starts with a Latin digit. (If anyone knows the rule behind these code points, please tell me, thanks.)

Potential Requirements
==

Below are some test cases that my implementation does not handle. They show potential requirements I have thought of but left unsupported for performance reasons. I have numbered them and refer to them as ex-1, ex-2 and ex-3. (The output shown is what my current code produces.)

(1)
> { "", "我有两千3百块钱", "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

> Expected output and expect to be equal, were:
> "3百:1 两[3] 两千:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 有两:1 钱[6]"
> "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"

(2)
> { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

> Expected output and expect to be equal, were:
> "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"
> "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"

(3)
> { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}

> Expected output and expect to be equal, were:
> "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"
> "3万4千零1:1 apples[4] are[2] there[1] "

ex-1 shows a mixed number that starts with a Chinese digit. To support it, my current plan would need to check whether every CJK character is a Chinese digit, and that cost seems unacceptable.

ex-2 and ex-3 show cases where a non-CJK character precedes the first Latin digit. The digit is then consumed by the TermGenerator and never sent to CJKNgramIterator. To support these cases, in my plan, mixed numbers would have to be handled in the TermGenerator. However, this would affect both performance and modularity.

With these considerations, I'm not sure whether these cases should be supported.

Google's Solution?
==

To help define the interface better, I made some guesses based on Google's search results for "2千3百". I suppose they use both the number token and its n-gram results as keywords.

From the results and the highlighted text, it seems that, besides the whole number token, the list of search keywords also includes the n-grams of the number token.

I also believe they map (or stem?) the number, as the transformed keywords '三百' (3百) and '二千' (2千) frequently appear in the highlighted text.

However, even with all this, I still can't decide what this interface should look like. Please give me some advice on the exact requirements and on better methods for solving the problem.

Cheers,
outdream

-------------- next part --------------
A non-text attachment was scrubbed...
Name: diff_mixed_CJK_numbers.diff
Type: application/octet-stream
Size: 3413 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20190308/dc21b48e/attachment.obj>
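The token extraction described in this message can be illustrated with a minimal sketch. It is not the code from the fork or the attached diff: the function names, the list of numeral characters, and the use of std::u32string instead of Xapian's UTF-8 iteration are all assumptions made for illustration. It also treats a run of Latin digits alone as a match, which the patch may or may not do.

```cpp
#include <cassert>   // for the usage checks in the note below
#include <cstddef>
#include <string>

// Hypothetical helper: true for the Chinese numeral characters this sketch
// knows about. The list is illustrative, not the set used in the patch.
bool is_chinese_numeral(char32_t ch) {
    switch (ch) {
    case U'\u96F6': // 零
    case U'\u4E00': // 一
    case U'\u4E8C': // 二
    case U'\u4E24': // 两
    case U'\u4E09': // 三
    case U'\u56DB': // 四
    case U'\u4E94': // 五
    case U'\u516D': // 六
    case U'\u4E03': // 七
    case U'\u516B': // 八
    case U'\u4E5D': // 九
    case U'\u5341': // 十
    case U'\u767E': // 百
    case U'\u5343': // 千
    case U'\u4E07': // 万
    case U'\u4EBF': // 亿
        return true;
    default:
        return false;
    }
}

bool is_latin_digit(char32_t ch) {
    return ch >= U'0' && ch <= U'9';
}

// Length (in code points) of the mixed number starting at text[pos].
// Returns 0 unless text[pos] is a Latin digit, mirroring the condition
// "the number must start with a Latin digit".
std::size_t mixed_number_length(const std::u32string& text, std::size_t pos) {
    if (pos >= text.size() || !is_latin_digit(text[pos]))
        return 0;
    std::size_t end = pos + 1;
    while (end < text.size() &&
           (is_latin_digit(text[end]) || is_chinese_numeral(text[end]))) {
        ++end;
    }
    return end - pos;
}
```

With the decoded text 我有2千3百块钱, calling mixed_number_length at index 2 gives 4, so 2千3百 can be emitted as one token. With 我有两千3百块钱 (ex-1) the call at index 2 gives 0 because 两 is not a Latin digit, which is why only 3百 is picked up later in that case.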
outdream
2019-Mar-08 13:29 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
Sorry for the verbose text in my last email...

I have created a PR against master. The code partially fixes the problem described in #699: it supports mixed Chinese numbers that are sent to CJKNgramIterator. For example, these test cases pass:

> { "", "有2千3百", "2千3百:1 有[1]"},
> { "", "我有2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

But it doesn't handle mixed numbers whose preceding character is not a CJK character, as the first digit is consumed by the TermGenerator. For example, the cases below fail:

> { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

The current output is "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"

> { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}

The current output is "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"

I'm not sure whether these failing cases should be supported: in my current plan, that would require modifying the TermGenerator to check after every Latin digit. I'm not sure the cost of supporting these unusual cases is worth it. If you have a better way to solve this, please give me some tips.

Besides, I'm not sure that taking the whole mixed number as one token is suitable, as users would have to enter the whole number to get relevant results. I think we could emit both the whole token and the n-gram results during tokenisation.

I'd appreciate your comments.

(Because of the time difference and my limited English, I might not reply promptly; please forgive me.)

Cheers,
outdream
Olly Betts
2019-Mar-09 00:18 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
On Fri, Mar 08, 2019 at 12:55:48AM +0800, outdream wrote:
> I am working on "#699 Better tokenisation of mixed CJK numbers",
> and have implemented a partial patch of Chinese for this ticket.
> Current code works well with special test cases and
> all tests in xapian-core could still pass.
>
> But I'm confused with exact requirements of the question,
> for how much we could pay with performance on enabling more cases,
> and if there are better methods to do these?

We don't really have exact requirements here.

The history of the ticket is that back in 2011 Dai Youli (one of our GSoC students that year) pointed out on IRC that 2千3百 was split into 4 terms, which results in a lot of false matches (3千2百 is the obvious one, but also documents which have those terms nowhere near each other). I just noted that comment so it didn't just get forgotten.

So really we just have a note that we should perhaps handle this case "better".

It does seem that such mixed Latin and Chinese numbers aren't very common - we've not had any other feedback about them in the last 8 years, and you said on IRC that you'd rarely seen them. So possibly the resolution for this ticket is to conclude that it's not worth changing anything here.

We recently merged a segmentation option for CJK which uses ICU (http://site.icu-project.org/). I tweaked the code for testing so that ICU gets passed 2千3百, and it seems ICU currently splits this case into 4 words too, while 二千三百 is split into 二千 and 三百.

> As The mapping between Unicode and Chinese digits just likes such:
> > Chinese 1-5: 一二三四五
> > in Unicode: \u4E00\u4E8C\u4E09\u56DB\u4E94
>
> I can't figure out the rules of Unicode of Chinese digits,
> and almost believe that the code-makers didn't consider it :(.

I think sometimes the codepoint allocations match the order in an existing encoding, but I'm not sure that's the case here. It would certainly be more logical for the digits to be consecutive codepoints, as they are in ASCII.

> So I check if a character is Chinese digits with a static set stores them.

I think we probably want to avoid a static std::set for such a check - it's likely to need to be initialised at runtime (at least with current compilers). Given the list of characters is known at compile-time we can probably build some sort of fixed mapping for checking these, e.g. a sorted static const unsigned int array can be searched using std::binary_search() in O(log(n)), which is the same O() as std::set gives.

> (1)
> > { "", "我有两千3百块钱", "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
>
> > Expected output and expect to be equal, were:
> > "3百:1 两[3] 两千:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 有两:1 钱[6]"
> > "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"
>
> (2)
> > { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
>
> > Expected output and expect to be equal, were:
> > "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"
> > "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"
>
> (3)
> > { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}
>
> > Expected output and expect to be equal, were:
> > "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"
> > "3万4千零1:1 apples[4] are[2] there[1] "
>
> ex-1 shows the case mixed number starts with a Chinese digit,
> to enable it, my current plan needs to check every CJK char if
> it is a Chinese digit, and the cost seems unacceptable.
>
> ex-2 and ex-3 show the cases there is non-CJK-character before
> the first Latin digit, so it would be eaten by the TermGenerator,
> so the Latin digit won't be sent to CJKNgramIterator.
> To enable these cases, in my plan, the mixed numbers would be needed
> to solved in the TermGenerator. However, this would both affect the
> performance and modularity.
>
> With these considerations, I'm confused about if these cases
> should be supported.

I think the questions to ask are whether these cases occur in practice, and how well they work without special handling vs with special handling.

If we can come up with a set of "use cases" of numbers which it seems we should be able to handle better then we can think about what we can achieve to improve things, and how to implement that cleanly and efficiently.

> I make some suspects based on the search results of "2千3百" from google.
>
> I suppose they use both the number token
> and ngram results as keywords.
> From the result and the highlighted text,
> in the searched keywords list,
> maybe besides the whole number token in the list,
> they also add result from ngram of the number token.
>
> And I also believe they do mapping (or stemming?) to the number,
> as transformed keyword '三百'(3百) and '二千'(2千) appears in the
> highlighted text frequently.

I'm not sure exactly what they're doing.

But I think a plausible approach along those sort of lines would be to aim to normalise numbers written in different scripts to a single form, so 2千3百, 二千三百, and 2300 would all be indexed as the same term (and eventually so would 2300 represented in Arabic and other scripts), so the user could search for any of these and find documents using the others. Sort of like how terms from English text are case-folded and stemmed.

Cheers,
    Olly
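As a rough illustration of the normalisation idea at the end of this reply, the sketch below converts a number written with Latin digits, Chinese digits, or a mix of both into one decimal string. Nothing here comes from Xapian: the function names and the set of supported characters are assumptions, the input is taken as already-decoded code points, and forms such as 万亿 or values that overflow 64 bits are not handled.

```cpp
#include <cassert>   // for the usage checks in the note below
#include <string>

// Value of a Chinese digit character (零 to 九, plus 两 for 2), or -1.
int chinese_digit_value(char32_t ch) {
    switch (ch) {
    case U'\u96F6': return 0; // 零
    case U'\u4E00': return 1; // 一
    case U'\u4E8C': return 2; // 二
    case U'\u4E24': return 2; // 两
    case U'\u4E09': return 3; // 三
    case U'\u56DB': return 4; // 四
    case U'\u4E94': return 5; // 五
    case U'\u516D': return 6; // 六
    case U'\u4E03': return 7; // 七
    case U'\u516B': return 8; // 八
    case U'\u4E5D': return 9; // 九
    default: return -1;
    }
}

// Multiplier for a Chinese unit character, or 0 if ch is not one.
unsigned long long chinese_unit_value(char32_t ch) {
    switch (ch) {
    case U'\u5341': return 10ULL;        // 十
    case U'\u767E': return 100ULL;       // 百
    case U'\u5343': return 1000ULL;      // 千
    case U'\u4E07': return 10000ULL;     // 万
    case U'\u4EBF': return 100000000ULL; // 亿
    default: return 0;
    }
}

// Hypothetical normaliser: returns the decimal form of `text`, or an empty
// string if `text` is empty or contains a character that is not a digit or
// a unit known to this sketch.
std::string normalised_number_term(const std::u32string& text) {
    if (text.empty())
        return std::string();
    unsigned long long total = 0;   // completed 万/亿 groups
    unsigned long long section = 0; // value below 万 in the current group
    unsigned long long current = 0; // digits not yet followed by a unit
    bool have_digit = false;
    for (char32_t ch : text) {
        int digit = (ch >= U'0' && ch <= U'9')
                        ? static_cast<int>(ch - U'0')
                        : chinese_digit_value(ch);
        if (digit >= 0) {
            current = current * 10 + static_cast<unsigned long long>(digit);
            have_digit = true;
            continue;
        }
        unsigned long long unit = chinese_unit_value(ch);
        if (unit == 0)
            return std::string();
        if (unit < 10000) {
            // A bare 十 means 1 x 10.
            section += (have_digit ? current : 1) * unit;
        } else {
            total += (section + current) * unit;
            section = 0;
        }
        current = 0;
        have_digit = false;
    }
    return std::to_string(total + section + current);
}
```

Under these assumptions 2千3百, 二千三百 and 2300 all map to "2300", and 3万4千零1 maps to "34001", so the three spellings would produce the same index term. Whether the original spelling should also be indexed alongside the normalised one is left open here.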
outdream
2019-Mar-09 03:41 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
Thanks for your patience.

I'm still not sure what I should do next.

If it's not worth changing anything here because it's a rare case, then sorry for opening the PR on GitHub before your reply; you may need to close it.

Otherwise, should I optimise the current code by replacing the set with a static array? Or should I roll back the current changes to cjk-tokenizer and try to do some work on the stemming instead?

Cheers,
outdream
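For reference, the static-array option raised in this thread could look something like the sketch below. It is not taken from the PR: the array name, the function name and the exact list of characters are assumptions. The table is a plain constant array, so it needs no runtime construction, and it must stay sorted for std::binary_search to be valid.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>

// Code points of Chinese numeral characters, sorted in ascending order.
// The selection is illustrative and may differ from the patch's set.
static const unsigned int chinese_numerals[] = {
    0x4E00, // 一
    0x4E03, // 七
    0x4E07, // 万
    0x4E09, // 三
    0x4E24, // 两
    0x4E5D, // 九
    0x4E8C, // 二
    0x4E94, // 五
    0x4EBF, // 亿
    0x516B, // 八
    0x516D, // 六
    0x5341, // 十
    0x5343, // 千
    0x56DB, // 四
    0x767E, // 百
    0x96F6, // 零
};

// Hypothetical replacement for a std::set lookup: O(log n) search over the
// fixed table above.
bool is_chinese_numeral_bsearch(unsigned int ch) {
    // Debug-only sanity check; compiled out when NDEBUG is defined.
    assert(std::is_sorted(std::begin(chinese_numerals),
                          std::end(chinese_numerals)));
    return std::binary_search(std::begin(chinese_numerals),
                              std::end(chinese_numerals), ch);
}
```

For example, is_chinese_numeral_bsearch(0x5343) (千) is true, while is_chinese_numeral_bsearch(0x6709) (有) and is_chinese_numeral_bsearch('2') are false.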