outdream
2019-Mar-07 16:55 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
I am working on "#699 Better tokenisation of mixed CJK numbers", and have implemented a partial patch for Chinese for this ticket. The current code works well with the special test cases, and all tests in xapian-core still pass. But I'm unclear about the exact requirements: how much performance we can afford to trade for enabling more cases, and whether there are better methods for doing this.

The details below cover the current implementation, potential requirements I have thought of, and my guesses about Google's solution based on search results.

Current Implementation
==

As I am still unclear about the exact requirements, I haven't opened a pull request against the main repository, but have only pushed the code to my own fork:

https://github.com/outdream/xapian/tree/defect699-mixed_Chinese_num

I've also attached the 'git diff' output as an alternative. (If it's impolite to send attachments to the mailing list, please tell me.)

(Sorry for the code misalignment; I was confused by the tab size, and only found the answer in the documentation after pushing to GitHub. As this email has used up my time, I will fix the code in the next commit.) If it's better to create a pull request, please tell me.

(Below is my explanation of the code, in case the code itself is not clear to read.)

The current code only handles the case where a mixed Chinese number is embedded in a run of CJK characters which is sent to CJKNgramIterator. It extracts the whole number as one token instead of as 1-grams. The code was added in operator++ of CJKNgramIterator in cjk-tokenizer.cc, to minimize the modification to existing code and the harm to modularity.

The current implementation passes the test cases below:

> { "", "有2千3百", "2千3百:1 有[1]"},
> { "", "我有2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

The conditions for this to trigger are:
- the number starts with a Latin digit
- there is a CJK character before the first Latin digit, so that the digit is sent to CJKNgramIterator.
The mapping between Unicode and Chinese digits looks like this:

> Chinese 1-5: 一二三四五
> in Unicode: \u4E00\u4E8C\u4E09\u56DB\u4E94

I can't figure out any rule behind the Unicode codepoints of the Chinese digits, and am almost convinced the code-makers didn't consider them as a group :(. So I check whether a character is a Chinese digit against a static set which stores them. This has a cost in performance, so a mixed number is only checked for if it starts with a Latin digit. (If anyone knows the key to the codepoint layout, please tell me, thanks.)

Potential Requirements
==

Below are some test cases I made for which my implementation is invalid. They show potential requirements I have thought of, but left unsupported for performance reasons. I number them and refer to them as ex-1, ex-2, etc. (The first output shown is the result from my current definition and code; the second is what I would expect.)

(1)
> { "", "我有两千3百块钱", "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
> Current output, and expected output, were:
> "3百:1 两[3] 两千:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 有两:1 钱[6]"
> "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"

(2)
> { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
> Current output, and expected output, were:
> "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"
> "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"

(3)
> { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}
> Current output, and expected output, were:
> "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"
> "3万4千零1:1 apples[4] are[2] there[1] "

ex-1 shows the case where a mixed number starts with a Chinese digit. To enable it, my current plan would need to check every CJK character to see whether it is a Chinese digit, and that cost seems unacceptable.

ex-2 and ex-3 show cases where there is a non-CJK character before the first Latin digit, so the digit is eaten by the TermGenerator and never sent to CJKNgramIterator. To enable these cases, in my plan, the mixed numbers would have to be handled in the TermGenerator itself, which would hurt both performance and modularity.
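For reference, the static-set membership test described above could look roughly like this. This is an editor's illustrative sketch, not the actual patch; the function name and the exact codepoint list are assumptions.

```cpp
#include <set>

// Hypothetical sketch of a static-set check for Chinese digit characters.
// The codepoints are the digits 零一二三四五六七八九 and the common
// multipliers 十百千万亿; the list here is illustrative, not exhaustive.
static const std::set<unsigned> chinese_digits = {
    0x96F6, // 零 0
    0x4E00, // 一 1
    0x4E8C, // 二 2
    0x4E09, // 三 3
    0x56DB, // 四 4
    0x4E94, // 五 5
    0x516D, // 六 6
    0x4E03, // 七 7
    0x516B, // 八 8
    0x4E5D, // 九 9
    0x5341, // 十 10
    0x767E, // 百 100
    0x5343, // 千 1000
    0x4E07, // 万 10^4
    0x4EBF, // 亿 10^8
};

static bool is_chinese_digit(unsigned ch) {
    return chinese_digits.find(ch) != chinese_digits.end();
}
```

As the codepoint table shows, the values are scattered across the CJK Unified Ideographs block, which is why a lookup structure is needed rather than a simple range test.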
With these considerations, I'm unsure whether these cases should be supported.

Google's Solution?
==

Trying to come up with a better definition for the interface, I made some guesses based on Google's search results for "2千3百". I suppose they use both the number token and the ngram results as keywords. From the results and the highlighted text in the matched-keywords list, it looks like besides the whole number token, they also add the results of running ngrams over the number token. And I also believe they apply some mapping (or stemming?) to the number, as the transformed keywords '三百' (3百) and '二千' (2千) appear frequently in the highlighted text.

However, with all this I still can't decide what the interface should be. Please give me some advice on the exact requirements and on better methods for solving this.

Cheers,
outdream
outdream
2019-Mar-08 13:29 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
Sorry for the verbose text in my last email... I have created a PR against master. The code partially fixes the problem described in #699: it supports mixed Chinese numbers which get sent to CJKNgramIterator. For example, these test cases pass:

> { "", "有2千3百", "2千3百:1 有[1]"},
> { "", "我有2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

But it doesn't handle mixed numbers whose preceding character is not a CJK character, as the first digit gets eaten by the TermGenerator. For example, these cases fail:

> { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

current output is "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"

> { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}

current output is "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"

I'm not sure whether these failing cases should be supported: in my current plan that would mean modifying the TermGenerator and checking after every Latin digit, and I'm not sure the cost of supporting such unusual cases is worth it. If you have a better method, please give me some tips.

Besides, I'm not sure that taking the whole mixed number as one token is suitable, as users would have to type the whole number to get relevant results. Perhaps we could emit both the whole token and the ngram results during tokenisation. Please share your comments.

(Because of the time difference and my limited English, I might not reply promptly; please bear with me.)

Cheers,
outdream
Olly Betts
2019-Mar-09 00:18 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
On Fri, Mar 08, 2019 at 12:55:48AM +0800, outdream wrote:
> I am working on "#699 Better tokenisation of mixed CJK numbers",
> and have implemented a partial patch of Chinese for this ticket.
> Current code works well with special test cases and
> all tests in xapian-core could still pass.
>
> But I'm confused with exact requirements of the question,
> for how much we could pay with performance on enabling more cases,
> and if there are better methods to do these?

We don't really have exact requirements here. The history of the ticket is that back in 2011 Dai Youli (one of our GSoC students that year) pointed out on IRC that 2千3百 was split into 4 terms, which results in a lot of false matches (3千2百 is the obvious one, but also documents which have those terms nowhere near each other). I just noted that comment so it didn't just get forgotten.

So really we just have a note that we should perhaps handle this case "better".

It does seem that such mixed Latin and Chinese numbers aren't very common - we've not had any other feedback about them in the last 8 years, and you said on IRC that you'd rarely seen them. So possibly the resolution for this ticket is to conclude that it's not worth changing anything here.

We recently merged a segmentation option for CJK which uses ICU (http://site.icu-project.org/). I tweaked the code for testing so that ICU gets passed 2千3百, and it seems ICU currently splits this case into 4 words too, while 二千三百 is split into 二千 and 三百.

> As The mapping between Unicode and Chinese digits just likes such:
> > Chinese 1-5: 一二三四五
> > in Unicode: \u4E00\u4E8C\u4E09\u56DB\u4E94
>
> I can't figure out the rules of Unicode of Chinese digits,
> and almost believe that the code-makers didn't consider it :(.

I think sometimes the codepoint allocations match the order in an existing encoding, but I'm not sure that's the case here.
It would certainly be more logical for the digits to be consecutive codepoints, as they are in ASCII.

> So I check if a character is Chinese digits with a static set stores them.

I think we probably want to avoid a static std::set for such a check - it's likely to need to be initialised at runtime (at least with current compilers). Given the list of characters is known at compile-time we can probably build some sort of fixed mapping for checking these, e.g. a sorted static const unsigned int array can be searched using std::binary_search() in O(log(n)), which is the same O() as std::set gives.

> (1)
> > { "", "我有两千3百块钱", "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
> > Expected output and expect to be equal, were:
> > "3百:1 两[3] 两千:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 有两:1 钱[6]"
> > "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"
>
> (2)
> > { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
> > Expected output and expect to be equal, were:
> > "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"
> > "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"
>
> (3)
> > { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}
> > Expected output and expect to be equal, were:
> > "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"
> > "3万4千零1:1 apples[4] are[2] there[1] "
>
> ex-1 shows the case mixed number starts with a Chinese digit,
> to enable it, my current plan needs to check every CJK char if
> it is a Chinese digit, and the cost seems unacceptable.
>
> ex-2 and ex-3 show the cases there is non-CJK-character before
> the first Latin digit, so it would be eaten by the TermGenerator,
> so the Latin digit won't be sent to CJKNgramIterator.
> To enable these cases, in my plan, the mixed numbers would be needed
> to solved in the TermGenerator. However, this would both affect the
> performance and modularity.
> With these considerations, I'm confused about if these cases
> should be supported.

I think the questions to ask are whether these cases occur in practice, and how well they work without special handling vs with special handling.

If we can come up with a set of "use cases" of numbers which seems we should be able to handle better then we can think about what we can achieve to improve things, and how to implement that cleanly and efficiently.

> I make some suspects based on the search results of "2千3百" from google.
>
> I suppose they use both the number token
> and ngram results as keywords.
> From the result and the highlighted text,
> in the searched keywords list,
> maybe besides the whole number token in the list,
> they also add result from ngram of the number token.
>
> And I also believe they do mapping (or stemming?) to the number,
> as transformed keyword '三百'(3百) and '二千'(2千) appears in the
> highlighted text frequently.

I'm not sure exactly what they're doing. But I think a plausible approach along those sort of lines would be to aim to normalise numbers written in different scripts to a single form, so 2千3百, 二千三百, and 2300 would all be indexed as the same term (and eventually so would 2300 represented in Arabic and other scripts), so the user could search for any of these and find documents using the others. Sort of like how terms from English text are case-folded and stemmed.

Cheers,
Olly
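As an illustration of the normalisation idea described above (an editor's sketch, not Xapian code; all names here are hypothetical), a minimal evaluator could map 2千3百, 二千三百 and 2300 to the same numeric value, which could then be indexed as a single canonical term. It handles only the simple digit/multiplier patterns from this thread's examples, not compound multipliers such as 二十三万.

```cpp
#include <string>

// Value of a single digit character (Latin or Chinese), or -1.
static long digit_value(char32_t ch) {
    if (ch >= U'0' && ch <= U'9') return ch - U'0';
    switch (ch) {
        case U'零': return 0; case U'一': return 1; case U'二': return 2;
        case U'三': return 3; case U'四': return 4; case U'五': return 5;
        case U'六': return 6; case U'七': return 7; case U'八': return 8;
        case U'九': return 9;
        default: return -1;
    }
}

// Value of a multiplier character, or -1.
static long multiplier_value(char32_t ch) {
    switch (ch) {
        case U'十': return 10;    case U'百': return 100;
        case U'千': return 1000;  case U'万': return 10000;
        default: return -1;
    }
}

// Evaluate a mixed Latin/Chinese number, e.g. U"2千3百" -> 2300.
// Digits accumulate; a multiplier flushes the accumulator into the total.
static long normalise_number(const std::u32string& s) {
    long total = 0, current = 0;
    for (char32_t ch : s) {
        long d = digit_value(ch);
        if (d >= 0) {
            current = current * 10 + d;
            continue;
        }
        long m = multiplier_value(ch);
        if (m >= 0) {
            total += (current ? current : 1) * m;
            current = 0;
        }
    }
    return total + current;
}
```

With this, 2千3百, 二千三百 and 2300 all evaluate to 2300, and 3万4千零1 evaluates to 34001, so all the spellings could collapse to one indexed term, much as stemming collapses English word forms.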
outdream
2019-Mar-09 03:41 UTC
Ask for advice on exact requirements to fix #699 mixed CJK numbers
Thanks for your patience. I'm still unsure what I should do next.

If it's not worth changing anything here because it's a rare case, sorry for opening the PR on GitHub before your reply; perhaps you'll need to close it there.

Otherwise, should I optimise the current code by replacing the set with a static array? Or roll back the current modification to cjk-tokenizer and try to do some work on the normalisation (stemming) instead?

Cheers,
outdream
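The static-array replacement discussed in this thread could be sketched as follows. This is an illustrative example only (the function name and the exact codepoint list are assumptions): a sorted compile-time array searched with std::binary_search, avoiding the runtime initialisation a static std::set would need.

```cpp
#include <algorithm>
#include <iterator>

// Chinese digit and multiplier codepoints, sorted ascending so that
// std::binary_search can be used. This list is illustrative.
static const unsigned CHINESE_DIGITS[] = {
    0x4E00, // 一
    0x4E03, // 七
    0x4E07, // 万
    0x4E09, // 三
    0x4E5D, // 九
    0x4E8C, // 二
    0x4E94, // 五
    0x4EBF, // 亿
    0x516B, // 八
    0x516D, // 六
    0x5341, // 十
    0x5343, // 千
    0x56DB, // 四
    0x767E, // 百
    0x96F6, // 零
};

// O(log n) membership test with no runtime construction cost.
static bool is_chinese_digit(unsigned ch) {
    return std::binary_search(std::begin(CHINESE_DIGITS),
                              std::end(CHINESE_DIGITS), ch);
}
```

Note that the array must stay sorted by codepoint, not in the "natural" digit order 一二三..., since the codepoints themselves are not consecutive.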