thr3ads.net - similar to: "patch - Some CJK codepoints are also punctuation"

Displaying 20 results from an estimated 3000 matches similar to: "patch - Some CJK codepoints are also punctuation"

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

2024 Jan 08

On Sun, Jan 7, 2024, at 7:45 PM, Olly Betts wrote: > I've restarted trac. I now created a pull request: https://github.com/xapian/xapian/pull/329 Should I create a trac issue, too? > Assuming the latter is valid, just removing this block (or removing the > parts of it which are Lu or Ll) should fix the problem as then > tokenisation will switch mode - I tried this and it fixes

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

2024 Jan 07

On Thu, Jan 04, 2024 at 05:50:22PM +0100, Robert Stepanek wrote: > Since I am undecided yet if and how to fix this in Xapian I haven't > come up with a pull request. Because trac currently is offline, I > could not file a bug. I hope it's OK to post my analysis here first, > I'll be happy to follow up reporting that bug proper later (should we > conclude that it actually

Pull requests: CJK words and Snippet generator

2016 Sep 19

Olly, sorry for my delayed reply. Am Mo, 12. Sep 2016, um 05:32, schrieb Olly Betts: > On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote: > > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote: > > > I think my main concerns are about efficiency [...] > > For the proposed term coverage, the implementation looks up and inserts > > terms into a map. That

Pull requests: CJK words and Snippet generator

2016 Jul 26

Hi, The Cyrus IMAP mail server uses Xapian as its search engine. Recently, FastMail sponsored the implementation of two Xapian features: CJK word splitting and a generator for search snippets. I've been working on both features and we would be happy to get them merged into Xapian master. The CJK word tokenizer uses the word segmentation algorithms of the International Components for Unicode

Pull requests: CJK words and Snippet generator

2016 Sep 07

On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote: > I think my main concerns are about efficiency (since that's a major > motivation for the current implementation, so slowing it down would be > annoying), and whether we can just make this the standard behaviour > rather than adding an option. The current implementation is O(n) and I took care to keep it at that. For the proposed term

Ask for advice on exact requirements to fix #699 mixed CJK numbers

2019 Mar 07

I am working on "#699 Better tokenisation of mixed CJK numbers" and have implemented a partial patch covering Chinese for this ticket. The current code works well with the special test cases, and all tests in xapian-core still pass. But I'm unsure about the exact requirements: how much performance we can afford to pay to handle more cases, and whether there are better methods to

GSOC 2011- CJK Support

2011 Apr 07

Hello everyone, I am Yongzhi Zhang, a Chinese student. I'm interested in CJK Support (also known as Chinese, Japanese, and Korean Support). I have 6 years of experience in software development (C/C++ and Java). I want to work on this project, "CJK Support". I come from Beijing, China. Chinese is my native language, which is my advantage for "CJK Support". I have fixed a bug for

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

2024 Jan 09

On Mon, Jan 08, 2024 at 02:01:46PM +0100, Robert Stepanek wrote: > Removing the whole block will cause word-breaker to not correctly > handle halfwidth Katakana, such as "??????????" which it would treat > as a single term, whereas it should be two: ??????and ????). > > My pull request causes word-breaker to only handle halfwidth Katakana > and Hangul codepoints as

Pull requests: CJK words and Snippet generator

2016 Jul 29

Hi James, thanks for the feedback. On Thu, Jul 28, 2016, at 00:22, James Aylett wrote: > This sounds great! I know sufficiently little about CJK that I won't > try to comment on that at all :) I've just opened a pull request for the CJK tokenizer: https://github.com/xapian/xapian/pull/114 > I wonder if we can arrange suitable defaults to use your > implementation with the

Icon or CJK fonts in MENU TITLE, is that possible in the future ?

2006 Sep 27

First I would like to say thank you to HPA for providing some really nice features in recent syslinux versions. About new functions, I actually have another radical idea: since we are in Asia, most of the users here would like to see local fonts in the syslinux/pxelinux menu. I am wondering whether it is possible that, in the future, the syslinux/pxelinux menu could support CJK fonts or icons?

Pull requests: CJK words and Snippet generator

2016 Aug 05

On Thu, Aug 4, 2016, at 15:08, James Aylett wrote: > On Wed, Aug 03, 2016 at 08:17:05PM +0200, rsto at paranoia.at wrote: > > I'll notify you when the CJK pull request passes Travis. > > That's great, thanks! Alright, after lots of fiddling with .travis.yml I finally made the pull request build on Travis' trusty image: https://github.com/xapian/xapian/pull/114 I have

Pull requests: CJK words and Snippet generator

2016 Aug 18

Hi, On Thu, Aug 11, 2016, at 13:19, rsto at paranoia.at wrote: > The CJK word segmentation and snippet pull requests both pass Travis > since middle/end of last week. Did you find time to look at them? Just checking in on whether you found time to look at the PRs. It'd be nice to know a tentative timeline, so I can plan whether to build the next features on top of our local fork or the upstream PRs.

Ask for advice on exact requirements to fix #699 mixed CJK numbers

2019 Mar 09

Thanks for your patience. I'm still confused about what I should do next. If it's not worth changing anything here because it's a rare case, I apologize for opening my PR on GitHub before your reply; you may need to close it there. Otherwise, should I optimize the current code by replacing the set with a static array? Or roll back the current modification to cjk-tokenizer and try to do some work with the

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

2024 Jan 04

I think I found a bug in Xapian 1.5 when using FLAG_WORD_BREAKS for input that contains characters in Unicode Halfwidth and Fullwidth Forms (https://unicode.org/charts/PDF/UFF00.pdf). Since I am undecided yet if and how to fix this in Xapian I haven't come up with a pull request. Because trac currently is offline, I could not file a bug. I hope it's OK to post my analysis here first,
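
The thread above turns on which codepoints in the Halfwidth and Fullwidth Forms block are letters (the "Lu or Ll" parts mentioned in a reply) and which are Katakana, Hangul, or punctuation. As a rough illustration only, the following Python sketch tallies Unicode general categories across that block (U+FF00 to U+FFEF) using the standard `unicodedata` module. It does not reproduce Xapian's tokenizer logic; the helper name `category_counts` is made up for this example.

```python
import unicodedata
from collections import Counter

# Halfwidth and Fullwidth Forms block, U+FF00..U+FFEF.
BLOCK = range(0xFF00, 0xFFF0)


def category_counts(codepoints):
    """Count assigned codepoints per Unicode general category.

    Hypothetical helper for illustration; not part of Xapian.
    """
    counts = Counter()
    for cp in codepoints:
        cat = unicodedata.category(chr(cp))
        if cat != "Cn":  # skip unassigned codepoints
            counts[cat] += 1
    return counts


if __name__ == "__main__":
    # Overall breakdown of the block by general category.
    for cat, n in sorted(category_counts(BLOCK).items()):
        print(cat, n)
    # A few representative codepoints: fullwidth Latin letters, a
    # fullwidth digit, fullwidth punctuation, and halfwidth Katakana.
    for ch in ("\uFF21", "\uFF41", "\uFF11", "\uFF01", "\uFF76"):
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

The fullwidth Latin letters report as `Lu` and `Ll`, while halfwidth Katakana letters such as U+FF76 report as `Lo`, which is the distinction the discussion relies on.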

Sweave, cairo_pdf, CJK, ghostscript

2011 Oct 22

I have had some fun in the last few days trying to put together an annotated map of China with R and some public GIS data: http://sourceforge.net/projects/outmodedbonsai/files/snpMatrix%20next/1.17.7.11/China_Choropleth_Maps.pdf/download It is done, and rather nice... there are a few issues: - the default pdf() device cannot do CJK with embedded fonts - and cairo_pdf() is not hooked up to

Chinese, Japanese, Korean Tokenizer.

2007 Jun 05

Hi, I am looking for a Chinese, Japanese and Korean tokenizer that can be used to tokenize terms for CJK languages. I am not very familiar with these languages; however, I think a single symbol in them can contain one or more words, which makes it more difficult to tokenize text into searchable terms. Lucene has a CJK Tokenizer ... and I am looking around to see if there is some open source that we

RFC: Kerning, postscript() and pdf()

2008 Oct 12

Ei-ji Nakama has pointed out (from another Japanese user, I believe) that postscript() and pdf() have not been handling kerning correctly, and this is a request for opinions about how we should correct it. Kerning is the adjustment of the spacing between letters from their natural width, so that for example 'Yo' is usually typeset with the o closer to the Y than 'Yl' would be.

Pull requests: CJK words and Snippet generator

2016 Aug 03

On Wed, Aug 3, 2016, at 19:26, James Aylett wrote: > On Wed, Aug 03, 2016 at 06:54:32PM +0200, rsto at paranoia.at wrote: > > Oddly enough, the pull request causes Travis to break for clang but not > > for gcc [1]. That's because the clang build process fails for the test > > 'querypairwise1' [2], which AFAIK I didn't touch at all. Is that a > > known

remove Punctuation characters

2006 May 09

Hi, I want to remove all punctuation characters in a string. I was trying to use a regular expression but it doesn't work. Here is a sample of what I want: str <- 'ABD - remove de punct, and dot characters.' str <- gsub('[:punct:]','',str) str "'ABD remove de punct and dot characters" Is there any function that does this kind of thing? Thanks to
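
A likely cause of the problem in the question above: in R, a POSIX class such as `[:punct:]` is only recognized inside a bracket expression, so the pattern is usually written `'[[:punct:]]'`. For comparison, here is a minimal sketch of the same task in Python rather than R. It uses `string.punctuation`, which covers ASCII punctuation only; the function name `strip_punctuation` is an assumption for this example.

```python
import string


def strip_punctuation(text):
    """Remove ASCII punctuation, then collapse the leftover whitespace."""
    # A translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    # split()/join() squeezes the double space left where " - " used to be.
    return " ".join(text.translate(table).split())


if __name__ == "__main__":
    print(strip_punctuation("ABD - remove de punct, and dot characters."))
```

Collapsing whitespace is a design choice made here so that the result matches the single-spaced output the poster asked for; drop the `split()`/`join()` step to keep the original spacing.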

Pull requests: CJK words and Snippet generator

2016 Aug 03

Hi, On Fri, Jul 29, 2016, at 13:45, James Aylett wrote: > On Fri, Jul 29, 2016 at 12:12:25PM +0200, rsto at paranoia.at wrote: > > The FastMail snippet generator has been written when MSet didn't create > > snippets. I'll first compare both implementations to see if there is a > > good reason for them to coexist, or might just as well merge any > > additional
