Displaying 20 results from an estimated 3000 matches similar to: "patch - Some CJK codepoints are also punctuation"
2024 Jan 08
1
Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
On Sun, Jan 7, 2024, at 7:45 PM, Olly Betts wrote:
> I've restarted trac.
I now created a pull request: https://github.com/xapian/xapian/pull/329 Should I create a trac issue, too?
> Assuming the latter is valid, just removing this block (or removing the
> parts of it which are Lu or Ll) should fix the problem as then
> tokenisation will switch mode - I tried this and it fixes
2024 Jan 07
1
Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
On Thu, Jan 04, 2024 at 05:50:22PM +0100, Robert Stepanek wrote:
> Since I am undecided yet if and how to fix this in Xapian I haven't
> come up with a pull request. Because trac currently is offline, I
> could not file a bug. I hope it's OK to post my analysis here first,
> I'll be happy to follow up reporting that bug proper later (should we
> conclude that it actually
2016 Sep 19
2
Pull requests: CJK words and Snippet generator
Olly, sorry for my delayed reply.
Am Mo, 12. Sep 2016, um 05:32, schrieb Olly Betts:
> On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > > I think my main concerns are about efficiency [...]
> > For the proposed term coverage, the implementation looks up and inserts
> > terms into a map. That
2016 Jul 26
2
Pull requests: CJK words and Snippet generator
Hi,
The Cyrus IMAP mail server uses Xapian as search engine. Recently,
FastMail has sponsored implementation of two Xapian features: CJK word
splitting and a generator for search snippets. I've been working on both
features and we would be happy to get them merged into Xapian master.
The CJK word tokenizer uses the word segmentation algorithms of the
International Components for Unicode
2016 Sep 07
2
Pull requests: CJK words and Snippet generator
On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> I think my main concerns are about efficiency (since that's a major
> motivation for the current implementation, so slowing it down would be
> annoying), and whether we can just make this the standard behaviour
> rather than adding an option.
The current implementation is O(n) and I took care to keep it at that.
For the proposed term
2019 Mar 07
3
Ask for advice on exact requirements to fix #699 mixed CJK numbers
I am working on "#699 Better tokenisation of mixed CJK numbers",
and have implemented a partial patch covering Chinese for this ticket.
The current code works well with the special test cases, and
all tests in xapian-core still pass.
But I'm unsure about the exact requirements of the ticket:
how much performance we can afford to pay to handle more cases,
and if there are better methods to
2024 Jan 09
1
Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
On Mon, Jan 08, 2024 at 02:01:46PM +0100, Robert Stepanek wrote:
> Removing the whole block will cause word-breaker to not correctly
> handle halfwidth Katakana, such as "??????????" which it would treat
> as a single term, whereas it should be two: ??????and ????).
>
> My pull request causes word-breaker to only handle halfwidth Katakana
> and Hangul codepoints as
2011 Apr 07
1
GSOC 2011- CJK Support
Hello everyone, I am Yongzhi Zhang, a Chinese student.
I'm interested in CJK Support (also known as Chinese, Japanese, and Korean
Support).
I have 6 years of experience in software development (C/C++ and Java).
I want to work on the "CJK Support" project. I come from Beijing, China.
Chinese is my native language, which is an advantage for "CJK Support".
I have fixed a bug for
2016 Jul 29
3
Pull requests: CJK words and Snippet generator
Hi James,
thanks for the feedback.
On Thu, Jul 28, 2016, at 00:22, James Aylett wrote:
> This sounds great! I know sufficiently little about CJK that I won't
> try to comment on that at all :)
I've just opened a pull request for the CJK tokenizer:
https://github.com/xapian/xapian/pull/114
> I wonder if we can arrange suitable defaults to use your
> implementation with the
2006 Sep 27
3
Icon or CJK fonts in MENU TITLE, is that possible in the future ?
First I would like to say thank you to HPA for providing some really nice features in recent syslinux versions.
As for new functions, I have another, more radical idea: since we are in Asia, most users here would like to see local fonts in the syslinux/pxelinux menu. I am wondering whether, in the future, the syslinux/pxelinux menu could support CJK fonts or icons?
2016 Aug 05
2
Pull requests: CJK words and Snippet generator
On Thu, Aug 4, 2016, at 15:08, James Aylett wrote:
> On Wed, Aug 03, 2016 at 08:17:05PM +0200, rsto at paranoia.at wrote:
> > I'll notify you when the CJK pull request passes Travis.
>
> That's great, thanks!
Alright, after lots of fiddling with .travis.yml I finally made the pull
request build on Travis' trusty image:
https://github.com/xapian/xapian/pull/114
I have
2016 Aug 18
2
Pull requests: CJK words and Snippet generator
Hi,
On Thu, Aug 11, 2016, at 13:19, rsto at paranoia.at wrote:
> The CJK word segmentation and snippet pull requests both pass Travis
> since middle/end of last week. Did you find time to look at them?
Just checking in: have you found time to look at the PRs? It'd be nice to
know a tentative timeline, so I can plan whether to build the next features
on top of our local fork or the upstream PRs.
2008 Oct 12
2
RFC: Kerning, postscript() and pdf()
Ei-ji Nakama has pointed out (from another Japanese user, I believe) that
postscript() and pdf() have not been handling kerning correctly, and this
is a request for opinions about how we should correct it.
Kerning is the adjustment of the spacing between letters from their
natural width, so that for example 'Yo' is usually typeset with the o
closer to the Y than 'Yl' would be.
2006 May 09
3
remove Punctuation characters
Hi,
I want to remove all punctuation characters in a string. I tried to use
a regular expression but it doesn't work.
Here is a sample of what I want:
str <- 'ABD - remove de punct, and dot characters.'
str <- gsub('[:punct:]','',str)
str
"'ABD remove de punct and dot characters"
Is there any function that does this kind of thing?
Thanks to
2019 Mar 09
2
Ask for advice on exact requirements to fix #699 mixed CJK numbers
Thanks for your patience.
I'm still confused about what I should do next.
If it's not worth changing anything here because it's a rare case,
I apologise for opening my PR on GitHub before getting a reply;
maybe you need to close it on GitHub.
Otherwise, should I optimize the current code by
replacing the set with a static array?
Or roll back the current modification to cjk-tokenizer and
try to do some work with the
2024 Jan 04
1
Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
I think I found a bug in Xapian 1.5 when using FLAG_WORD_BREAKS for input that contains characters in Unicode Halfwidth and Fullwidth Forms (https://unicode.org/charts/PDF/UFF00.pdf).
Since I am undecided yet if and how to fix this in Xapian I haven't come up with a pull request. Because trac currently is offline, I could not file a bug. I hope it's OK to post my analysis here first,
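To illustrate why the Halfwidth and Fullwidth Forms block mentioned in this thread is awkward for a tokenizer, here is a minimal sketch that inspects a few codepoints from that block using Python's standard unicodedata module. It is not Xapian code and does not reproduce Xapian's word-break logic; the sample codepoints and the helper names (SAMPLES, describe, summarize) are chosen here purely for illustration.

```python
import unicodedata

# Hypothetical sample set: a few codepoints from the Halfwidth and
# Fullwidth Forms block (U+FF00..U+FFEF), picked for illustration only.
SAMPLES = {
    "fullwidth_capital_a": "\uFF21",    # FULLWIDTH LATIN CAPITAL LETTER A
    "fullwidth_small_a": "\uFF41",      # FULLWIDTH LATIN SMALL LETTER A
    "fullwidth_comma": "\uFF0C",        # FULLWIDTH COMMA
    "halfwidth_katakana_ka": "\uFF76",  # HALFWIDTH KATAKANA LETTER KA
}


def describe(ch):
    """Return (general category, East Asian width) for one character."""
    return unicodedata.category(ch), unicodedata.east_asian_width(ch)


def summarize():
    """Map each sample name to its (category, width) pair."""
    return {name: describe(ch) for name, ch in SAMPLES.items()}


if __name__ == "__main__":
    for name, (category, width) in summarize().items():
        print(f"{name}: category={category} width={width}")
```

The block mixes cased Latin letters (categories Lu and Ll), punctuation (Po) and Katakana letters (Lo), so treating the whole range uniformly as CJK text, or uniformly as non-CJK text, will misclassify some of its codepoints.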
2011 Oct 22
3
Sweave, cairo_pdf, CJK, ghostscript
I have had some fun in the last few days trying to put together an annotated map of China with R and some public GIS data:
http://sourceforge.net/projects/outmodedbonsai/files/snpMatrix%20next/1.17.7.11/China_Choropleth_Maps.pdf/download
It is done, and rather nice... there are a few issues:
- the default pdf() device cannot do CJK with embedded fonts - and cairo_pdf() is not hooked up to
2007 Sep 14
1
Allowed punctuation in samba filenames?
hi
I've run into problems moving stuff from one ext3 filesystem to another
via samba. It seems the problem is due to some punctuation not being
allowed in paths/filenames.
Can anyone tell me where I can find a definitive list of allowed /
illegal characters?
Thanks in advance
Matt
--
==================================
Dr Matthew Studley
Lecturer : Robotics
Bristol Robotics Laboratory
2007 Jun 05
7
Chinese, Japanese, Korean Tokenizer.
Hi,
I am looking for a Chinese, Japanese and Korean tokenizer that can
be used to tokenize terms for CJK languages. I am not very familiar
with these languages; however, I think that in these languages a single
symbol can contain one or more words, which makes it more difficult to
tokenize text into searchable terms.
Lucene has CJK Tokenizer ... and I am looking around if there is some
open source that we
2016 Aug 03
2
Pull requests: CJK words and Snippet generator
On Wed, Aug 3, 2016, at 19:26, James Aylett wrote:
> On Wed, Aug 03, 2016 at 06:54:32PM +0200, rsto at paranoia.at wrote:
> > Oddly enough, the pull request causes Travis to break for clang but not
> > for gcc [1]. That's because the clang build process fails for the test
> > 'querypairwise1' [2], which AFAIK I didn't touch at all. Is that a
> > known