nakama at ki.rim.or.jp
2007-Jun-24 13:46 UTC
[Rd] problem gsub in the locale of CP932 and SJIS (PR#9751)
Full_Name: Ei-ji Nakama
Version: R-2.5.0
OS: any
Submission from: (NULL) (219.117.236.5)

gsub has a problem in the CP932 and SJIS locales.
It mishandles characters whose byte after the first byte is 0x5c.

--- R-2.5.0.orig/src/main/character.c   2007-04-03 11:05:05.000000000 +0900
+++ R-2.5.0/src/main/character.c        2007-06-24 22:31:06.000000000 +0900
@@ -986,6 +986,17 @@
     char *p = repl;
     n = strlen(repl) - (regmatch[0].rm_eo - regmatch[0].rm_so);
     while (*p) {
+#ifdef SUPPORT_MBCS
+        if(mbcslocale){
+            int clen;
+            mbstate_t mb_st;
+            mbs_init(&mb_st);
+            if((clen = Mbrtowc(NULL, p, MB_CUR_MAX, &mb_st)) > 1){
+                p+=clen;
+                continue;
+            }
+        }
+#endif
         if (*p == '\\') {
             if ('1' <= p[1] && p[1] <= '9') {
                 k = p[1] - '0';
@@ -1014,6 +1025,18 @@
     int i, k;
     char *p = repl, *t = target;
     while (*p) {
+#ifdef SUPPORT_MBCS
+        if(mbcslocale){
+            int clen;
+            mbstate_t mb_st;
+            mbs_init(&mb_st);
+            if((clen = Mbrtowc(NULL, p, MB_CUR_MAX, &mb_st)) > 1){
+                for ( i=0; i<clen; i++)
+                    *t++ = *p++;
+                continue;
+            }
+        }
+#endif
         if (*p == '\\') {
             if ('1' <= p[1] && p[1] <= '9') {
                 k = p[1] - '0';
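The failure mode described in the report can be reproduced outside R with a small, self-contained sketch. It is not taken from R's sources: the helper names sjis_is_lead, count_backslash_bytes and count_backslash_chars are hypothetical, and the lead-byte ranges are the usual Shift_JIS/CP932 ones (0x81-0x9f and 0xe0-0xfc), hard-coded here so that no Japanese locale needs to be installed. The point it illustrates is that a byte-by-byte scan for '\\' also matches the 0x5c trail byte of a two-byte character.

```c
#include <stddef.h>

/* Hypothetical helper (not part of R): nonzero if b can start a
 * two-byte character in Shift_JIS / CP932. */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
}

/* Naive scan: counts every 0x5c byte, including trail bytes. */
static size_t count_backslash_bytes(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (*s == '\\')
            n++;
    return n;
}

/* Encoding-aware scan: steps over each two-byte character as a unit,
 * so only stand-alone 0x5c bytes are counted as backslashes. */
static size_t count_backslash_chars(const char *s)
{
    size_t n = 0;
    while (*s) {
        if (sjis_is_lead((unsigned char)*s) && s[1] != '\0') {
            s += 2;             /* skip lead byte and trail byte together */
            continue;
        }
        if (*s == '\\')
            n++;
        s++;
    }
    return n;
}
```

For the byte sequence 83 5c 95 5c (two two-byte characters, each with trail byte 0x5c), the naive scan reports two backslashes and the encoding-aware scan reports none. The submitted patch takes the locale-driven route instead, asking Mbrtowc for the character length rather than hard-coding byte ranges.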
ripley at stats.ox.ac.uk
2007-Jun-25 09:24 UTC
[Rd] problem gsub in the locale of CP932 and SJIS (PR#9751)
Thanks for this.

I don't think the patch is quite right. As I understand it, mbstate_t
should be initialized at the start of the string, not before each
character, and that is what is done in the rest of R.

Also, do you have an example I can use to test the patch, please?

R 2.5.0 is now in code freeze and I don't think this is vital for that.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
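The point about where the conversion state is initialized can be sketched with the standard C function mbrtowc. This is an illustration under stated assumptions, not the revised patch: it uses plain mbrtowc and memset rather than R's internal Mbrtowc and mbs_init, and count_backslashes_mbcs is a hypothetical name. The mbstate_t is zeroed once before the loop and then carried from character to character, which matters for encodings that keep shift state. Shift_JIS itself is stateless, so per-character initialization would probably give the same result there.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts backslashes that are complete characters
 * in the current locale's multibyte encoding.
 * Returns (size_t)-1 if the string contains an invalid or truncated
 * multibyte sequence. */
static size_t count_backslashes_mbcs(const char *s)
{
    mbstate_t st;
    size_t n = 0;
    size_t left = strlen(s);

    /* Initialize the conversion state once, at the start of the string. */
    memset(&st, 0, sizeof st);

    while (left > 0) {
        /* Length in bytes of the next character; state is carried over. */
        size_t clen = mbrtowc(NULL, s, left, &st);
        if (clen == (size_t)-1 || clen == (size_t)-2)
            return (size_t)-1;  /* invalid or incomplete sequence */
        if (clen == 0)
            break;              /* defensive: embedded NUL */
        if (clen == 1 && *s == '\\')
            n++;                /* a genuine single-byte backslash */
        s += clen;
        left -= clen;
    }
    return n;
}
```

A caller would first select the locale with setlocale(LC_CTYPE, ""); without that call the program runs in the "C" locale, where every ASCII byte is a one-byte character.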
nakama at ki.rim.or.jp
2007-Jun-25 10:08 UTC
[Rd] problem gsub in the locale of CP932 and SJIS (PR#9751)
Thanks.

I agree that mbs_init should be called outside the loop.

The problem code is:

> gsub("A","\u30bd\u8868","A")

It works without a problem in EUC-JP and UTF-8.

> Sys.getlocale("LC_CTYPE") # SHIFT_JIS system.
[1] "ja_JP.SJIS"
> charToRaw("\u30bd\u8868") # The second byte of each character is 5c
[1] 83 5c 95 5c

-- 
EI-JI Nakama <nakama at ki.rim.or.jp>
"中間栄治" <nakama at ki.rim.or.jp>