thr3ads.net - R devel - [Rd] bug in strsplit? [May 2009]

Wacek Kusnierczyk

2009-May-29 07:49 UTC

[Rd] bug in strsplit?

src/main/character.c:435-438 (do_strsplit) contains the following code:

    for (i = 0; i < tlen; i++)
        if (getCharCE(STRING_ELT(tok, 0)) == CE_UTF8) use_UTF8 = TRUE;
    for (i = 0; i < len; i++)
        if (getCharCE(STRING_ELT(x, 0)) == CE_UTF8) use_UTF8 = TRUE;

since both loops iterate over loop-invariant expressions and statements,
either the loops are redundant, or the fixed index '0' was meant to
actually be the variable i.  i guess it's the latter, hence 'bug?' in
the subject.

it also appears that if *any* element of tok (or x) positively passes
the test, use_UTF8 is set to TRUE;  in such a case, further checks make
no sense.  the following rewrite cuts the inessential computation:

    for (i = 0; i < tlen; i++)
        if (getCharCE(STRING_ELT(tok, i)) == CE_UTF8) {
            use_UTF8 = TRUE;
            break; }
    for (i = 0; i < len; i++)
        if (getCharCE(STRING_ELT(x, i)) == CE_UTF8) {
            use_UTF8 = TRUE;
            break; }
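
to make the difference concrete, here is a minimal standalone sketch.  it
is not R's code: enc_t, ENC_NATIVE, ENC_UTF8, the two helper functions and
the sample array are made-up stand-ins for cetype_t, getCharCE and
STRING_ELT, so that it compiles without the R headers.  it only
illustrates what happens when the first element is not marked as UTF-8
but a later one is.

```c
#include <stddef.h>

/* hypothetical stand-in for R's encoding marks; not part of R's API */
typedef enum { ENC_NATIVE = 0, ENC_UTF8 = 1 } enc_t;

/* mirrors the current loops: every iteration inspects element 0 */
static int any_utf8_fixed_index(const enc_t *enc, size_t n)
{
    int use_utf8 = 0;
    size_t i;
    for (i = 0; i < n; i++)
        if (enc[0] == ENC_UTF8) use_utf8 = 1;
    return use_utf8;
}

/* mirrors the rewrite: inspects element i and stops at the first hit */
static int any_utf8_indexed(const enc_t *enc, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (enc[i] == ENC_UTF8) return 1;
    return 0;
}

/* sample input: only the second element is marked as UTF-8 */
static const enc_t sample[] = { ENC_NATIVE, ENC_UTF8 };
```

with this input, any_utf8_fixed_index(sample, 2) yields 0 because it never
looks past the first element, while any_utf8_indexed(sample, 2) yields 1.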
            
since the pattern is repetitive, the following generic approach would
help (and the macro could possibly be reused in other places):

#define CHECK_CE(CHARACTER, LENGTH, USEUTF8) \
    for (i = 0; i < (LENGTH); i++) \
        if (getCharCE(STRING_ELT((CHARACTER), i)) == CE_UTF8) { \
            (USEUTF8) = TRUE; \
            break; }
CHECK_CE(tok, tlen, use_UTF8)
CHECK_CE(x, len, use_UTF8)
            
if you like it, i can provide a patch.

vQ
