thr3ads.net - R devel - [Rd] Unicode whitespace [Jan 2008]

hadley wickham

2008-Jan-04 18:13 UTC

[Rd] Unicode whitespace

It would be nice if R ignored more unicode white space characters.
For example, if I have  "\u2028" in a command (which I get from a
line-break in keynote) I get the following error:
> qplot(carat, price, data = diamonds,   colour=clarity)
Error: unexpected input in "qplot(carat, price, data = diamonds, ?"

And occasionally have such problems when copying and pasting from
emails as well.

Wikipedia lists the following codepoints as whitespace (I'm sure there
is a more definitive reference but I could not find one with some
quick googling):

U0009-U000D (Control characters, containing TAB, CR and LF)
U0020 SPACE
U0085 NEL
U00A0 NBSP
U1680 OGHAM SPACE MARK
U180E MONGOLIAN VOWEL SEPARATOR
U2000-U200A (different sorts of spaces)
U2028 LSP
U2029 PSP
U202F NARROW NBSP
U205F MEDIUM MATHEMATICAL SPACE
U3000 IDEOGRAPHIC SPACE

would it be possible for R to treat these all in the same way? (Or
does it already but my R is misconfigured?)
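
For concreteness, the list above can be written down as a C predicate. This is only a sketch: is_listed_space is a hypothetical name, not anything in R's sources, and the table simply mirrors the Wikipedia list quoted above rather than an authoritative reference. A parser could not just skip everything it matches, since LF and CR in the first range normally end a line.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of R): true if cp is one of the code
   points in the list quoted above. */
static bool is_listed_space(uint32_t cp)
{
    return (cp >= 0x0009 && cp <= 0x000D)   /* TAB, LF, VT, FF, CR */
        || cp == 0x0020                     /* SPACE */
        || cp == 0x0085                     /* NEL */
        || cp == 0x00A0                     /* NBSP */
        || cp == 0x1680                     /* OGHAM SPACE MARK */
        || cp == 0x180E                     /* MONGOLIAN VOWEL SEPARATOR */
        || (cp >= 0x2000 && cp <= 0x200A)   /* different sorts of spaces */
        || cp == 0x2028                     /* LSP */
        || cp == 0x2029                     /* PSP */
        || cp == 0x202F                     /* NARROW NBSP */
        || cp == 0x205F                     /* MEDIUM MATHEMATICAL SPACE */
        || cp == 0x3000;                    /* IDEOGRAPHIC SPACE */
}
```

With this, is_listed_space(0x2028) is true for the Keynote line break, while an ordinary letter such as 'a' is rejected.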

Hadley

-- 
http://had.co.nz/

Prof Brian Ripley

2008-Jan-05 07:40 UTC

[Rd] Unicode whitespace

I presume you want this only in a UTF-8 locale?

Currently this is done by

static int SkipSpace(void)
{
     int c;
     while ((c = xxgetc()) == ' ' || c == '\t' || c == '\f')
         /* nothing */;
     return c;
}

in gram.c.  We could make use of isspace and its wide-char equivalent 
iswspace.  However:


- there is the perennial debate over whether \v is whitespace.

R-lang says

   Although not strictly tokens, stretches of whitespace characters
   (spaces and tabs) serve to delimit tokens in case of ambiguity,

which suggests it has a minimal view of whitespace.


- iswspace is often rather unreliable.  E.g. glibc says

     The wide character class "space" always contains at least the space
     character and the control characters '\f', '\n', '\r', '\t', '\v'.

and I think it usually does not contain other forms of spaces.  More 
seriously

     The  behaviour  of  iswspace()  depends on the LC_CTYPE category of the
     current locale.

so what is a space will depend on the encoding (hence my question about 
UTF-8).  And Ei-ji Nakama has replaced iswspace on Mac OS X, because 
apparently it is wrongly implemented there.


- it would complicate the parser as look-ahead would be needed (you would 
need to read the next mbcs, check whether it is whitespace and push it 
back if not).  We do that elsewhere, though.
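
The look-ahead described in the last point might look roughly like the following. This is a sketch under stated assumptions, not the code in gram.c: it works on an in-memory UTF-8 buffer instead of xxgetc, it decodes bytes by hand so that it does not depend on iswspace or on LC_CTYPE, and the names next_codepoint, is_skippable and skip_space_utf8 are all hypothetical. The set of skipped characters (the current three plus NBSP and U+2028) is chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define BAD_CP 0xFFFFFFFFu   /* marker for malformed or truncated input */

/* Decode one UTF-8 sequence at s[*pos] and advance *pos past it.
   Caller guarantees *pos < len.  On malformed input, advance one byte
   and return BAD_CP.  Simplified: overlong forms are not rejected. */
static uint32_t next_codepoint(const unsigned char *s, size_t len, size_t *pos)
{
    unsigned char b = s[*pos];
    uint32_t cp;
    size_t extra, i;

    if (b < 0x80)                { cp = b;        extra = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else { (*pos)++; return BAD_CP; }

    if (*pos + extra >= len) { (*pos)++; return BAD_CP; }
    for (i = 1; i <= extra; i++) {
        unsigned char t = s[*pos + i];
        if ((t & 0xC0) != 0x80) { (*pos)++; return BAD_CP; }
        cp = (cp << 6) | (t & 0x3F);
    }
    *pos += extra + 1;
    return cp;
}

/* Illustrative choice of what to skip: SkipSpace's current set plus
   NBSP and U+2028. */
static int is_skippable(uint32_t cp)
{
    return cp == ' ' || cp == '\t' || cp == '\f'
        || cp == 0x00A0 || cp == 0x2028;
}

/* Return the index of the first character that is not skippable. */
static size_t skip_space_utf8(const unsigned char *s, size_t len, size_t pos)
{
    while (pos < len) {
        size_t look = pos;                  /* look-ahead cursor */
        uint32_t cp = next_codepoint(s, len, &look);
        if (!is_skippable(cp))
            break;                          /* "pushback": pos is unchanged */
        pos = look;                         /* consume the whitespace */
    }
    return pos;
}
```

The pushback here is just a matter of not committing the look-ahead cursor. With a getc-style interface the same idea needs every byte of the multibyte character to be pushed back, which is the complication mentioned above.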


The only one of these 'spaces' I have much sympathy for is NBSP (which is
also fairly easy to generate in CP1252).  It would be easy to add that.
Otherwise I am not convinced it is worth the work (and added uncertainty).
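
For a single-byte encoding, adding NBSP could be as small as one extra comparison. The sketch below is an assumption about how that might look, not a patch: xxgetc_stub, src_buf and src_pos are hypothetical stand-ins for the parser's real input routine, and SkipSpaceNBSP is a made-up name. The test against 0xA0 is only valid in Latin-1 or CP1252; in UTF-8 the byte 0xA0 is a continuation byte and must not be examined on its own.

```c
#include <stdio.h>   /* EOF */

/* Hypothetical stand-in for the parser's input (NOT R's xxgetc):
   a NUL-terminated byte buffer and a cursor into it. */
static const unsigned char *src_buf = (const unsigned char *)"";
static int src_pos = 0;

static int xxgetc_stub(void)
{
    unsigned char c = src_buf[src_pos];
    if (c == '\0')
        return EOF;
    src_pos++;
    return c;
}

/* SkipSpace as quoted earlier, plus NBSP as the single byte 0xA0.
   Assumes a single-byte locale such as Latin-1 or CP1252. */
static int SkipSpaceNBSP(void)
{
    int c;
    while ((c = xxgetc_stub()) == ' ' || c == '\t' || c == '\f' || c == 0xA0)
        /* nothing */;
    return c;
}
```

Pointing src_buf at the bytes " \xA0\tx" and calling SkipSpaceNBSP() returns 'x', the first character that is not skipped.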



On Fri, 4 Jan 2008, hadley wickham wrote:
> It would be nice if R ignored more unicode white space characters.
> For example, if I have  "\u2028" in a command (which I get from a
> line-break in keynote) I get the following error:
>
>> qplot(carat, price, data = diamonds, ??  colour=clarity)
> Error: unexpected input in "qplot(carat, price, data = diamonds, ?"
>
> And occasionally have such problems when copying and pasting from
> emails as well.
>
> Wikipedia lists the following codepoints as whitespace (I'm sure there
> is a more definitive reference but I could not find one with some
> quick googling):
>
> U0009-U000D (Control characters, containing TAB, CR and LF)

Most of those are not normally considered whitespace.

> U0020 SPACE
> U0085 NEL
> U00A0 NBSP
> U1680 OGHAM SPACE MARK
> U180E MONGOLIAN VOWEL SEPARATOR
> U2000-U200A (different sorts of spaces)
> U2028 LSP
> U2029 PSP
> U202F NARROW NBSP
> U205F MEDIUM MATHEMATICAL SPACE
> U3000 IDEOGRAPHIC SPACE
>
> would it be possible for R to treat these all in the same way? (Or
> does it already but my R is misconfigured?)
>
> Hadley
>
>
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
