Spencer Graves
2022-Jun-25 13:13 UTC
[Rd] as.Date (and strptime?) does not recognize " " as a blank
Hi, Maxim et al.:

On 6/25/22 6:10 AM, Maxim Nazarov wrote:
> Hello,
>
>> When is a space not a space?
>
> I guess the answer is when it is a non-breaking one?..
>
> We can observe:
>
>   > charToRaw(textutils::HTMLdecode("&nbsp;"))
>   [1] c2 a0
>   > charToRaw(" ")
>   [1] 20
>
> So one can argue that everything works correctly - the `textutils`
> function converts HTML's non-breaking space '&nbsp;' into R's
> non-breaking space '\xa0', while the %e format of as.Date expects a
> 'normal' space.
> But this is obviously not user-friendly, especially since both symbols
> are displayed the same way on the console.
> So your options might be to either:
>   * manually change all 'weird' spaces into normal ones with something
>     like gsub("\\h", " ", ..., perl = TRUE) - for the list of other weird
>     spaces see
>     https://www.pcre.org/original/doc/html/pcrepattern.html#genericchartypes
>   * persuade the textutils author to change &nbsp; into a normal space
>     (they seem to be working with a simple lookup table -
>     https://github.com/enricoschumann/textutils/blob/b813c7bd4b55daef5fa7612e3fbfe82962711940/R/char_refs.R#L1465-L1466)
>   * persuade R-Core (or submit a PR) to relax expectations of
>     as.Date/strptime

Thanks for the reply. Since "this is obviously not user-friendly", as
you noted, I felt a need to bring it to the attention of this group, and
let them decide what, if anything, they would want to do about it.

In any event, I found a fix for my immediate problem. It's not as
elegant as yours, but it works.

Best Wishes,
Spencer

> Kind regards,
> Maxim Nazarov
>
> ----- On Jun 25, 2022, at 8:37 AM, Spencer Graves
> spencer.graves at prodsyse.com wrote:
>
>> Hello, All:
>>
>> When is a space not a space?
>>
>> Consider the following:
>>
>>> (pblmDate <- textutils::HTMLdecode("&nbsp;2 Mar 2018"))
>> [1] " 2 Mar 2018"
>>> as.Date(pblmDate, format='%e %b %Y')
>> [1] NA
>>> as.Date(' 2 Mar 2018', format='%e %b %Y')
>> [1] "2018-03-02"
>>
>> Is this a feature or a bug?
>>
>> I can work around it, now that I know what it is, but it took me a
>> few hours to diagnose.
>>
>> Thanks,
>> Spencer Graves
>>
>> p.s. I got this from scraping a website with code that had worked for
>> me roughly 20 months ago. I suspect that in the interim, someone
>> probably replaced ' 2 Mar 2018' with "&nbsp;2 Mar 2018".
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
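For readers who hit the same problem, here is a minimal sketch of the first workaround discussed above (replacing the no-break space before parsing). It uses only base R; the helper name `replace_nbsp` is hypothetical and not part of base R or textutils. The input string is written with the escape "\u00a0" to stand in for what an HTML decoder may return. It replaces only U+00A0 rather than every character matched by "\\h", and it assumes a locale in which %b matches the English abbreviation "Mar".

```r
# U+00A0 (no-break space) and U+0020 (ASCII space) look alike on the
# console but are different characters with different bytes.
charToRaw("\u00a0")  # c2 a0, the UTF-8 encoding of U+00A0
charToRaw(" ")       # 20

# Hypothetical helper: replace each no-break space with an ASCII space.
# fixed = TRUE treats the pattern as a literal string, not a regex.
replace_nbsp <- function(x) gsub("\u00a0", " ", x, fixed = TRUE)

# Stand-in for the scraped value with a leading no-break space.
nbsp_date <- "\u00a02 Mar 2018"

# Parsing the raw value returned NA in the report above; parsing the
# cleaned value uses the same format string as in the thread.
cleaned <- replace_nbsp(nbsp_date)
as.Date(cleaned, format = "%e %b %Y")
```

Replacing one known character keeps the change narrow. The broader gsub("\\h", " ", x, perl = TRUE) suggested in the thread would also normalize other horizontal-space characters, which may or may not be what a given scraper wants.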
Prof Brian Ripley
2022-Jul-07 11:59 UTC
[Rd] as.Date (and strptime?) does not recognize " " as a blank
There is some misunderstanding here. The space is part of the format
specified by SG to as.Date(), which passes it to strptime(). So SG
asked to match a space and complained that a different character is not
matched!

Reading the documentation of strptime shows

    ‘%n’ Newline on output, arbitrary whitespace on input.
    ‘%t’ Tab on output, arbitrary whitespace on input.

so one might hope that one could use those to specify whitespace instead
of ASCII space in the format. But unfortunately whether a Unicode
no-break space (U+00A0) is whitespace is a matter of opinion -- for
example the PCRE author changed his a few years back.

We don't have a reproducible example, but my attempt at reproduction
suggests that U+00A0 is not regarded as whitespace on the system I used.
We know this to be platform-specific (it uses the C function iswspace):
glibc does not regard this as whitespace and the replacement functions
used by R on macOS and Windows have followed suit.

In short, ASCII space matches only itself, and the interpretation of
'blank' (in regexps) or 'whitespace' (in strptime or regexps) is
platform-specific and liable to change.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
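Since whether ‘%t’ accepts U+00A0 is platform-specific, code that must run on several systems probably should not rely on it. The following is a rough sketch, not a recommendation from R-Core: it tries ‘%t’ first and falls back to replacing the no-break space for any element that failed to parse. The function name `parse_day_month_year` is hypothetical, and the sketch assumes a locale in which %b matches English month abbreviations.

```r
# Hypothetical helper, not part of base R.
parse_day_month_year <- function(x) {
  fmt <- "%t%d %b %Y"

  # First attempt: %t is documented as "arbitrary whitespace on input".
  # Whether that includes U+00A0 depends on the platform.
  out <- as.Date(x, format = fmt)

  # Fallback: for elements that did not parse, replace U+00A0 with an
  # ASCII space and try again with the same format.
  failed <- is.na(out)
  if (any(failed)) {
    cleaned <- gsub("\u00a0", " ", x[failed], fixed = TRUE)
    out[failed] <- as.Date(cleaned, format = fmt)
  }
  out
}

# Works for a leading ASCII space, a leading no-break space, or neither.
parse_day_month_year(c(" 2 Mar 2018", "\u00a02 Mar 2018", "2 Mar 2018"))
```

The fallback means the result does not depend on how the platform's iswspace classifies U+00A0: where ‘%t’ already accepts it the first attempt succeeds, and elsewhere the replacement step handles it.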