Spencer Graves
2022-Jun-25 06:37 UTC
[Rd] as.Date (and strptime?) does not recognize "&nbsp;" as a blank
Hello, All:

When is a space not a space? Consider the following:

> (pblmDate <- textutils::HTMLdecode("&nbsp;2 Mar 2018"))
[1] " 2 Mar 2018"
> as.Date(pblmDate, format='%e %b %Y')
[1] NA
> as.Date(' 2 Mar 2018', format='%e %b %Y')
[1] "2018-03-02"

Is this a feature or a bug? I can work around it, now that I know what it is, but it took me a few hours to diagnose.

Thanks,
Spencer Graves

p.s. I got this from scraping a website with code that had worked for me roughly 20 months ago. I suspect that in the interim, someone probably replaced ' 2 Mar 2018' with "&nbsp;2 Mar 2018".
Spencer Graves
2022-Jun-25 06:47 UTC
[Rd] as.Date (and strptime?) does not recognize "&nbsp;" as a blank
p.s. Is there a way to get XML::readHTMLTable to automatically convert "&nbsp;" to a normal blank space?
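Whether XML::readHTMLTable can do this conversion itself is not settled in this thread. One possible fallback is to clean the table after it has been read. The following is only a minimal sketch under stated assumptions: `fix_nbsp` is a hypothetical helper, the `scraped` data frame is a made-up stand-in for a scraped table, and the text columns are assumed to be character vectors rather than factors.

```r
# Hypothetical helper: replace non-breaking spaces (U+00A0) with ordinary
# spaces in every character column of a data frame.
fix_nbsp <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    gsub("\u00a0", " ", col, fixed = TRUE)
  })
  df
}

# Made-up stand-in for a scraped table; the first date starts with U+00A0.
scraped <- data.frame(
  date  = c("\u00a02 Mar 2018", "15 Mar 2018"),
  value = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

cleaned <- fix_nbsp(scraped)
cleaned$date
```

Non-character columns such as `value` are left untouched, so the helper can be applied to a whole table before any date parsing.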
Maxim Nazarov
2022-Jun-25 11:10 UTC
[Rd] as.Date (and strptime?) does not recognize "&nbsp;" as a blank
Hello,

> When is a space not a space?

I guess the answer is when it is a non-breaking one?.. We can observe:

> charToRaw(textutils::HTMLdecode("&nbsp;"))
[1] c2 a0
> charToRaw(" ")
[1] 20

So one can argue that everything works correctly: the `textutils` function converts HTML's non-breaking space '&nbsp;' into R's non-breaking space '\xa0', while the %e format of as.Date expects a 'normal' space. But this is obviously not user-friendly, especially since both symbols are displayed the same way on the console.

So your options might be to either:

* manually change all 'weird' spaces into normal ones with something like gsub("\\h", " ", ..., perl = TRUE) - for the list of other weird spaces see https://www.pcre.org/original/doc/html/pcrepattern.html#genericchartypes
* persuade the textutils author to change &nbsp; into a normal space (they seem to be working with a simple lookup table - https://github.com/enricoschumann/textutils/blob/b813c7bd4b55daef5fa7612e3fbfe82962711940/R/char_refs.R#L1465-L1466)
* persuade R-Core (or submit a PR) to relax the expectations of as.Date/strptime

Kind regards,
Maxim Nazarov
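The first option above can be sketched as follows. This is a minimal illustration, not a definitive fix: `normalize_spaces` is a hypothetical helper name, and the literal "\u00a0" stands in for the character that HTMLdecode produces from "&nbsp;", so the example does not depend on textutils. Parsing "%b" also assumes a locale with English month abbreviations.

```r
# Hypothetical helper: replace every horizontal-whitespace character
# (including the non-breaking space U+00A0) with an ordinary space.
normalize_spaces <- function(x) {
  gsub("\\h", " ", x, perl = TRUE)
}

# "\u00a0" is a non-breaking space, standing in for the decoded "&nbsp;".
pblmDate <- "\u00a02 Mar 2018"

charToRaw(pblmDate)                  # begins with c2 a0 rather than 20

cleaned <- normalize_spaces(pblmDate)
charToRaw(substr(cleaned, 1, 1))     # 20, an ordinary space

# 'cleaned' is now the same string as the plain-space input in the
# original post, so as.Date() should parse it the same way.
as.Date(cleaned, format = "%e %b %Y")
```

Checking the raw bytes before and after is a cheap way to confirm that the invisible character was actually replaced, since both kinds of space print identically on the console.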