Hi all,

I'm getting the following error from substring:

> substr("<I>Jens Oehlschl\xe4gel-Akiyoshi", 1, 100)
Error in substr("<I>Jens Oehlschl\xe4gel-Akiyoshi", 1, 100) :
  invalid multibyte string at '<e4>gel-A<6b>iyoshi'

Is that normal / intended? I've tried setting the Encoding/locale to Latin-1/UTF-8 but that does not help. nchar gives me something similar:

> nchar("<I>Jens Oehlschl\xe4gel-Akiyoshi")
Error in nchar("<I>Jens Oehlschl\xe4gel-Akiyoshi") :
  invalid multibyte string, element 1

I find it strange that substr/nchar give an error but regexpr works for telling me the length:

> regexpr(".*", "<I>Jens Oehlschl\xe4gel-Akiyoshi")
[1] 1
attr(,"match.length")
[1] 29

Is that inconsistency normal/intended?

btw this example comes from our very own list:

> readLines("https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html")[28]
[1] "<I>Jens Oehlschl\xe4gel-Akiyoshi"

Best,
Toby
On Fri, 26 Jun 2020 15:57:06 -0700
Toby Hocking <tdhock5 at gmail.com> wrote:

> invalid multibyte string at '<e4>gel-A<6b>iyoshi'

> https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html

The server says that the text is UTF-8:

 curl -sI \
  https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html | \
  grep Content-Type
 # Content-Type: text/html; charset=UTF-8

But it's not, at least not all of it. If you ask readLines to mark the text as Latin-1, you get Jens Oehlschlägel-Akiyoshi without the mojibake and invalid multi-byte characters:

 x <- readLines(
  'https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html',
  encoding = 'latin1'
 )[28]
 substr(x, 1, 100)
 # [1] "<I>Jens Oehlschlägel-Akiyoshi"

The behaviour we observe when encoding = 'latin1' is not specified results from returned lines having "unknown" encoding. The substr() implementation tries to interpret such strings according to multi-byte C locale rules (using mbrtowc(3)). On my system (yours too, probably, if it's GNU/Linux or macOS), the multi-byte C locale encoding is UTF-8, and this Latin-1 string does not result in valid code points when decoded as UTF-8.

-- 
Best regards,
Ivan
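As a small illustration of the explanation above (a sketch, not code from the thread): the snippet below rebuilds the string from the original post, checks that its bytes are not valid UTF-8, and converts it with iconv() on the assumption that the bytes really are Latin-1. The names x and y are only illustrative. Because the converted string is marked as UTF-8, nchar() and substr() no longer have to guess how to decode an "unknown" string from the session locale.

```r
# The string from the original post; byte 0xe4 is a-umlaut in Latin-1.
x <- "<I>Jens Oehlschl\xe4gel-Akiyoshi"

# The raw bytes do not form a valid UTF-8 sequence.
validUTF8(x)              # FALSE

# Counting bytes never decodes the string, so it works in any locale.
nchar(x, type = "bytes")  # 29

# Assumption: the bytes are Latin-1. Convert them to UTF-8;
# iconv() marks the result as UTF-8.
y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)               # "UTF-8"

# Character-based functions now work on the converted string.
nchar(y)                  # 29 characters
nchar(y, type = "bytes")  # 30 bytes, since a-umlaut takes two bytes in UTF-8
substr(y, 1, 100)
```

If the real encoding of the input is not known in advance, validUTF8() can serve as a cheap check before deciding whether a conversion is needed.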
Thanks for the quick response Ivan. readLines with encoding='latin1' works for me (on Ubuntu).

However I was more concerned with the inconsistency in results between substr and regexpr. I was expecting that if one of them errors because of an unknown encoding then the other should as well. Even better, if regexpr works, why shouldn't substr work as well?

Incidentally the analogous stringi function stri_sub works fine in this case:

> stringi::stri_sub("<I>Jens Oehlschl\xe4gel-Akiyoshi", 1, 100)
[1] "<I>Jens Oehlschl\xe4gel-Akiyoshi"

But the stringi analog to nchar gives a similar warning:

> stringi::stri_length("<I>Jens Oehlschl\xe4gel-Akiyoshi")
[1] NA
Warning message:
In stringi::stri_length("<I>Jens Oehlschl\xe4gel-Akiyoshi") :
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()

On Sat, Jun 27, 2020 at 2:12 AM Ivan Krylov <krylov.r00t at gmail.com> wrote:
> [...]
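One way to get lengths that agree with each other, sketched here on the assumption that byte counts are acceptable for the task, is to ask for byte-level behaviour explicitly. This does not explain why the default regexpr() call succeeds where substr() fails; it only shows calls that avoid decoding the string at all. The variable names are illustrative.

```r
x <- "<I>Jens Oehlschl\xe4gel-Akiyoshi"

# Byte count: no decoding, so no "invalid multibyte string" error.
n_bytes <- nchar(x, type = "bytes")

# Byte-wise regular expression matching reports the same length.
m <- regexpr(".*", x, useBytes = TRUE)
match_len <- attr(m, "match.length")

# allowNA = TRUE asks nchar() to return NA rather than stop when the
# string cannot be decoded in the current locale. The value therefore
# depends on the locale the session runs in.
n_chars <- nchar(x, type = "chars", allowNA = TRUE)

c(bytes = n_bytes, match = match_len, chars = n_chars)
```

In a UTF-8 locale the last value should come back as NA, which mirrors what stri_length() reports above.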