Scott Kostyshak
2018-May-04 20:47 UTC
[R] [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error
I have very little knowledge about file encodings and would like to learn more.

I've read the following pages to learn more:

http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
https://developer.r-project.org/Encodings_and_R.html

The last one, in particular, has been very helpful. I would be interested in any further references that you suggest.

I attach a file that reproduces the issue I would like to learn more about. I do not know if the file encoding will be correctly preserved through email, so I also provide the file (temporarily) on Dropbox here:

https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0

The file gives an error when using "source()" with the argument echo = TRUE:

> source("encoding_export_issue.R", echo = TRUE)
Error in nchar(dep, "c") : invalid multibyte string, element 1
In addition: Warning message:
In grepl("^[[:blank:]]*$", dep[1L]) :
  input string 1 is invalid in this locale

The problem comes from the "?" character in the .R file. The file appears to be encoded as "iso-8859-1":

$ file --mime-encoding encoding_export_issue.R
encoding_export_issue.R: iso-8859-1

Note that for me:

> getOption("encoding")
[1] "native.enc"

so "native.enc" is used for the "encoding" argument of source().

The following two calls succeed:

> source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
> source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")

Is this file a valid "iso-8859-1" encoded file? Why does source() fail in the case of encoding set to "native.enc"? Is it because of the settings to UTF-8 in my locale (see info on my system at the bottom of this email).

I'm guessing it would be a bad idea to put

options(encoding = "unknown")

in my .Rprofile, because it is difficult to always correctly guess the encoding of files? Is there a reason why setting it to "unknown" would lead to more problems than leaving it set to "native.enc"?

I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below is my session info and locale info for my system with the 3.4.3 version:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3

> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

Thanks for your time,

Scott

P.S. Note that I had posted this question to r-devel, which was the incorrect choice. For archival purposes, I reference the thread here:

https://www.mail-archive.com/search?l=mid&q=20180501185750.445oub53vcdnyyyx%40steph

--
Scott Kostyshak
Assistant Professor of Economics
University of Florida
https://people.clas.ufl.edu/skostyshak/

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: encoding_export_issue.R
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20180504/45bab4a0/attachment.ksh>
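The kind of failure reported above can be illustrated without the attachment. The sketch below rests on a stated assumption: the byte 0xE3 (a with tilde in ISO-8859-1) is a hypothetical stand-in for the non-ASCII character in encoding_export_issue.R, which did not survive in this archive. It shows that a lone ISO-8859-1 byte is not valid UTF-8, which is presumably why a UTF-8 locale rejects the line, and that declaring the source encoding lets iconv() convert it.

```r
# Hypothetical stand-in for the attached script: one assignment whose string
# contains the single byte 0xE3 (a with tilde in ISO-8859-1 / latin1).
path <- tempfile(fileext = ".R")
writeBin(c(charToRaw('x <- "'), as.raw(0xe3), charToRaw('"\n')), path)

# Read the line as raw, undeclared bytes (no re-encoding happens here).
line <- readLines(path, warn = FALSE)

# 0xE3 starts a three-byte UTF-8 sequence but is followed by a quote,
# so the line is not valid UTF-8.
print(validUTF8(line))

# Declaring the real encoding lets R translate the byte to UTF-8.
fixed <- iconv(line, from = "latin1", to = "UTF-8")
print(validUTF8(fixed))
print(charToRaw(fixed)[7:8])  # the two UTF-8 bytes that replace 0xE3
```

Both checks work on the bytes themselves, so they should not depend on the locale of the session that runs them.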
Ista Zahn
2018-May-04 22:58 UTC
[R] [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error
On Fri, May 4, 2018 at 4:47 PM, Scott Kostyshak <skostyshak at ufl.edu> wrote:
> Is this file a valid "iso-8859-1" encoded file?

The one you attached is not. The one linked to in dropbox is.

> Why does source() fail
> in the case of encoding set to "native.enc"? Is it because of the
> settings to UTF-8 in my locale (see info on my system at the bottom of
> this email).

Yes.

> I'm guessing it would be a bad idea to put
>
> options(encoding = "unknown")
>
> in my .Rprofile, because it is difficult to always correctly guess the
> encoding of files?

My guess is that the issue is less about the difficulty of guessing the encoding, and more about the time it takes to do so. That's not particularly relevant for the "source" function, but the encoding option is used by many of the file IO functions in R and so has implications well beyond the behavior of "source".

> Is there a reason why setting it to "unknown" would
> lead to more problems than leaving it set to "native.enc"?

It depends on what you are actually doing. If you are on a UTF-8 locale and working exclusively with UTF-8 files, setting options(encoding = "unknown") will just slow down your file IO by checking for the encoding every time.
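For reference, ?source documents that with encoding = "unknown" the result of utils::localeToCharset() is used as a guide, and that when several encodings are given they are tried in turn until the file can be read without error. The sketch below imitates that idea outside of source(). It is only an illustration: guess_encoding() is a hypothetical helper, not part of R, and the candidate list c("UTF-8", "latin1") is an assumption about what a Western UTF-8 locale might report.

```r
# Encodings this session would suggest for encoding = "unknown".
print(utils::localeToCharset())

# Hypothetical helper (not part of R): return the first candidate encoding
# in which every line of the file is valid.
guess_encoding <- function(path, candidates = c("UTF-8", "latin1")) {
  lines <- readLines(path, warn = FALSE)
  for (enc in candidates[!is.na(candidates)]) {
    ok <- if (identical(toupper(enc), "UTF-8")) {
      all(validUTF8(lines))                           # strict UTF-8 check
    } else {
      !anyNA(iconv(lines, from = enc, to = "UTF-8"))  # NA marks a failed conversion
    }
    if (ok) return(enc)
  }
  NA_character_
}

# Example file containing the lone latin1 byte 0xE3 (hypothetical content).
path <- tempfile(fileext = ".R")
writeBin(c(charToRaw('x <- "'), as.raw(0xe3), charToRaw('"\n')), path)
enc <- guess_encoding(path)
print(enc)

# The detected encoding can then be passed explicitly for this one call:
# source(path, echo = TRUE, encoding = enc)
```

Passing the encoding for a single call leaves the global "encoding" option at its default, so the other file IO functions mentioned above are not affected.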
Scott Kostyshak
2018-May-05 20:52 UTC
[R] [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error
On Fri, May 04, 2018 at 10:58:26PM +0000, Ista Zahn wrote:
> On Fri, May 4, 2018 at 4:47 PM, Scott Kostyshak <skostyshak at ufl.edu> wrote:
> > I'm guessing it would be a bad idea to put
> >
> > options(encoding = "unknown")
> >
> > in my .Rprofile, because it is difficult to always correctly guess the
> > encoding of files?
>
> My guess is that the issue is less about the difficulty of guessing
> the encoding, and more about the time it takes to do so. That's not
> particularly relevant for the "source" function, but the encoding
> option is used by many of the file IO functions in R and so has
> implications well beyond the behavior of "source".

Ah I did not think about this possibility. Makes sense.

> > Is there a reason why setting it to "unknown" would
> > lead to more problems than leaving it set to "native.enc"?
>
> It depends on what you are actually doing. If you are on a UTF-8
> locale and working exclusively with UTF-8 files, setting
> options(encoding = "unknown") will just slow down your file IO by
> checking for the encoding every time.

Good to know. Thank you for your response, Ista.

Scott

--
Scott Kostyshak
Assistant Professor of Economics
University of Florida
https://people.clas.ufl.edu/skostyshak/