>>>>> peter dalgaard
>>>>>     on Sun, 3 Jun 2018 23:51:24 +0200 writes:

> Looks like this actually comes from readLines(), nothing
> to do with source() as such: In current R-devel (still):
>
>> f <- file("http://home.versanet.de/~s-berman/source2.R", encoding="UTF-8")
>> readLines(f)
> character(0)
>> close(f)
>> f <- file("http://home.versanet.de/~s-berman/source2.R")
>> readLines(f)
> [1] "source.test2 <- function() {" "  print(\"Non-ascii: äöüß\")"
> [3] "}"
>
> -pd

and that's not even readLines(), but rather how exactly the
connection is defined [even in your example above]

  > urlR <- "http://home.versanet.de/~s-berman/source2.R"
  > readLines(urlR, encoding="UTF-8")
  [1] "source.test2 <- function() {" "  print(\"Non-ascii: äöüß\")"
  [3] "}"
  > f <- file(urlR, encoding = "UTF-8")
  > readLines(f)
  character(0)

and the same behavior with scan() instead of readLines() :

  > scan(urlR,"") # works
  Read 7 items
  [1] "source.test2"       "<-"                 "function()"         "{"
  [5] "print(\"Non-ascii:" "äöüß\")"            "}"
  > scan(f,"") # fails
  Read 0 items
  character(0)
  >

So it seems as if the bug is in the file() [or url()] C code ..

But then we also have to consider Windows .. where I think most changes have
happened during the R-3.4.4 --> R-3.5.0 transition.

>> On 2 Jun 2018, at 15:37 , Stephen Berman <stephen.berman at gmx.net> wrote:
>>
>> In R 3.5.0 using the `encoding' argument of source() prevents loading
>> files from the internet; without the `encoding' argument files can be
>> loaded from the internet, but if they contain non-ascii characters,
>> these are not correctly displayed under MS-Windows (but they are
>> correctly displayed under GNU/Linux).  With R 3.4.{2,3,4} there is no
>> such problem: using `encoding' the files are loaded and non-ascii
>> characters are correctly displayed under MS-Windows (but not without
>> `encoding').  Here is a transcript from R 3.5.0 under GNU/Linux (the
>> URLs are real, in case anyone wants to try and reproduce the problem):
>>
>>> ls()
>> character(0)
>>> source("http://home.versanet.de/~s-berman/source1.R", encoding="UTF-8")
>>> ls()
>> character(0)
>>> source("http://home.versanet.de/~s-berman/source2.R", encoding="UTF-8")
>>> ls()
>> character(0)
>>> source("http://home.versanet.de/~s-berman/source1.R")
>>> ls()
>> [1] "source.test1"
>>> source("http://home.versanet.de/~s-berman/source2.R")
>>> ls()
>> [1] "source.test1" "source.test2"
>>> source.test1()
>> [1] "This is a test."
>>> source.test2()
>> [1] "Non-ascii: äöüß"
>>
>> (The four non-ascii characters are Unicode 0xE4, 0xF6, 0xFC, 0xDF.)
>> With 3.5.0 under MS-Windows, the transcript is the same except for the
>> display of the last output, which is this:
>>
>> [1] "Non-ascii: ????????"
>>
>> (Here there are eight non-ascii characters, which display the Unicode
>> decompositions of the four non-ascii characters above.)
>>
>> Here is a transcript from R 3.4.3 under MS-Windows (under GNU/Linux it's
>> the same except that the non-ascii characters are also correctly
>> displayed even without the `encoding' argument):
>>
>>> ls()
>> character(0)
>>> source("http://home.versanet.de/~s-berman/source1.R")
>>> ls()
>> [1] "source.test1"
>>> source("http://home.versanet.de/~s-berman/source2.R")
>>> ls()
>> [1] "source.test1" "source.test2"
>>> source.test1()
>> [1] "This is a test."
>>> source.test2()
>> [1] "Non-ascii: ????????"
>>> rm(source.test2)
>>> ls()
>> [1] "source.test1"
>>> source("http://home.versanet.de/~s-berman/source2.R", encoding="UTF-8")
>>> ls()
>> [1] "source.test1" "source.test2"
>>> source.test2()
>> [1] "Non-ascii: äöüß"
>>
>> I did a web search but didn't find any reports of this issue, nor did I
>> see any relevant entry in the 3.5.0 NEWS, so this looks like a bug, but
>> maybe I've overlooked something.  I'd be grateful for any enlightenment.
>>
>> Steve Berman
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

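The two calls contrasted in this message use `encoding` in different ways. As far as the R documentation describes it, the `encoding` argument of readLines() only marks the returned strings as UTF-8 or Latin-1, while the `encoding` argument of file() or url() asks the connection itself to re-encode the input while reading. The following is a minimal sketch of that difference on a local temporary file, not a reproduction of the URL bug; the helper names `read_marked` and `read_reencoded` are made up for illustration and are not part of R.

```r
# Hypothetical helpers, for illustration only (not part of R).

# Path-based read: 'encoding' only declares how the returned strings are marked.
read_marked <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# Connection-based read: 'encoding' makes the connection re-encode the input
# to the native encoding while it is being read.
read_reencoded <- function(path) {
  con <- file(path, encoding = "UTF-8")
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Write the UTF-8 bytes for U+00E4 U+00F6 U+00FC U+00DF to a temporary file.
tmp <- tempfile(fileext = ".R")
writeLines("print(\"Non-ascii: \u00e4\u00f6\u00fc\u00df\")", tmp, useBytes = TRUE)

marked <- read_marked(tmp)
Encoding(marked)  # shows how the path-based result is marked
```

Calling `read_reencoded(tmp)` in the same session gives the connection-based result for comparison; what it prints depends on the native encoding of the session, so no output is shown here.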
On Mon, 4 Jun 2018 10:44:11 +0200 Martin Maechler <maechler at stat.math.ethz.ch> wrote:

>>>>>> peter dalgaard
>>>>>>     on Sun, 3 Jun 2018 23:51:24 +0200 writes:
>
> [...]
>
> So it seems as if the bug is in the file() [or url()] C code ..

Yes, the problem seems to be restricted to loading files from a
(non-local) URL; i.e. this works fine on my computer:

  > source("file:///home/steve/prog/R/source2.R", encoding="UTF-8")

Also, I noticed this works too:

  > read.table("http://home.versanet.de/~s-berman/table2", encoding="UTF-8", skip=1)

where (if I read the source correctly) using `skip=1' makes read.table()
call readLines().  (The read.table() invocation also works without `skip'.)

> But then we also have to consider Windows .. where I think most changes have
> happened during the R-3.4.4 --> R-3.5.0 transition.

Yes, please.  I need (or at least it would be convenient) to be able to
load R code containing non-ascii characters from the web under MS-Windows.

Steve Berman

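Since source() with `encoding` is reported above to work for a local file, one possible stopgap is to fetch the script to a temporary file first and source the local copy. This is only a sketch under that assumption, not a fix for the underlying connection bug and not something proposed in the thread; the helper name `source_via_tempfile` is hypothetical.

```r
# Hypothetical helper (not part of R): download a remote script to a
# temporary file, then source() the local copy with an explicit encoding.
source_via_tempfile <- function(url, encoding = "UTF-8", envir = globalenv()) {
  tmp <- tempfile(fileext = ".R")
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" keeps the bytes exactly as served, which matters on Windows.
  utils::download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  # Evaluate the script in 'envir', reading the local file as 'encoding'.
  source(tmp, local = envir, encoding = encoding)
  invisible(NULL)
}
```

A call would then look like `source_via_tempfile("http://home.versanet.de/~s-berman/source2.R")`. Whether this displays the non-ascii characters correctly under MS-Windows has not been checked here.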
It's not Windows-specific, though. My example was on a Mac...

I hope we can sort this out before 3.5.1.

-pd

> On 4 Jun 2018, at 10:44 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> So it seems as if the bug is in the file() [or url()] C code ..
> But then we also have to consider Windows .. where I think most changes have
> happened during the R-3.4.4 --> R-3.5.0 transition.
>
> [...]

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

On R 3.5.0 (Mac)

The issue appears when using the default (libcurl) method and specifying the encoding.

Note that using method='internal' causes a segfault if used in conjunction with encoding
(and works when encoding is not set).

urlR <- "http://home.versanet.de/~s-berman/source2.R"

# works
url_default <- url(urlR)
scan(url_default, "")
# Read 7 items
# [1] "source.test2" "<-" "function()" "{" "print(\"Non-ascii:" "äöüß\")"
# [7] "}"

url_default_en <- url(urlR, encoding = "UTF-8")
scan(url_default_en, "")
# Read 0 items
# character(0)

url_internal <- url(urlR, method = 'internal')
scan(url_internal, "")
# Read 7 items
# [1] "source.test2" "<-" "function()" "{" "print(\"Non-ascii:" "äöüß\")"
# [7] "}"

url_internal_en <- url(urlR, encoding = "UTF-8", method = 'internal')
#scan(url_internal_en, "")
#*** caught segfault ***
# address 0x0, cause 'memory not mapped'

url_libcurl <- url(urlR, method = 'libcurl')
scan(url_libcurl, "")
# Read 7 items
# [1] "source.test2" "<-" "function()" "{" "print(\"Non-ascii:" "äöüß\")"
# [7] "}"

url_libcurl_en <- url(urlR, encoding = "UTF-8", method = 'libcurl')
scan(url_libcurl_en, "")
# Read 0 items
# character(0)

Michael

________________________________________
From: R-devel [r-devel-bounces at r-project.org] on behalf of Stephen Berman [stephen.berman at gmx.net]
Sent: Monday, 4 June 2018 7:26 PM
To: Martin Maechler
Cc: R-devel
Subject: Re: [Rd] encoding argument of source() in 3.5.0

[...]

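The with/without-encoding comparison above can be wrapped in a small helper so that the same check is easy to repeat for other connection types. This is a rough sketch, not code from the thread: `compare_encoding` is a made-up name, and it counts lines with readLines() rather than items with scan(). Given the segfault reported above, it should not be pointed at url(..., method = 'internal') on R 3.5.0.

```r
# Hypothetical helper (not part of R): open a connection twice, once without
# and once with an 'encoding' argument, and report how many lines each yields.
compare_encoding <- function(open_con, encoding = "UTF-8") {
  read_n <- function(...) {
    con <- open_con(...)
    on.exit(close(con))
    length(readLines(con, warn = FALSE))
  }
  c(plain = read_n(), encoded = read_n(encoding = encoding))
}

# 'open_con' is any function that forwards its arguments to a connection
# constructor, for example (network access required):
#   urlR <- "http://home.versanet.de/~s-berman/source2.R"
#   compare_encoding(function(...) url(urlR, ...))
#   compare_encoding(function(...) url(urlR, method = "libcurl", ...))
```

On an affected build, the report above suggests the `encoded` count would be 0 while `plain` is not; on a local file() connection both counts should agree.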