Adam Obeng
2016-Jun-06 10:11 UTC
[Rd] readlines() truncates text file with Codepage 437 encoding
Hello r-devel,

The attached Code page 437-encoded file contains 245 characters (including the final newline), but readLines only reads 242 of them:

> test_text <- readLines(file('437__characters.txt', encoding='437'))
Warning message:
In readLines(file("437__characters.txt", :
  incomplete final line found on '437__characters.txt'
> test_text
[1] "\v\f\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\177 ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????"
> nchar(test_text)
[1] 242

You'll note that readLines hasn't read the final characters "??\n".

# Diagnostics

My best guess is that this is something to do with how readLines() determines when it has reached EOF, because of the following:

- The file is terminated with an ASCII LF (0x0a), but R gives an 'incomplete final line found' warning. Note that in some implementations of Code page 437, 0x0a is interpreted as a graphical character rather than a control character, but this does not seem to be the problem here. The same problem occurs if the file ends with 0x0d or 0x0d 0x0a.
- Adding seven or more characters to the end of the file makes it read correctly
- Similarly, the file is read correctly if you remove three characters from anywhere in the file
- The same issue seems to occur with reading files encoded in other DOS code pages

# Additional information

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin14.5.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

The same behaviour occurs under R 2.15.1 on a Linux server.

In case the attached file is somehow corrupted, here is a hexdump:

00000000: 0b0c 0e0f 1011 1213 1415 1617 1819 1a1b  ................
00000010: 1c1d 1e1f 2021 2223 2425 2627 2829 2a2b  .... !"#$%&'()*+
00000020: 2c2d 2e2f 3031 3233 3435 3637 3839 3a3b  ,-./0123456789:;
00000030: 3c3d 3e3f 4041 4243 4445 4647 4849 4a4b  <=>?@ABCDEFGHIJK
00000040: 4c4d 4e4f 5051 5253 5455 5657 5859 5a5b  LMNOPQRSTUVWXYZ[
00000050: 5c5d 5e5f 6061 6263 6465 6667 6869 6a6b  \]^_`abcdefghijk
00000060: 6c6d 6e6f 7071 7273 7475 7677 7879 7a7b  lmnopqrstuvwxyz{
00000070: 7c7d 7e7f ffad 9b9c 9da6 aeaa f8f1 fde6  |}~.............
00000080: faa7 afac aba8 8e8f 9280 90a5 999a e185  ................
00000090: a083 8486 9187 8a82 8889 8da1 8c8b a495  ................
000000a0: a293 94f6 97a3 9681 989f e2e9 e4e8 eae0  ................
000000b0: ebee e3e5 e7ed fc9e f9fb ecef f7f0 f3f2  ................
000000c0: a9f4 f5c4 b3da bfc0 d9c3 b4c2 c1c5 cdba  ................
000000d0: d5d6 c9b8 b7bb d4d3 c8be bdbc c6c7 ccb5  ................
000000e0: b6b9 d1d2 cbcf d0ca d8d7 cedf dcdb ddde  ................
000000f0: b0b1 b2fe 0a                             .....

Has anyone encountered something similar?

Kind regards,
Adam Obeng
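One way to tell whether the characters are lost in the re-encoding connection or were never in the file is to bypass readLines() entirely: read the raw bytes with readBin() and convert them in a single call to iconv(). The following is only a minimal sketch, not something taken from the report. It writes its own 245-byte test file (0x0b, 0x0c, 0x0e to 0xff, then a final 0x0a, in numeric order, which is an assumption and differs from the byte order in the hexdump), and it assumes the platform's iconv recognises the encoding name "CP437"; where it does not, iconv() returns NA.

```r
# Sketch: read a CP437 file as raw bytes and convert it in one step,
# bypassing the re-encoding connection that readLines() would use.
# Assumption: the local iconv implementation knows the name "CP437".

# Hypothetical test data: 0x0b, 0x0c, 0x0e..0xff and a final LF (245 bytes).
byte_values <- c(11:12, 14:255, 10L)
path <- tempfile("cp437_")
writeBin(as.raw(byte_values), path)

# Read every byte of the file; file.size() gives the exact length to request.
bytes <- readBin(path, what = "raw", n = file.size(path))
unlink(path)

# iconv() accepts a list of raw vectors and returns a character vector
# (NA_character_ if the conversion fails or the encoding name is unknown).
txt <- iconv(list(bytes), from = "CP437", to = "UTF-8")

length(bytes)                # number of bytes actually read from disk
nchar(txt, type = "chars")   # number of characters after conversion
```

If the byte count matches the file size while readLines() on a connection opened with encoding='437' reports fewer characters, that would point at the connection's re-encoding step rather than at the file.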
Martin Maechler
2016-Jun-08 08:50 UTC
[Rd] readlines() truncates text file with Codepage 437 encoding
Appended is the file -- you need to tell your e-mail software to use one of the MIME types that R-devel does accept; text/plain is what I chose.

(Yes, as R mailing list server "operator", with a bit of detective work, I was able to find the "uncleaned" e-mail and extract the attachment from it.)

Martin Maechler
ETH Zurich

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 437__characters.txt
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20160608/d0b88325/attachment.txt>
-------------- next part --------------

>>>>> Adam Obeng <adam.obeng at columbia.edu>
>>>>>     on Mon, 6 Jun 2016 11:11:21 +0100 writes:

    > Hello r-devel, The attached Code page 437-encoded file
    > contains 245 characters (including the final newline), but
    > readLines only reads 242 of them:
    > [...]
Martin Maechler
2016-Jun-09 14:40 UTC
[Rd] readlines() truncates text file with Codepage 437 encoding
I can reproduce the issue on Linux (Fedora F22), R 3.3.0 patched of today.

Here's code for experimenting that reproduces the issue without the need for an attached file (a temporary file is created and removed as part of the function below):

##---------------------------------------------------------------------------
##' @title write-binary-readLines testing
##' @param i vector of integers in 0:255 to be used as character codes
##' @param file.name optional
##' @param encoding "437" is the one where the problem has been reported
##' @return the readLines() resulting character string with attributes
##' @author Martin Maechler
wb.readL <- function(i, file.name = tempfile("bin"), encoding = "437") {
    stopifnot(is.integer(i), 0 <= i, i <= 255, is.character(file.name))
    ff <- file(file.name, "wb")
    writeBin(as.raw(i), ff)
    close(ff) ; on.exit(unlink(file.name))
    ## Now read "as codepage" :
    ch <- readLines(file(file.name, encoding = encoding))
    ##    ---------                 ------------------- typically gives warning
    structure(ch, fSize = file.size(file.name),
              nchars = c(b = nchar(ch, "b"), c = nchar(ch, "c"), w = nchar(ch, "w")))
}
ii <- c(11:12, 14:255, 10L)
(cc <- wb.readL(ii))
##---------------------------------------------------------------------------

gives

> (cc <- wb.readL(ii))
[1] "\v\f\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\177?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????"
attr(,"fSize")
[1] 245
attr(,"nchars")
  b   c   w 
427 241 241 
Warning message:
In readLines(file(file.name, encoding = encoding)) :
  incomplete final line found on '/tmp/RtmpaPyDyp/bin65842896d5f1'
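The observations in the original report (appending seven or more characters, or removing three, changes the outcome) suggest probing how the number of characters read depends on the file length. The sketch below is not part of the thread. `count_read()` is a hypothetical helper that writes bytes to a temporary file, reads them back through an explicitly opened and closed re-encoding connection, and reports how many characters arrive. No output is shown, because the results depend on the R version and on the iconv implementation; if the encoding name "437" is not supported locally, the `read` column is simply NA.

```r
# Hypothetical helper: write 'bytes' to a temp file, read it back through a
# re-encoding connection, and return the total number of characters read.
count_read <- function(bytes, encoding = "437") {
  path <- tempfile("bin")
  on.exit(unlink(path))
  writeBin(as.raw(bytes), path)
  con <- file(path, open = "r", encoding = encoding)
  # Replace the earlier handler so the connection is closed before unlinking.
  on.exit({ close(con); unlink(path) })
  lines <- suppressWarnings(readLines(con))
  # allowNA = TRUE yields NA instead of an error for invalid strings.
  sum(nchar(lines, type = "chars", allowNA = TRUE))
}

# Bytes preceding the final LF, as in the reproduction: 0x0b, 0x0c, 0x0e..0xff.
prefix <- c(11:12, 14:255)
pad <- 0:8   # number of extra "x" bytes (120L) appended before the LF

read <- vapply(pad, function(k) {
  bytes <- c(prefix, rep(120L, k), 10L)
  # An unsupported encoding name makes file() fail; report NA in that case.
  tryCatch(count_read(bytes), error = function(e) NA_integer_)
}, integer(1))

# 'expected' counts every character except the terminating LF.
res <- data.frame(pad = pad, expected = length(prefix) + pad, read = read)
res
```

Rows where `read` is smaller than `expected` mark the file lengths at which the tail is dropped; the same loop can be rerun with other encoding names to check the remark about other DOS code pages.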