Sean O'Riordain
2013-Apr-10 14:20 UTC
[Rd] Issue with Control-Z in a text file on Windows - readLines() appears to truncate
Working on Windows I have had to deal with CSV files that, unfortunately, contain embedded Control-Zs, i.e. ASCII character 26 in decimal, and the readLines() function in R on Windows (2.15.2 and 3.0.0) appears to truncate at the Control-Z. There is no problem at all on Ubuntu Linux with R 3.0.0.

Am I mistaken or is this genuine?

# Create a small file with embedded Control-Z
h3 <- paste('1,34,44.4,"', rawToChar(as.raw(c(65, 26, 65))), '",99')
h3
# "1,34,44.4,\" A\032A \",99"
writeLines(h3, 'h3.txt')

# now attempt to read the file back in
h3a <- readLines('h3.txt')
# but on Windows 2.15.2 and 3.0.0 I get the message
#Warning message:
#In readLines("h3.txt") : incomplete final line found on 'h3.txt'
h3a
# [1] "1,34,44.4,\" A"
# so it drops from the Control-Z onwards

####
# The following is my rough and ready workaround - I'm sure there is a cleaner way
fnam <- 'h3.txt'
tmp.bin <- readBin(fnam, raw(), size=1, n=max(2*file.info(fnam)$size, 100))
tmp.char <- rawToChar(tmp.bin)
txt <- unlist(strsplit(tmp.char, '\r\n', fixed=TRUE))
txt
# [1] "1,34,44.4,\" A\032A \",99"

This was on 64-bit R on 64-bit Windows 7, but it also appears to be the case in 32-bit R 2.15.2 on 32-bit Windows 7 inside VirtualBox.

Kind regards,
Sean O'Riordain
Trinity College Dublin
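
One way to confirm that the Control-Z byte really is on disk, so that any truncation happens at read time rather than at write time, is to count the raw bytes directly. The following is only a sketch: `count_ctrl_z` is a hypothetical helper name that does not appear in the original post, and a temporary file stands in for 'h3.txt'.

```r
# Hypothetical helper (not from the original post): count Ctrl-Z bytes in a file.
count_ctrl_z <- function(path) {
  # Given a file name, readBin() reads the raw bytes, so no text-mode
  # translation is applied to the content.
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  sum(bytes == as.raw(26))
}

# Recreate the example line in a temporary file (stand-in for 'h3.txt').
tmp <- tempfile(fileext = ".txt")
h3 <- paste('1,34,44.4,"', rawToChar(as.raw(c(65, 26, 65))), '",99')
writeLines(h3, tmp)

n_ctrl_z <- count_ctrl_z(tmp)
n_ctrl_z
```

A non-zero count together with a shortened result from readLines() would suggest that the reader, not the writer, is dropping the data.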
Duncan Murdoch
2013-Apr-10 19:47 UTC
[Rd] Issue with Control-Z in a text file on Windows - readLines() appears to truncate
On 10/04/2013 10:20 AM, Sean O'Riordain wrote:
> Working on Windows I have had to deal with CSV files that,
> unfortunately, contain embedded Control-Zs, i.e. ASCII character 26 in
> decimal, and the readLines() function in R on Windows (2.15.2 and
> 3.0.0) appears to truncate at the control-Z. There is no problem at
> all on Ubuntu Linux with R 3.0.0.
>
> Am I mistaken or is this genuine?

Ctrl-Z is the old text file EOF marker from MSDOS. readLines() normally reads files in text mode using the Microsoft Visual C libraries, so I wouldn't be surprised if they respect Ctrl-Z as EOF.

A simpler workaround than the one you used is to read the file in binary mode, e.g.

f <- file("h3.txt", "rb")
readLines(f)
close(f)

See the ?file help topic for a discussion of the limitations this may impose on you.

Duncan Murdoch
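
The binary-mode suggestion above could be wrapped in a small function so that the connection is always closed, even if readLines() fails. This is a minimal sketch rather than anything posted in the thread: `read_lines_binary` is a hypothetical name, and a temporary file replaces "h3.txt". It relies on the documented behaviour that readLines() accepts LF, CRLF or CR as the line terminator whatever mode the connection was opened in.

```r
# Hypothetical wrapper (name not from the thread) around the "rb" workaround.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")   # binary mode: no Ctrl-Z-as-EOF handling
  on.exit(close(con))              # close the connection even on error
  readLines(con)
}

# Round-trip a line containing an embedded Ctrl-Z (ASCII 26).
tmp2 <- tempfile(fileext = ".txt")
line_in <- paste0('1,34,44.4,"A', rawToChar(as.raw(26)), 'A",99')
writeLines(line_in, tmp2)

line_out <- read_lines_binary(tmp2)
identical(line_out, line_in)
```

If the final comparison is TRUE, the text after the Ctrl-Z has survived the read; the limitations discussed in ?file still apply to a connection opened this way.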