John.Maindonald at anu.edu.au
2006-Jul-05 09:35 UTC
[Rd] read.table() errors with tab as separator (PR#9061)
(1) read.table(), with sep="\t", identifies 13 out of 1400 records,
in a file with 1400 records of 3 fields each, as having only 2 fields.
This happens under version 2.3.1 for Windows as well as with
R 2.3.1 for Mac OS X, and with R-devel under Mac OS X.
[R version 2.4.0 Under development (unstable) (2006-07-03 r38478)]

(2) Using read.table() with sep="\t", the first 1569 records only
of a 1821 record file are input. The file has exactly two fields
in each record, and the minimum length of the second field is
1 character. If however I extract lines 1561 to 1650 from the
file (the file "short.txt" below), all 90 lines are input.

> webtwo <- "http://www.maths.anu.edu.au/~johnm/testfiles/twotabs.txt"
> xy <- read.table(url(webtwo), sep="\t")
Warning message:
number of items read is not a multiple of the number of columns
> z <- count.fields(url(webtwo), sep="\t")
> table(z)
z
   2    3
  13 1387
> table(sapply(strsplit(readLines(url(webtwo)), split="\t"), length))

   3
1400
> readLines(url(webtwo))[z==2][9:13]  # last 5 as a sample (shorter lines)
[1] "865\tlinear model (lm)! Cook's distance\t152"
[2] "1019\tlinear model (lm)! Cook's distance\t177"
[3] "1048\tlinear model (lm)! Cook's distance\t183"
[4] "1082\tlinear model (lm)! Cook's distance\t187"
[5] "1220\tlinear model (lm)! Cook's distance\t214"

> weblong <- "http://www.maths.anu.edu.au/~johnm/testfiles/long.txt"
> webshort <- "http://www.maths.anu.edu.au/~johnm/testfiles/short.txt"
> xyLong <- read.table(url(weblong), sep="\t")
> dim(xyLong)   # Should be 1821 x 2
[1] 1569    2
> xyShort <- read.table(url(webshort), sep="\t")
> dim(xyShort)  # Should be, and will be, 90 x 2
[1] 90  2
> long <- readLines(url(weblong))
> short <- readLines(url(webshort))
> length(long)
[1] 1821
> length(short)
[1] 90
> all(long[1561:1650]==short)  # short is lines 1561:1650 of long
[1] TRUE
> ## Moreover strsplit() can pick up the \t's correctly
> lsplit <- strsplit(long, "\t")
> table(sapply(lsplit, length))

   2
1821
> # Try also table(sapply(lsplit, function(x)x[2]))

--please do not edit the information below--

Version:
 platform = powerpc-apple-darwin8.6.0
 arch = powerpc
 os = darwin8.6.0
 system = powerpc, darwin8.6.0
 status =
 major = 2
 minor = 3.1
 year = 2006
 month = 06
 day = 01
 svn rev = 38247
 language = R
 version.string = Version 2.3.1 (2006-06-01)

Locale:
C

Search Path:
 .GlobalEnv, package:lattice, package:methods, package:stats,
 package:graphics, package:grDevices, package:utils, package:datasets,
 Autoloads, package:base
Peter Dalgaard
2006-Jul-05 09:50 UTC
[Rd] read.table() errors with tab as separator (PR#9061)
John.Maindonald at anu.edu.au writes:

> (1) read.table(), with sep="\t", identifies 13 out of 1400 records,
> in a file with 1400 records of 3 fields each, as having only 2 fields.
> This happens under version 2.3.1 for Windows as well as with
> R 2.3.1 for Mac OS X, and with R-devel under Mac OS X.
> [R version 2.4.0 Under development (unstable) (2006-07-03 r38478)]
>
> (2) Using read.table() with sep="\t", the first 1569 records only
> of a 1821 record file are input. The file has exactly two fields
> in each record, and the minimum length of the second field is
> 1 character. If however I extract lines 1561 to 1650 from the
> file (the file "short.txt" below), all 90 lines are input.

Notice that the single quote is a quote character in read.table (as
opposed to read.delim, which uses only the double quote, to cater for
TAB-separated files from Excel & friends).

> [1] "865\tlinear model (lm)! Cook's distance\t152"
                                   ^ !!!!

(This reminds me that we probably should shift the default for
comment.char too since it leads to similar issues, but it seems not to
be the problem in this case.)

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
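
The point about quote characters can be sketched as follows. This is only an illustration, not part of the original exchange: the three records are made up (the first two are modelled on the lines quoted in the report, the third is invented), the object names are arbitrary, and the text= argument assumes a version of R considerably more recent than 2.3.1. The sketch turns quote interpretation off with quote="" (and comment handling off with comment.char=""), and also shows read.delim(), which treats only the double quote as a quote character.

```r
# Three tab-separated records with 3 fields each; two of them contain
# an apostrophe in the second field (illustrative data, not the real file).
lines <- c("865\tlinear model (lm)! Cook's distance\t152",
           "1019\tlinear model (lm)! Cook's distance\t177",
           "1048\tplain entry\t183")

# Count fields with quote interpretation switched off, so the apostrophe
# is treated as an ordinary character.
con <- textConnection(lines)
n_noquote <- count.fields(con, sep = "\t", quote = "")
close(con)
print(n_noquote)

# read.table() with quoting and comment characters disabled.
fixed <- read.table(text = lines, sep = "\t", quote = "",
                    comment.char = "", stringsAsFactors = FALSE)
print(dim(fixed))

# read.delim() uses tab as separator and only the double quote as a
# quote character; header = FALSE because the sample data has no header row.
delim <- read.delim(text = lines, header = FALSE, stringsAsFactors = FALSE)
print(dim(delim))
```

For a real file one would pass the file name or url() connection in place of text=; whether quote="" is appropriate depends on whether any fields are genuinely quoted.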