I have a commonly recurring problem and wondered if folks would share tips. I routinely get tab-delimited text files that I need to read in. In very many cases, I get:

> a <- read.table('junk.txt.txt',header=T,skip=10,sep="\t")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
        line 67 did not have 88 elements

I am typically able to go through the file and find a single quote or something like that causing the problem, but with a recent set of files I haven't been able to find such an issue. What can I do to get around this problem? I can use perl, also....

Thanks,
Sean
?readLines

I'm sure Perl will do nicely, but you can also use readLines() and then grep() or regexpr() on the result in R, as you would in Perl, to find where the problem lies. ?nchar can also help to find a non-printing character that may be messing you up. It's no fun, I know. Excel files can be a particular pain, especially in their handling of missings.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
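A minimal sketch of that approach, assuming the file and field count from the original post ('junk.txt.txt', 88 tab-separated fields, so 87 tabs on a well-formed line):

lines <- readLines("junk.txt.txt")
## Lines containing a stray single quote, which read.table()
## treats as a quote character by default:
grep("'", lines)
## Lines whose tab count differs from the expected 87
## (the 10 skipped preamble lines will show up too; ignore them):
ntabs <- nchar(lines) - nchar(gsub("\t", "", lines, fixed = TRUE))
which(ntabs != 87)
## Lines containing non-printing characters other than tab:
grep("[^[:print:]\t]", lines)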
Hi Sean,

This is only a shot in the dark, but your description has reminded me of similar messes in files which have been exported from Excel. What I have often done in such cases, to check (e.g.) the numbers of fields in records (using 'awk' on Linux), is on the following lines:

awk 'BEGIN{FS="\t"} {print NF}' filename | sort -n | uniq

If there are varying numbers of fields, two or more different numbers will be printed instead of the single value there should be. If you know how many fields to expect (e.g. 88), then you can find the line numbers of the offending records with something like

awk 'BEGIN{FS="\t"} {if(NF!=88){print NR}}' filename

In data files with a lot of fields per line, doing it this kind of way is vastly superior to trying to spot the problem by eye -- it's extremely difficult to count 88 tab-separated fields on screen!

Hoping this helps! If not, supply further details and we'll see what we can think up.

Best wishes,
Ted.
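For anyone without awk to hand, a rough R translation of the same idea (a sketch; "filename" and the expected count of 88 are taken from the examples above):

nf <- sapply(strsplit(readLines("filename"), "\t", fixed = TRUE), length)
unique(nf)       # distinct field counts -- should be a single value
which(nf != 88)  # line numbers of the offenders, like awk's NR

One caveat: strsplit() does not count a trailing empty field, so a line that merely ends in a tab will look one field short here; count.fields() counts such fields.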
Maybe the 'fill' argument of read.table() is the solution. Its default value is FALSE in read.table(), so any line that does not have the same number of fields as the first (non-skipped) line will cause problems. If set to TRUE, as in read.delim() and read.csv(), lines with fewer fields get blank fields added at the end.

When exporting tab-delimited text files from Excel, lines with empty cells at the end often end up with fewer fields than the header line in the text file. Reading them with read.delim() fixes that.

If the problem is more complicated, you probably need to find the offending lines with count.fields() and correct them manually. You can find them (actually the line numbers) with something like

cf <- count.fields('data.txt', sep="\t")
which(cf != cf[1])

assuming that the first line has the correct number of fields.

Tilo
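A sketch of the fill approach applied to the call from the original post (the file name and other arguments are Sean's; quote="" is an extra, optional guard, since a stray single quote would otherwise start a quoted string and swallow the tabs inside it):

a <- read.table('junk.txt.txt', header=TRUE, skip=10, sep="\t",
                fill=TRUE, quote="")
## read.delim() already defaults to header=TRUE, sep="\t" and fill=TRUE:
a <- read.delim('junk.txt.txt', skip=10, quote="")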
In addition to other suggestions made, note also count.fields().

> cat("10 9 17 # First of 7 lines", "11 13 1 6", "9 14 16",
+     "12 15 14", "8 15 15", "9 13 12", "7 14 18",
+     file="oneBadRow.txt", sep="\n")
> nfields <- count.fields("oneBadRow.txt")
> nfields
[1] 3 4 3 3 3 3 3
> table(nfields)    ## Use with many records
nfields
3 4
6 1
> tab <- table(nfields)
> (1:length(nfields))[nfields == 4]
[1] 2
> readLines("oneBadRow.txt", n=-1)[2]
[1] "11 13 1 6"

Note the various option settings for count.fields().

John Maindonald            email: john.maindonald at anu.edu.au
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27),
Australian National University, Canberra ACT 0200.
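The indexing step above can also be written with which(), which returns the offending line numbers directly (same nfields vector as in the transcript, so the result is identical):

> which(nfields == 4)
[1] 2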