Hello, all.

I am working on a project with a large (~350 MB, about 5800 rows) insurance claims dataset. It was supplied in a tilde (~) delimited format. I imported it into a data frame in R by setting memory.limit to the maximum (4 GB) for my computer and using read.table.

The resulting data frame had 10 bad rows. The errors appear to be due to read.table missing delimiter characters, so that multiple data are imported into the same cell, and the remainder of the row and the next row run together and are garbled by the resulting frame shift (example: a single cell might contain <datum>~ ~ <datum> ~<datum>, after which all the cells of that row and the next are wrong).

To replicate, I tried the same import procedure on a smaller demographics dataset from the same supplier (only about 1 MB) and got the same kinds of errors (5 bad rows in about 3500). I also imported as much of the file as Excel would hold and cross-checked: Excel did not produce the same errors, but it can't handle the entire file. I have used read.table on a number of other formats (mainly CSV and tab-delimited) without such problems, so something about these files in particular seems to produce the errors, but I can't see what it would be.

Does anyone have any thoughts about what is going wrong? And is there a way, short of manual correction, to fix it?

Thanks for all help,
~Pat.

Pat Carroll.
what matters most is how well you walk through the fire.
bukowski.
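A minimal sketch of the import described above; the file name is hypothetical and the exact arguments are assumptions, not the poster's actual code:

    # Windows-only: raise R's memory cap toward 4 GB (size is in MB)
    memory.limit(size = 4095)
    # "claims.txt" is a placeholder name for the tilde-delimited file
    claims <- read.table("claims.txt", sep = "~", header = TRUE)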
On 16-Jul-07 14:13:09, Pat Carroll wrote:
> I am working on a project with a large (~350 MB, about 5800 rows)
> insurance claims dataset. It was supplied in a tilde (~) delimited
> format. I imported it into a data frame in R by setting memory.limit
> to the maximum (4 GB) for my computer and using read.table.

I had a similar problem put to me some time back, and eventually solved it by going in with a scalpel. It turned out that there was a problem with muddling "End-of-Line" with the field delimiter in creating the file. And the file came out of Excel in the first place ... (did yours?). Quite why Excel made this particular mess of it remains a mystery.

I note that your file size is "350 MB" and "about 5800 rows". Doing some arithmetic on that:

  350 * 1024 * 1024 = 367,001,600 bytes
  367,001,600 / 5800 = 63,276.14 bytes per row

This (given your "about"s) looks to me dangerously close to 65536 = 64K, which may be a limit on what Excel can handle. Just a thought ...

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Jul-07       Time: 15:59:48
------------------------------ XFMail ------------------------------
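The same arithmetic, as a quick check at the R prompt:

    350 * 1024^2           # 367001600 bytes in 350 MB
    350 * 1024^2 / 5800    # 63276.14 bytes per row
    2^16                   # 65536 -- the 64K limit mentioned above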
Whenever I've had this kind of problem it has been either:

1) the input data file is "corrupt", by which I mean not all lines have the same number of fields, or

2) I have mis-specified one of the arguments to read.table() (usually comment.char or quote).

Use count.fields() on an offending file to find out whether all records have the same number of fields (see the sketch below). If they don't, then look carefully at the ones that don't to see how they depart from the assumption that every row has the same number of delimiters. Check for "non-standard" characters like control character sequences.

I don't know what you mean by "read.table missing delimiter characters". If the delimiters are there, read.table will see them. But if they're inside quotes (the 'quote' argument of read.table) or after a comment character (the 'comment.char' argument), for example, I wouldn't expect them to be interpreted as delimiters.

If you were to edit one of the data files outside of R, changing the delimiters from tilde to something else, maybe TAB, and find that it then reads correctly, there might be an issue with read.table(). Unlikely, though.

If you can find the offending rows, put them into a separate file and import them into Excel, or into a text editor that shows everything, and maybe the cause will become obvious.

-Don

At 7:13 AM -0700 7/16/07, Pat Carroll wrote:
>I am working on a project with a large (~350 MB, about 5800 rows)
>insurance claims dataset. It was supplied in a tilde (~) delimited
>format. [...]

--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
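A sketch of that count.fields() check; the file name is hypothetical, and quote and comment.char are disabled so that nothing can hide the delimiters:

    # count fields per record in the raw file
    n <- count.fields("claims.txt", sep = "~", quote = "", comment.char = "")
    table(n)                  # distribution of field counts across records
    which(n != median(n))     # line numbers of the suspect records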
On Mon, 16 Jul 2007, Pat Carroll wrote:
> I have used read.table on a number of other formats (mainly CSV and
> tab-delimited) without such problems, so something about these files
> in particular seems to produce the errors, but I can't see what it
> would be.

The usual cause is that the user forgot about quotes and comment characters. Try

  quote = "", comment.char = ""

If that does not work, please follow the request in the footer of every message on this list.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
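Spelled out, with a hypothetical file name, the suggested call is:

    # disable quoting and comment handling so every ~ is seen as a delimiter
    claims <- read.table("claims.txt", sep = "~", header = TRUE,
                         quote = "", comment.char = "")

With read.table's defaults (quote = "\"'", comment.char = "#"), a stray quote or # in a free-text claims field would swallow the tildes that follow it, which would explain the frame-shift symptom described in the original post.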