Hello, all.

I am working on a project with a large (~350 MB, about 5800 rows) insurance claims dataset. It was supplied in a tilde (~) delimited format. I imported it into a data frame in R by setting memory.limit to the maximum (4 GB) for my computer and using read.table.

The resulting data frame had 10 bad rows. The errors appear to be due to read.table missing delimiter characters, so that multiple data are imported into the same cell, and the remainder of the row and the next row run together and are garbled by the resulting frame shift (example: a single cell might contain <datum>~ ~ <datum> ~<datum>, after which all the cells of that row and the next are wrong).

To replicate, I tried the same import procedure on a smaller demographics dataset from the same supplier (only about 1 MB) and got the same kinds of errors (5 bad rows in about 3500). I also imported as much of the file as Excel would hold and cross-checked: Excel did not produce the same errors, but it can't handle the entire file. I have used read.table on a number of other formats (mainly CSV and tab-delimited) without such problems, so something about these files in particular seems to produce the errors, but I can't see what it would be.

Does anyone have any thoughts about what is going wrong? And is there a way, short of manual correction, to fix it?

Thanks for all help,
~Pat.

Pat Carroll.
what matters most is how well you walk through the fire.
bukowski.
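A minimal sketch of the import described above; the file name is hypothetical and the exact arguments are assumptions, not the poster's actual code:

    # Windows-only: raise R's memory cap toward 4 GB (size is in MB)
    memory.limit(size = 4095)
    # "claims.txt" is a placeholder name for the tilde-delimited file
    claims <- read.table("claims.txt", sep = "~", header = TRUE)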
On 16-Jul-07 14:13:09, Pat Carroll wrote:
> I am working on a project with a large (~350 MB, about 5800 rows)
> insurance claims dataset. It was supplied in a tilde (~) delimited
> format. I imported it into a data frame in R by setting memory.limit
> to the maximum (4 GB) for my computer and using read.table.

I had a similar problem put to me some time back, and eventually solved it by going in with a scalpel. It turned out that there was a problem with muddling "End-of-Line" with the field delimiter in creating the file. And the file came out of Excel in the first place ... (did yours?). Quite why Excel made this particular mess of it remains a mystery.

I note that your file size is "350 MB" and "about 5800 rows". Doing some arithmetic on that:

  350 * 1024 * 1024 = 367,001,600 bytes
  367,001,600 / 5800 = 63,276.14 bytes per row

This (given your "about"s) looks to me dangerously close to 65536 = 64K, which may be a limit on what Excel can handle. Just a thought ...

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Jul-07       Time: 15:59:48
------------------------------ XFMail ------------------------------
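The same arithmetic, as a quick check at the R prompt:

    350 * 1024^2           # 367001600 bytes in 350 MB
    350 * 1024^2 / 5800    # 63276.14 bytes per row
    2^16                   # 65536 -- the 64K limit mentioned above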
Whenever I've had this kind of problem it has been either:

1) the input data file is "corrupt", by which I mean not all lines have the same number of fields, or

2) I have mis-specified one of the arguments to read.table() (usually comment.char or quote).

Use count.fields() on an offending file to find out whether all records have the same number of fields (see the sketch below). If they don't, then look carefully at the ones that don't to see how they depart from the assumption that every row has the same number of delimiters. Check for "non-standard" characters like control character sequences.

I don't know what you mean by "read.table missing delimiter characters". If the delimiters are there, read.table will see them. But if they're inside quotes (the 'quote' argument of read.table) or after a comment character (the 'comment.char' argument), for example, I wouldn't expect them to be interpreted as delimiters.

If you were to edit one of the data files outside of R, changing the delimiters from tilde to something else, maybe TAB, and find that it then reads correctly, there might be an issue with read.table(). Unlikely, though.

If you can find the offending rows, put them into a separate file and import them into Excel, or into a text editor that shows everything, and maybe the cause will become obvious.

-Don

At 7:13 AM -0700 7/16/07, Pat Carroll wrote:
>I am working on a project with a large (~350 MB, about 5800 rows)
>insurance claims dataset. It was supplied in a tilde (~) delimited
>format. [...]

--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
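A sketch of that count.fields() check; the file name is hypothetical, and quote and comment.char are disabled so that nothing can hide the delimiters:

    # count fields per record in the raw file
    n <- count.fields("claims.txt", sep = "~", quote = "", comment.char = "")
    table(n)                  # distribution of field counts across records
    which(n != median(n))     # line numbers of the suspect records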
On Mon, 16 Jul 2007, Pat Carroll wrote:
> I have used read.table on a number of other formats (mainly CSV and
> tab-delimited) without such problems, so something about these files
> in particular seems to produce the errors, but I can't see what it
> would be.

The usual cause is that the user forgot about quotes and comment characters. Try

  quote = "", comment.char = ""

If that does not work, please follow the request in the footer of every message on this list.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
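Spelled out, with a hypothetical file name, the suggested call is:

    # disable quoting and comment handling so every ~ is seen as a delimiter
    claims <- read.table("claims.txt", sep = "~", header = TRUE,
                         quote = "", comment.char = "")

With read.table's defaults (quote = "\"'", comment.char = "#"), a stray quote or # in a free-text claims field would swallow the tildes that follow it, which would explain the frame-shift symptom described in the original post.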