avi.e.gross at gmail.com
2022-Sep-30 19:15 UTC
[R] Reading very large text files into R
Tim and others,

A point to consider is that the functions used to read formatted data into data.frame form use various algorithms, and those algorithms differ. Some do a look-ahead of some fixed size to decide things: if a column LOOKS LIKE all integers for, say, the first thousand lines, they go ahead and read that column as integer. If the first floating-point value appears thousands of lines further along, things may go wrong.

So a file in which the extra 16th entry/column already appears by line/row 16 may work fine with an algorithm that looks ahead and concludes there are 16 columns throughout. But a file where the sixteenth entry is first seen at line/row 31,459 may well have already set the algorithm to expect exactly 15 columns, which is then surprised as noted above.

I have stayed out of this discussion, and others have supplied pretty much what I would have said. I also see the data as flawed and would ask which rows are the valid ones. If a sixteenth column is allowed, it would be better if all other rows had an empty sixteenth column; if it is not allowed, no row should have it.

The approach I might take, again as others have noted, is to preprocess the data file with some form of stream editor such as AWK, which automagically reads one line at a time and parses it into a collection of tokens based on whatever separates them, such as a comma. You can then either write out just the first 15 tokens, if your choice is to ignore a spurious sixteenth, or write out sixteen for every line, with the last being some form of null most of the time. To be more general, you could make two passes through the file: a first pass that determines the maximum number of entries as well as the most common number of entries, and a second pass that uses that information to normalize the file the way you want. Some of what was mentioned earlier could also be done in this preprocessing, such as removing any columns you do not want to read into R later. Do note that such filters may need to handle edge cases like skipping comment lines or treating the header row differently.

As some have shown, you can also create your own filters within a language like R and either read in lines and pre-process them as discussed, or continue on to build your own data.frame and skip the read.table() type of functionality entirely. For very large files, though, having multiple variations in memory at once may be an issue, especially if they are not removed while further processing and analysis continue.

Perhaps it would also be sensible to contact those maintaining the data, point out the anomaly, and ask whether their files might be saved in an alternative format that can be used without anomalies.

Avi

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Ebert,Timothy Aaron
Sent: Friday, September 30, 2022 7:27 AM
To: Richard O'Keefe <raoknz at gmail.com>; Nick Wray <nickmwray at gmail.com>
Cc: r-help at r-project.org
Subject: Re: [R] Reading very large text files into R

Hi Nick,
Can you post one line of data with 15 entries followed by the next line of data with 16 entries?

Tim
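As a concrete illustration of the filter Avi describes, here is a minimal sketch in R rather than AWK (the file name "rain.txt", the comma separator, and the helper name normalize_file are assumptions for illustration, not details from the thread). It makes the two passes in memory, so a truly huge file would still be better served by a streaming tool or chunked reads.

## Pass 1: find the widest row; pass 2: pad every row to that width.
normalize_file <- function(infile, outfile, sep = ",") {
  lines  <- readLines(infile)                    # whole file in memory; fine for a sketch
  tokens <- strsplit(lines, sep, fixed = TRUE)   # split each line into fields
  n      <- lengths(tokens)                      # fields per line
  width  <- max(n)                               # e.g. 16 if any row carries the extra entry
  padded <- vapply(tokens, function(x)
    paste(c(x, rep("", width - length(x))), collapse = sep),
    character(1))                                # pad short rows with empty trailing fields
  writeLines(padded, outfile)
  invisible(table(n))                            # report how many rows had each width
}

## Hypothetical usage:
## normalize_file("rain.txt", "rain_clean.txt")
## dat <- read.table("rain_clean.txt", sep = ",", header = FALSE)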
Hello,

Thanks again for all the suggestions. The irony is that for the datasets I'm using, the fill=T suggested by Ivan in the first instance seems to work fine. They're not particularly sophisticated datasets, and although I don't know what the extra Bs actually mean (the first one, as Avi says, does occur quite late on), I don't really need to know - all I need is the date/time/station id/rainfall accumulation, and that's obvious once I've read the dataset in.

It has been interesting seeing the takes of people who have a far deeper and wider understanding of R than I do, however - an education in itself...

Nick
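A minimal sketch of the read Nick describes (the file name and the 16-column width are assumptions; forcing the width with col.names is an extra safeguard taken from ?read.table, since read.table otherwise infers the column count from only the first five lines of input):

dat <- read.table("rain.txt", sep = ",", header = FALSE,
                  fill = TRUE,                    # pad rows that have fewer fields
                  col.names = paste0("V", 1:16))  # force 16 columns even if the 16th first appears late
## Rows with only 15 fields get an empty 16th column; rows with 16 fields fit as-is.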
The point was more to figure out why most lines have 15 values while some give an error indicating that there are 16. Are there notes, or an extra comma? Some weather stations fail and give interesting data at, before, or after failure. Are the problem lines indicating machine failure? Code does not typically enter extra data at random.

Most answers appear to assume that the 16th value has been appended at the end of the row, but no evidence indicates this is true. If the extra value was instead inserted at the beginning of the row, then deleting the "16th" value from the end leaves every value in that row in error. I am just paranoid enough to suggest looking at one case to make sure all is as assumed.

Another way to address the problem is to test the data. Are there temperatures less than -100 C or greater than 60 C? Why would one ever get such a thing? Machine error, or a column misaligned so that humidity values end up in the temperature column.

Tim
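To make Tim's range test concrete, a hedged sketch in R (the data frame name dat, the column name temp, and the -100 to 60 degree bounds are hypothetical; Nick's files hold rainfall accumulations, so substitute whatever variable and plausible range apply):

## Flag physically implausible readings after the file has been read in.
suspect <- which(dat$temp < -100 | dat$temp > 60)
length(suspect)            # how many rows look like machine error or shifted columns?
head(dat[suspect, ])       # inspect a few of them by eye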
Those are valid reasons: examining the data and cleaning or fixing it is a major task before any analysis or plots. Indeed, an extra column caused by something in an earlier column may have messed up all the columns to its right. My point was that replicating a problem like this may require many more lines from the file.
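To locate the anomaly Avi refers to, and to pull out the pair of lines Tim asked for without posting the whole file, a sketch using count.fields() (the file name "rain.txt", the comma separator, and the output file name are assumptions):

n <- count.fields("rain.txt", sep = ",")        # fields per physical line
table(n)                                        # distribution of row widths (15 vs 16)
first_bad <- which(n == 16)[1]                  # first row carrying the extra entry
writeLines(readLines("rain.txt")[c(first_bad - 1, first_bad)],
           "repro_pair.txt")                    # the preceding 15-field line plus the 16-field line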