Hi Nick,

Can you post one line of data with 15 entries followed by the next line of data with 16 entries?

Tim

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Richard O'Keefe
Sent: Friday, September 30, 2022 12:08 AM
To: Nick Wray <nickmwray at gmail.com>
Cc: r-help at r-project.org
Subject: Re: [R] Reading very large text files into R

If I had this problem, in the old days I'd have whipped up a tiny AWK script. These days I might use xsv or qsv. BUT first I would want to know why these extra fields are present and what they signify. Are they good data that happen not to be described in the documentation? Do they represent a defect in the generation process? What other discrepancies are there? If the data *format* cannot be fully trusted, what does that say about the data *content*? Do other data sets from the same source have the same issue? Is it possible to compare this version of the data with an earlier version?

On Fri, 30 Sept 2022 at 02:54, Nick Wray <nickmwray at gmail.com> wrote:
> Hello. I may be offending the R purists with this question, but it is linked to R, as will become clear. I have very large data sets from the UK Met Office in plain-text form. Unfortunately, I can't read them directly into R: although most lines in the text file consist of 15 elements, every so often there is a sixteenth one, and R gives an error because, having assumed that every line has 15 elements, it doesn't like finding one with more. I have tried playing around with the text file, inserting an extra element into the top line etc., but to no avail.
>
> Also, unfortunately, you need access permission from the Met Office to get the files in question, so this link probably won't work:
>
> https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1
>
> So what I have done is simply to copy and paste the text files into Excel CSVs and then read those in, which is time-consuming but works. However, the later datasets are over the Excel limit of 1048576 lines. I can paste in the first 1048576 lines, but then isolating the remainder of the text file to paste into a second CSV is proving very difficult - the only way I have found is to scroll down by hand, and that is taking ages. I cannot find another way of editing the text file to remove the part I have already copied and pasted.
>
> Can anyone help with a) ideally, a way to simply read the text tables into R, or b) failing that, a way of editing out the bits of the text file I have already pasted in without laborious scrolling?
> Thanks, Nick Wray
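For reference, option a) is feasible in base R: read.table() will pad short rows if told how many columns to expect. A minimal sketch, assuming a comma-separated file with at most 16 fields per row; the file name "metoffice.txt" is a placeholder, not the actual CEDA file:

  ## Survey the file first: how many fields does each line really have?
  nf <- count.fields("metoffice.txt", sep = ",")
  table(nf)            # e.g. mostly 15, occasionally 16
  which(nf == 16)      # line numbers of the odd rows, worth inspecting

  ## Force 16 columns; rows with only 15 fields are padded with NA.
  dat <- read.table("metoffice.txt", sep = ",",
                    col.names = paste0("V", 1:16),
                    fill = TRUE)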
avi.e.gross at gmail.com
2022-Sep-30 19:15 UTC
[R] Reading very large text files into R
Tim and others,

A point to consider is that the functions used to read formatted data into a data.frame use different algorithms, and those algorithms vary in how they infer structure. Some do a look-ahead of some size to determine column types: if a column LOOKS LIKE all integers for, say, the first thousand lines, they read that column in as integer, and if the first floating-point value is thousands of lines further along, things may go wrong. So asking line/row 16 to carry an extra sixteenth entry/column may work fine for an algorithm that looks far enough ahead and concludes there are 16 columns throughout. Yet a file whose first sixteenth entry appears at line/row 31,459 may well have taught the algorithm to expect exactly 15 columns, and it is then surprised, as noted above.

I have stayed out of this discussion, and others have supplied pretty much what I would have said. I also see the data as flawed and would ask which rows are the valid ones. If a sixteenth column is allowed, it would be better if all other rows had an empty sixteenth column; if it is not allowed, none should have it.

The approach I might take, again as others have noted, is to preprocess the data file with some form of stream editor, such as AWK, that automagically reads one line at a time and parses it into a collection of tokens based on whatever separates them, such as a comma. You can then either write out just the first 15 tokens, if your choice is to ignore a spurious sixteenth, or write out all sixteen for every line, with the last being some form of null most of the time.

And, of course, to be more general, you could make two passes through the file: a first pass that determines the maximum number of entries as well as the most common number, and a second pass that uses that information to normalize the file the way you want. Some of the other suggestions could be folded into this preprocessing too, such as removing any columns you do not want to read into R later. Do note that such filters may need to handle edge cases like skipping comment lines or treating the header row differently.

As some have shown, you can also write such a filter within R itself, either reading in lines and pre-processing them as discussed, or going on to build your own data.frame and skipping the read.table() type of functionality entirely (a minimal sketch of this appears at the end of this message). For very large files, though, holding multiple variants of the data in memory at once may be an issue, especially if they are not removed while further processing and analysis continues.

Perhaps it might also be sensible to contact those maintaining the data, point out the anomaly, and ask whether their files might be saved in an alternate format that is free of it.

Avi

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Ebert, Timothy Aaron
Sent: Friday, September 30, 2022 7:27 AM
To: Richard O'Keefe <raoknz at gmail.com>; Nick Wray <nickmwray at gmail.com>
Cc: r-help at r-project.org
Subject: Re: [R] Reading very large text files into R

Hi Nick,

Can you post one line of data with 15 entries followed by the next line of data with 16 entries?

Tim
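A minimal sketch of the in-R normalizing filter described above, with the two "passes" running over the tokenized lines after a single read. The file name and the comma separator are illustrative, not taken from Nick's actual data, and for files too large to hold in memory the same logic would need to run over chunks:

  ## Pass 1: tokenize every line and survey the field counts.
  lines  <- readLines("metoffice.txt")
  tokens <- strsplit(lines, ",", fixed = TRUE)
  nf     <- lengths(tokens)
  table(nf)                      # most common width vs. maximum width

  ## Pass 2: pad every row to the maximum width with NA, then build
  ## the data.frame directly, skipping read.table() altogether.
  width  <- max(nf)
  padded <- lapply(tokens, function(x) c(x, rep(NA, width - length(x))))
  dat    <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)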