thr3ads.net - R help - [R] Fwd: Reading very large text files into R [Sep 2022]

Nick Wray

2022-Sep-29 14:51 UTC

[R] Fwd: Reading very large text files into R

---------- Forwarded message ---------
From: Nick Wray <nickmwray at gmail.com>
Date: Thu, 29 Sept 2022 at 15:32
Subject: Re: [R] Reading very large text files into R
To: Ben Tupper <btupper at bigelow.org>


Hi Ben,
Below is an example of the text (also attached as a text file).  It's the
"B", of which there are quite a few scattered throughout the text doc, that
causes the error message on reading in.  (Btw, I don't need the "RAIN"
column, the 1's after it, or the last four elements.)

1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 226918, RAIN, 1, 1, WAHRAIN, 5124, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 228562, RAIN, 1, 1, WAHRAIN, 491, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 231581, RAIN, 1, 1, WAHRAIN, 5213, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 232671, RAIN, 1, 1, WAHRAIN, 487, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 232913, RAIN, 1, 1, WAHRAIN, 5243, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B
1980-01-01 10:00, 234682, RAIN, 1, 1, WAHRAIN, 5271, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 235389, RAIN, 1, 1, WAHRAIN, 5279, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 236466, RAIN, 1, 1, WAHRAIN, 497, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 243350, RAIN, 1, 1, SREW, 484, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 243350, RAIN, 1, 1, WAHRAIN, 484, 1001, 0, 0, 9, 9, , ,

Thanks Nick

On Thu, 29 Sept 2022 at 15:12, Ben Tupper <btupper at bigelow.org> wrote:
> Hi Nick,
>
> It's hard to know without seeing at least a snippet of the data.
> Could you do the following and paste the result into a plain text
> email?  If you don't set your email client to plain text (from rich
> text or html) then we are apt to see a jumble of output on our email
> clients.
>
>
> ## start
> x <- readLines(filename, n = 20)
> cat(x, sep = "\n")
> ## end
>
> Cheers,
> Ben
>
>
> On Thu, Sep 29, 2022 at 9:54 AM Nick Wray <nickmwray at gmail.com> wrote:
> >
> > Hello   I may be offending the R purists with this question but it is
> > linked to R, as will become clear.  I have very large data sets from the
> > UK Met Office in notepad form.  Unfortunately, I can't read them directly
> > into R because, for some reason, although most lines in the text doc
> > consist of 15 elements, every so often there is a sixteenth one and R
> > doesn't like this and gives me an error message because it has assumed
> > that every line has 15 elements and doesn't like finding one with more.
> > I have tried playing around with the text document, inserting an extra
> > element into the top line etc, but to no avail.
> >
> > Also unfortunately you need access permission from the Met Office to get
> > the files in question so this link probably won't work:
> >
> > https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1
> >
> > So what I have done is simply to copy and paste the text docs into Excel
> > csv and then read them in, which is time-consuming but works.  However,
> > the later datasets are over the Excel limit of 1048576 lines.  I can
> > paste in the first 1048576 lines but then trying to isolate the remainder
> > of the text doc to paste it into a second csv doc is proving v difficult;
> > the only way I have found is to scroll down by hand and that's taking
> > ages.  I cannot find another way of editing the notepad text doc to get
> > rid of the part which I have already copied and pasted.
> >
> > Can anyone help with a) ideally being able to simply read the text tables
> > into R, or b) suggest a way of editing out the bits of the text file I
> > have already pasted in without laborious scrolling?
> >
> > Thanks Nick Wray
>
>
>
> --
> Ben Tupper (he/him)
> Bigelow Laboratory for Ocean Science
> East Boothbay, Maine
> http://www.bigelow.org/
> https://eco.bigelow.org
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sample text.txt
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20220929/feb43583/attachment.txt>
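
On the columns described in this message as not needed: a minimal sketch of dropping them at read time with the `colClasses` argument of `read.table`. It assumes every line has exactly 15 comma-separated fields, as in the snippet; the temporary file and the choice of columns 3, 4, 5 and 12 to 15 are illustrative assumptions, not something checked against the real Met Office files.

```r
# Two lines copied from the snippet, written to a temporary file that
# stands in for the real data (an assumption for illustration).
lines <- c(
  "1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,",
  "1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# One entry per column: NA lets read.table guess the type,
# "NULL" tells it to skip that column entirely.
col_classes <- rep(NA_character_, 15)
col_classes[c(3, 4, 5, 12:15)] <- "NULL"

dat <- read.table(tmp, header = FALSE, sep = ",", quote = "",
                  strip.white = TRUE, colClasses = col_classes)
str(dat)
```

Skipped columns are never stored, so this may also reduce memory use on the larger files.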

Enrico Schumann

2022-Sep-29 15:14 UTC

[R] Fwd: Reading very large text files into R

On Thu, 29 Sep 2022, Nick Wray writes:
> [...]
Maybe I have missed it, but could you please show how
you tried to read the table?

When I use your file with 

    read.table("sample text.txt", header = FALSE, sep = ",")

I get

    ##                  V1     V2    V3 V4 V5       V6   V7   V8 V9 V10   V11 V12 V13 V14 V15
    ## 1  1980-01-01 10:00 225620  RAIN  1  1  WAHRAIN 5091 1001  0  NA     9   0  NA  NA
    ## 2  1980-01-01 10:00 226918  RAIN  1  1  WAHRAIN 5124 1001  0  NA     9   0  NA  NA
    ## ## .....
    ## 7  1980-01-01 10:00 234362  RAIN  1  1  WAHRAIN 5265 1001  0  NA 10009   0  NA  NA   B
    ## 8  1980-01-01 10:00 234682  RAIN  1  1  WAHRAIN 5271 1001  0  NA     9   0  NA  NA
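
In the sample, every line appears to have 15 fields, including the one ending in B (the trailing comma on the other lines yields an empty fifteenth field), which is presumably why the call above reads it without complaint. `read.table` decides the number of columns from the first five lines unless a longer `col.names` is given, so a file where a later line really has a sixteenth field would need more help. A minimal sketch of one possible approach, using `count.fields` to find the widest line and `fill = TRUE` to pad the rest; the 16-field line and its value `X` are invented for illustration.

```r
# Toy data standing in for the real file; the second line has an
# invented 16th field ("X") to mimic the problem described.
lines <- c(
  "1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,",
  "1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B, X",
  "1980-01-01 10:00, 234682, RAIN, 1, 1, WAHRAIN, 5271, 1001, 0, , 9, 0, , ,"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Count comma-separated fields on every line and take the maximum.
n_fields <- count.fields(tmp, sep = ",", quote = "")
max_fields <- max(n_fields)

# Naming as many columns as the widest line has, together with
# fill = TRUE, pads the shorter lines instead of raising an error.
dat <- read.table(tmp, header = FALSE, sep = ",", quote = "",
                  fill = TRUE, strip.white = TRUE,
                  col.names = paste0("V", seq_len(max_fields)))

table(n_fields)  # how many lines have each field count
```

Running `table(n_fields)` on the real file first would also show whether any line actually departs from 15 fields.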



-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
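
For part b) of the original question (getting at the lines beyond the first 1048576 without scrolling), `read.table` can skip lines itself through its `skip` and `nrows` arguments, so the text file need not be edited at all. A minimal sketch on a ten-line toy file; the station numbers 1001 to 1010 are made up, and on the real data `skip` would be the number of lines already handled.

```r
# Ten toy lines standing in for a large file (values are made up).
lines <- sprintf(
  "1980-01-01 10:00, %d, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,",
  1001:1010
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Skip the first 4 lines, then read only the next 3.
part <- read.table(tmp, header = FALSE, sep = ",", quote = "",
                   strip.white = TRUE, skip = 4, nrows = 3)
part$V2
```

Leaving out `nrows` reads everything after the skipped lines.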
