thr3ads.net - R help - [R] Speeding reading of a large file [Dec 2012]

Fisher Dennis

2012-Dec-03 22:49 UTC

[R] Speeding reading of a large file

Colleagues,  

This past week, I asked the following question:

	I have a file that looks like this:

	TABLE NO.  1
	 PTID        TIME        AMT         FORM        PERIOD      IPRED       CWRES       EVID        CP          PRED        RES         WRES
	  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
	  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00
	  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00
	  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00
	  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00
	  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00
	  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00
	  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00

	The file is reasonably large (> 10^6 lines) and the two-line header is
repeated periodically in the file.
	I need to read this file in as a data frame.  Note that the number of columns,
the column headers, and the number of replicates of the headers are not known in
advance.

I received a number of replies, many of them quite useful.  Of these, one beat
out all the others in my benchmarking using files ranging from 10^5 to 10^6
lines.
That version, provided by Jim Holtman, was:
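	# Skip the first "TABLE NO." line; as.is keeps text as character, fill pads short rows
	x		<- read.table(FILE, as.is = TRUE, skip=1, fill=TRUE, header = TRUE)
	# Convert every column to numeric; text from repeated headers becomes NA (with a warning)
	x[]		<- lapply(x, as.numeric)
	# Drop the rows that came from the repeated header lines
	x		<- x[!is.na(x[,1]), ]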
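
To see the three steps at work, here is a minimal sketch on a tiny made-up file. The file contents, the temporary path, and the choice of four columns are assumptions for illustration; only the read.table / as.numeric / indexing steps come from the post.

```r
# Build a small file in which the two-line header appears twice (made-up data).
block <- c("TABLE NO.  1",
           " PTID        TIME        AMT         FORM",
           "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00",
           "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00")
FILE <- tempfile(fileext = ".txt")
writeLines(c(block, block), FILE)

# Step 1: skip the first "TABLE NO." line and read everything as text.
x <- read.table(FILE, as.is = TRUE, skip = 1, fill = TRUE, header = TRUE)

# Step 2: coerce each column; header text turns into NA.
# suppressWarnings() hides the expected "NAs introduced by coercion" warning.
x[] <- suppressWarnings(lapply(x, as.numeric))

# Step 3: keep only rows whose first column parsed as a number.
x <- x[!is.na(x[, 1]), ]

str(x)
```

The repeated "TABLE NO." and column-name lines are read as ordinary rows of text, which is why the filtering step is needed after the numeric conversion.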

Other versions involved readLines, followed by edits, followed by cat (or
write) to a temp file, then read.table again.
The overhead of invoking readLines, write/cat, and read.table was
substantially larger than that of the read.table / as.numeric / indexing strategy.
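
For comparison, the slower family of approaches might look roughly like the sketch below. The code from the other replies is not reproduced in this post, so the pattern used to find header lines, the helper name read_via_readLines, and the sample file are all assumptions.

```r
# Hypothetical reconstruction of the readLines / write / read.table strategy.
read_via_readLines <- function(FILE) {
  lines  <- readLines(FILE)
  header <- lines[2]                       # second line holds the column names
  # Drop every "TABLE NO." line and every copy of the column-name line.
  keep   <- !grepl("^[[:space:]]*TABLE NO\\.", lines) & lines != header
  tmp    <- tempfile()
  on.exit(unlink(tmp))
  writeLines(c(header, lines[keep]), tmp)  # one header, then data rows only
  read.table(tmp, header = TRUE)
}

# Made-up sample file with the two-line header repeated.
block <- c("TABLE NO.  1",
           " PTID        TIME        AMT         FORM",
           "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00",
           "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00")
FILE <- tempfile(fileext = ".txt")
writeLines(c(block, block), FILE)

y <- read_via_readLines(FILE)
str(y)
```

This version touches the data three times (read as text, write back out, parse again), which is consistent with the extra overhead reported above.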

Thanks for the input from many folks.

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

Juliet Hannah

2012-Dec-06 16:24 UTC

[R] Speeding reading of a large file

All,

Can someone describe what

 x[]             <- lapply(x, as.numeric)

does? I see that it is putting the list elements into a data frame. The
result of lapply is a list, so how does this become
a data frame?

Thanks,

Juliet

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Rui Barradas

2012-Dec-06 16:53 UTC

[R] Speeding reading of a large file

Hello,

Because x[] keeps the dimensions, unlike just x.

Hope this helps,

Rui Barradas
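
A small illustration of that difference, using a made-up two-column data frame (the name df and its contents are assumptions): assigning to x[] replaces the columns inside the existing data frame, while assigning to the bare name rebinds it to whatever lapply returns, which is a plain list.

```r
df <- data.frame(a = c("1", "2"), b = c("3", "x"), stringsAsFactors = FALSE)

# Assigning to x[] fills the existing data frame column by column.
x <- df
x[] <- suppressWarnings(lapply(x, as.numeric))
class(x)   # "data.frame"
dim(x)     # 2 rows, 2 columns

# Assigning to the bare name replaces the whole object with the list.
y <- df
y <- suppressWarnings(lapply(y, as.numeric))
class(y)   # "list"
dim(y)     # NULL
```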

Juliet Hannah

2012-Dec-06 17:39 UTC

[R] Speeding reading of a large file

Thanks, it does help. Is it possible to elaborate on specifically why
this syntax preserves dimensions? Is it correct to just say that even though
lapply returns a list, x[] forces x to have the
same dimensions?
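
One way to look at this, offered as an illustration rather than as an answer given in the thread: an empty subscript means "all elements", so x[] <- value writes new contents into the object that already exists, and that object keeps its attributes (dimensions, names, class). The same behaviour can be seen with an ordinary matrix; the names kept and replaced below are made up for the example.

```r
m <- matrix(1:6, nrow = 2)

# Empty subscript: every element is overwritten, the matrix itself remains.
kept <- m
kept[] <- 0
dim(kept)        # still 2 rows and 3 columns

# No subscript: the name now refers to a new length-one vector.
replaced <- m
replaced <- 0
dim(replaced)    # NULL
```

With a data frame the values come from a list instead of a single number, and each list element fills the corresponding column.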

