thr3ads.net - R help - [R] Speeding reading of large file [Nov 2012]

If this information is useful, please help other people find it:
Share via:

Fisher Dennis

2012-Nov-28 17:42 UTC

[R] Speeding reading of large file

R 2.15.1
OS X and Windows

Colleagues,

I have a file that looks that this:
TABLE NO.  1
 PTID        TIME        AMT         FORM        PERIOD      IPRED       CWRES  
EVID        CP          PRED        RES         WRES
  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00 
0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00 
0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00
  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00 
0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00
  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00 
0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00
  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00
  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00
  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00
  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00

The file is reasonably large (> 10^6 lines) and the two line header is
repeated periodically in the file.
I need to read this file in as a data frame.  Note that the number of columns,
the column headers, and the number of replicates of the headers are not known in
advance.

I have tried two approaches to this:
	First Approach:  
		1.  readLines(FILENAME) to read in the file
		2.  use grep to find the repeat headers; strip out the repeat headers
		3.  write() the object to tempfile, read in that temporary file using
read.table(tempfile, header=TRUE, skip=1) [an alternative is to use
textConnection but that does not appear to speed things]

	Second Approach:
		1.  TEMP	<- read.table(FILENAME, header=TRUE, skip=1, fill=TRUE,
as.is=TRUE)
		2.  get rid of the errant entries with:
			TEMP[!is.na(as.numeric(TEMP[,1])),]
		3.  reading of the character entries forced all columns to character mode. 
Therefore, I convert each column to numeric:
			for (COL in 1:ncol(TEMP)) TEMP[,COL] <- as.numeric(TEMP[,COL]) 
The second approach is ~ 20% faster than the first.  With the second approach,
the conversion to numeric occupies 50% of the elapsed time.

Is there some approach that would be much faster?  For example, would a
vectorized approach to conversion to numeric improve throughput?  Or, is there
some means to ensure that all data are read as numeric (I tried to use
colClasses but that triggered an error when the text string was encountered).

############################
A dput version of the data is:
c("TABLE NO.  1", " PTID        TIME        AMT         FORM     
PERIOD      IPRED       CWRES       EVID        CP          PRED        RES     
WRES",
"  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00 
0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00 
0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00 
0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00 
0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00"
)

This can be assembled into a large dataset and written to a file named FILENAME
with the following code:
cat(c("TABLE NO.  1", " PTID        TIME        AMT         FORM 
PERIOD      IPRED       CWRES       EVID        CP          PRED        RES     
WRES",
"  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00 
0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00 
0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00 
0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00 
0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00"
)[rep(1:10, 1000)], file="FILENAME", sep="\n")


Dennis


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

Rui Barradas

2012-Nov-28 18:21 UTC

head link

[R] Speeding reading of large file

Hello,

Try to convert back to numeric with

TEMP[] <- lapply(TEMP, as.numeric)


In a small df (1e6 rows, 2 cols) it's 3 times faster than the for loop.

Hope this helps,

Rui Barradas
Em 28-11-2012 17:42, Fisher Dennis escreveu:> R 2.15.1
> OS X and Windows
>
> Colleagues,
>
> I have a file that looks that this:
> TABLE NO.  1
>   PTID        TIME        AMT         FORM        PERIOD      IPRED      
CWRES       EVID        CP          PRED        RES         WRES
>    2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00 
0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>    2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00 
0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00
>    2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00 
0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00
>    2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00 
0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00
>    2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00
>    2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00
>    2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00
>    2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01 
0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00
>
> The file is reasonably large (> 10^6 lines) and the two line header is
repeated periodically in the file.
> I need to read this file in as a data frame.  Note that the number of
columns, the column headers, and the number of replicates of the headers are not
known in advance.
>
> I have tried two approaches to this:
> 	First Approach:
> 		1.  readLines(FILENAME) to read in the file
> 		2.  use grep to find the repeat headers; strip out the repeat headers
> 		3.  write() the object to tempfile, read in that temporary file using
read.table(tempfile, header=TRUE, skip=1) [an alternative is to use
textConnection but that does not appear to speed things]
>
> 	Second Approach:
> 		1.  TEMP	<- read.table(FILENAME, header=TRUE, skip=1, fill=TRUE,
as.is=TRUE)
> 		2.  get rid of the errant entries with:
> 			TEMP[!is.na(as.numeric(TEMP[,1])),]
> 		3.  reading of the character entries forced all columns to character
mode.  Therefore, I convert each column to numeric:
> 			for (COL in 1:ncol(TEMP)) TEMP[,COL] <- as.numeric(TEMP[,COL])
> The second approach is ~ 20% faster than the first.  With the second
approach, the conversion to numeric occupies 50% of the elapsed time.
>
> Is there some approach that would be much faster?  For example, would a
vectorized approach to conversion to numeric improve throughput?  Or, is there
some means to ensure that all data are read as numeric (I tried to use
colClasses but that triggered an error when the text string was encountered).
>
> ############################
> A dput version of the data is:
> c("TABLE NO.  1", " PTID        TIME        AMT         FORM
PERIOD      IPRED       CWRES       EVID        CP          PRED        RES     
WRES",
> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00 
0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00 
3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00 
5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00 
8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00 
1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00 
1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00 
1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00 
1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00 
0.0000E+00"
> )
>
> This can be assembled into a large dataset and written to a file named
FILENAME with the following code:
> cat(c("TABLE NO.  1", " PTID        TIME        AMT        
FORM        PERIOD      IPRED       CWRES       EVID        CP          PRED    
RES         WRES",
> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00 
0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00 
3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00 
5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00 
8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00 
1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00 
1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00 
1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00 
0.0000E+00",
> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00 
1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00 
0.0000E+00"
> )[rep(1:10, 1000)], file="FILENAME", sep="\n")
>
>
> Dennis
>
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David L Carlson

2012-Nov-28 19:32 UTC

head link

[R] Speeding reading of large file

Using your first approach, this should be faster 

raw <- readLines(con=filename)
dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE)

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Fisher Dennis
> Sent: Wednesday, November 28, 2012 11:43 AM
> To: r-help at r-project.org
> Subject: [R] Speeding reading of large file
> 
> R 2.15.1
> OS X and Windows
> 
> Colleagues,
> 
> I have a file that looks that this:
> TABLE NO.  1
>  PTID        TIME        AMT         FORM        PERIOD      IPRED
> CWRES       EVID        CP          PRED        RES         WRES
>   2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
> 0.0000E+00
>   2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
> 0.0000E+00
>   2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
> 0.0000E+00
>   2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
> 0.0000E+00
>   2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
> 0.0000E+00
>   2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
> 0.0000E+00
>   2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
> 0.0000E+00
>   2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
> 0.0000E+00
> 
> The file is reasonably large (> 10^6 lines) and the two line header is
> repeated periodically in the file.
> I need to read this file in as a data frame.  Note that the number of
> columns, the column headers, and the number of replicates of the
> headers are not known in advance.
> 
> I have tried two approaches to this:
> 	First Approach:
> 		1.  readLines(FILENAME) to read in the file
> 		2.  use grep to find the repeat headers; strip out the
> repeat headers
> 		3.  write() the object to tempfile, read in that temporary
> file using read.table(tempfile, header=TRUE, skip=1) [an alternative is
> to use textConnection but that does not appear to speed things]
> 
> 	Second Approach:
> 		1.  TEMP	<- read.table(FILENAME, header=TRUE, skip=1,
> fill=TRUE, as.is=TRUE)
> 		2.  get rid of the errant entries with:
> 			TEMP[!is.na(as.numeric(TEMP[,1])),]
> 		3.  reading of the character entries forced all columns to
> character mode.  Therefore, I convert each column to numeric:
> 			for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
> as.numeric(TEMP[,COL])
> The second approach is ~ 20% faster than the first.  With the second
> approach, the conversion to numeric occupies 50% of the elapsed time.
> 
> Is there some approach that would be much faster?  For example, would a
> vectorized approach to conversion to numeric improve throughput?  Or,
> is there some means to ensure that all data are read as numeric (I
> tried to use colClasses but that triggered an error when the text
> string was encountered).
> 
> ############################
> A dput version of the data is:
> c("TABLE NO.  1", " PTID        TIME        AMT         FORM
> PERIOD      IPRED       CWRES       EVID        CP          PRED
> RES         WRES",
> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
> 0.0000E+00"
> )
> 
> This can be assembled into a large dataset and written to a file named
> FILENAME with the following code:
> cat(c("TABLE NO.  1", " PTID        TIME        AMT        
FORM
> PERIOD      IPRED       CWRES       EVID        CP          PRED
> RES         WRES",
> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
> 0.0000E+00",
> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
> 0.0000E+00"
> )[rep(1:10, 1000)], file="FILENAME", sep="\n")
> 
> 
> Dennis
> 
> 
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.

Fisher Dennis

2012-Nov-28 19:42 UTC

head link

[R] Speeding reading of large file

An interesting approach -- I lose the column names (which I need) but I could
get them with something cute such as:
	1.  read the first few lines only with readLines(FILENAME, n=10)
	2.  use your approach to read.table -- this will grab the column names
	3.  replace the headers in the full version with the correct column names

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

On Nov 28, 2012, at 11:32 AM, David L Carlson wrote:
> Using your first approach, this should be faster 
> 
> raw <- readLines(con=filename)
> dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)],
header=FALSE)
> 
> ----------------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77843-4352
> 
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> project.org] On Behalf Of Fisher Dennis
>> Sent: Wednesday, November 28, 2012 11:43 AM
>> To: r-help at r-project.org
>> Subject: [R] Speeding reading of large file
>> 
>> R 2.15.1
>> OS X and Windows
>> 
>> Colleagues,
>> 
>> I have a file that looks that this:
>> TABLE NO.  1
>> PTID        TIME        AMT         FORM        PERIOD      IPRED
>> CWRES       EVID        CP          PRED        RES         WRES
>>  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
>> 0.0000E+00
>>  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
>> 0.0000E+00
>> 
>> The file is reasonably large (> 10^6 lines) and the two line header
is
>> repeated periodically in the file.
>> I need to read this file in as a data frame.  Note that the number of
>> columns, the column headers, and the number of replicates of the
>> headers are not known in advance.
>> 
>> I have tried two approaches to this:
>> 	First Approach:
>> 		1.  readLines(FILENAME) to read in the file
>> 		2.  use grep to find the repeat headers; strip out the
>> repeat headers
>> 		3.  write() the object to tempfile, read in that temporary
>> file using read.table(tempfile, header=TRUE, skip=1) [an alternative is
>> to use textConnection but that does not appear to speed things]
>> 
>> 	Second Approach:
>> 		1.  TEMP	<- read.table(FILENAME, header=TRUE, skip=1,
>> fill=TRUE, as.is=TRUE)
>> 		2.  get rid of the errant entries with:
>> 			TEMP[!is.na(as.numeric(TEMP[,1])),]
>> 		3.  reading of the character entries forced all columns to
>> character mode.  Therefore, I convert each column to numeric:
>> 			for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
>> as.numeric(TEMP[,COL])
>> The second approach is ~ 20% faster than the first.  With the second
>> approach, the conversion to numeric occupies 50% of the elapsed time.
>> 
>> Is there some approach that would be much faster?  For example, would a
>> vectorized approach to conversion to numeric improve throughput?  Or,
>> is there some means to ensure that all data are read as numeric (I
>> tried to use colClasses but that triggered an error when the text
>> string was encountered).
>> 
>> ############################
>> A dput version of the data is:
>> c("TABLE NO.  1", " PTID        TIME        AMT        
FORM
>> PERIOD      IPRED       CWRES       EVID        CP          PRED
>> RES         WRES",
>> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
>> 0.0000E+00"
>> )
>> 
>> This can be assembled into a large dataset and written to a file named
>> FILENAME with the following code:
>> cat(c("TABLE NO.  1", " PTID        TIME        AMT     
FORM
>> PERIOD      IPRED       CWRES       EVID        CP          PRED
>> RES         WRES",
>> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
>> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00
>> 0.0000E+00",
>> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
>> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00
>> 0.0000E+00"
>> )[rep(1:10, 1000)], file="FILENAME", sep="\n")
>> 
>> 
>> Dennis
>> 
>> 
>> Dennis Fisher MD
>> P < (The "P Less Than" Company)
>> Phone: 1-866-PLessThan (1-866-753-7784)
>> Fax: 1-866-PLessThan (1-866-753-7784)
>> www.PLessThan.com
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>

Apparently Analagous Threads

Search for more maybe matching threads

R help - Nov 2012 - Speeding reading of large file

[R] Speeding reading of large file

[R] Speeding reading of large file

[R] Speeding reading of large file

[R] Speeding reading of large file

Apparently Analagous Threads