On 12-04-01 2:58 AM, baptiste auguie wrote:
> Dear list,
>
> I am trying to find a fast solution to read moderately large (1 -- 10
> million entries) text files containing only tab-delimited numeric
> values. My test file is the following,
>
> nr<- 1000
> nc<- 5000
>
> m<- matrix(round(rnorm(nr*nc),3), nrow=nr)
> write.table(m, file = "a.txt", append=FALSE, sep = "\t",
> row.names = FALSE, col.names = FALSE)
>
>
> scan() is faster than read.table(), as expected, but still quite slow
> compared to Matlab for example. Based on archived discussions on this
> list and Stack Overflow, I tried readChar(); it's really fast.
> However, it returns a long character string, where I really want
> numeric values. I can use as.numeric(strsplit()), but to my complete
> surprise it is faster to run scan() on this text string. Consider the
> following comparison (I use the command-line tool wc to get the size
> of the input in advance, so that memory can be allocated up front),
Tell it the types of the columns, and it will go a bit faster.
Duncan Murdoch
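
A minimal sketch of what declaring the types could look like, assuming the hint applies both to scan() (through its 'what' argument) and to read.table() (through 'colClasses'). The smaller matrix, the temporary file, and the names x, d and m_scan are chosen here for illustration and are not from the original post; no timings are claimed.

```r
## Illustration only: a smaller test file than the one in the original post.
nr <- 100
nc <- 50
set.seed(1)
m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
f <- tempfile(fileext = ".txt")
write.table(m, file = f, sep = "\t", row.names = FALSE, col.names = FALSE)

## scan(): 'what' declares the type of the values,
## 'nmax' caps the number of values to be read.
x <- scan(f, what = numeric(), nmax = nr * nc, quiet = TRUE)

## read.table(): 'colClasses' avoids guessing the column types,
## 'nrows' gives the number of rows in advance.
d <- read.table(f, colClasses = "numeric", nrows = nr, comment.char = "")

## scan() returns the values row by row, so fill by row to recover m.
m_scan <- matrix(x, nrow = nr, byrow = TRUE)
```

Both results can be checked against the original matrix with all.equal(m_scan, m) and all.equal(unname(as.matrix(d)), m).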
>
> load_file1<- function(f){
> ## ask wc the number of words
> n<- scan(textConnection(system(paste("wc -w ", f), intern=TRUE)),
> what=list(integer(), character()), quiet=TRUE)[[1]]
> all<- scan(f, nmax=n, quiet=TRUE)
> invisible(all)
> }
>
> load_file2<- function(f){
> ## ask wc the number of characters
> n<- scan(textConnection(system(paste("wc -m ", f), intern=TRUE)),
> what=list(integer(), character()), quiet=TRUE)[[1]]
> tc<- textConnection(readChar(f, n))
> all<- scan(tc, quiet=TRUE, multi.line = FALSE)
> close(tc)
> invisible(all)
> }
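
The character count that load_file2 obtains from wc can also be obtained inside R, which avoids the system call. A sketch, assuming a single-byte encoding (such as the C locale) so that the file size in bytes equals the number of characters; load_file3 is a hypothetical name and not part of the original post.

```r
load_file3 <- function(f) {
  ## File size in bytes; this equals the character count
  ## only for single-byte text (an assumption of this sketch).
  n <- file.info(f)$size
  ## Read the whole file into one string, then parse it from memory.
  tc <- textConnection(readChar(f, n, useBytes = TRUE))
  on.exit(close(tc))
  scan(tc, quiet = TRUE)
}

## Small demonstration file (illustrative data only).
f <- tempfile(fileext = ".txt")
writeLines(c("1.5\t2\t-3.25", "4\t5e-1\t6"), f)
v <- load_file3(f)
```

Unlike load_file1 and load_file2, this version returns the vector visibly and closes the connection through on.exit(), so it is also released if scan() fails.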
>
>
> system.time(a<- load_file1("a.txt"))
> ## user system elapsed
> ## 7.805 0.138 8.026
> system.time(b<- load_file2("a.txt"))
> ## user system elapsed
> ## 2.182 0.301 2.538
> all.equal(a, b)
> ## [1] TRUE
>
>
> Could someone explain to me why it is faster to scan a textConnection
> than the original file? Have I missed a better solution?
>
> Thanks,
>
> baptiste
>
> sessionInfo()
> R version 2.15.0 RC (2012-03-29 r58868)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.