thr3ads.net - R help - [R] read big text file into R [Aug 2007]


Yupu Liang

2007-Aug-23 18:29 UTC

[R] read big text file into R

Dear Rs:

Hi, I am trying to read a big text file (nrows=243440, ncols=144). It
seems the computational time of all the read methods
(scan, read.table, read.delim) is not linear in the number of rows I
want to read in: things became really slow once I tried to read in
100000 lines compared to 10000 lines.

If I am reading the profiling result right, I guess scan wouldn't  
help either.

My questions are :
1) Is this a memory issue?
2) How can I get around this? I can't just sit around for 15 mins.
Would writing a C function help?

Thanks!

Here is the profiling I did:

 > Rprof()
 > dd = read.delim(file,skip=9,sep="\t",as.is= T,nrows=10000)
 > Rprof(NULL)
 > summaryRprof()
$by.self
                self.time self.pct total.time total.pct
"scan"              3.56     85.2       3.56      85.2
"type.convert"      0.48     11.5       0.48      11.5
"read.table"        0.08      1.9       4.18     100.0
"make.names"        0.02      0.5       0.02       0.5
"options"           0.02      0.5       0.02       0.5
"readLines"         0.02      0.5       0.02       0.5
"read.delim"        0.00      0.0       4.18     100.0
"file"              0.00      0.0       0.02       0.5
"getOption"         0.00      0.0       0.02       0.5

$by.total
                total.time total.pct self.time self.pct
"read.table"         4.18     100.0      0.08      1.9
"read.delim"         4.18     100.0      0.00      0.0
"scan"               3.56      85.2      3.56     85.2
"type.convert"       0.48      11.5      0.48     11.5
"make.names"         0.02       0.5      0.02      0.5
"options"            0.02       0.5      0.02      0.5
"readLines"          0.02       0.5      0.02      0.5
"file"               0.02       0.5      0.00      0.0
"getOption"          0.02       0.5      0.00      0.0

$sampling.time
[1] 4.18

 > ?Rprof()
 > Rprof()
 > dd = read.delim(file,skip=9,sep="\t",as.is= T,nrows=100000)
 > Rprof(NULL)
 > summaryRprof()
$by.self
                  self.time self.pct total.time total.pct
"scan"              143.12     92.7     143.12      92.7
"type.convert"        9.52      6.2       9.52       6.2
"read.table"          1.60      1.0     154.28      99.9
"paste"               0.02      0.0       0.08       0.1
"textConnection"      0.02      0.0       0.04       0.0
".deparseOpts"        0.02      0.0       0.02       0.0
"file"                0.02      0.0       0.02       0.0
"make.names"          0.02      0.0       0.02       0.0
"print.default"       0.02      0.0       0.02       0.0
"read.delim"          0.00      0.0     154.28      99.9
"doTryCatch"          0.00      0.0       0.08       0.1
"gsub"                0.00      0.0       0.08       0.1
"try"                 0.00      0.0       0.08       0.1
"tryCatch"            0.00      0.0       0.08       0.1
"tryCatchList"        0.00      0.0       0.08       0.1
"tryCatchOne"         0.00      0.0       0.08       0.1
"capture.output"      0.00      0.0       0.06       0.0
"deparse"             0.00      0.0       0.02       0.0
"eval.with.vis"       0.00      0.0       0.02       0.0
"evalVis"             0.00      0.0       0.02       0.0
"print"               0.00      0.0       0.02       0.0

$by.total
                  total.time total.pct self.time self.pct
"read.table"         154.28      99.9      1.60      1.0
"read.delim"         154.28      99.9      0.00      0.0
"scan"               143.12      92.7    143.12     92.7
"type.convert"         9.52       6.2      9.52      6.2
"paste"                0.08       0.1      0.02      0.0
"doTryCatch"           0.08       0.1      0.00      0.0
"gsub"                 0.08       0.1      0.00      0.0
"try"                  0.08       0.1      0.00      0.0
"tryCatch"             0.08       0.1      0.00      0.0
"tryCatchList"         0.08       0.1      0.00      0.0
"tryCatchOne"          0.08       0.1      0.00      0.0
"capture.output"       0.06       0.0      0.00      0.0
"textConnection"       0.04       0.0      0.02      0.0
".deparseOpts"         0.02       0.0      0.02      0.0
"file"                 0.02       0.0      0.02      0.0
"make.names"           0.02       0.0      0.02      0.0
"print.default"        0.02       0.0      0.02      0.0
"deparse"              0.02       0.0      0.00      0.0
"eval.with.vis"        0.02       0.0      0.00      0.0
"evalVis"              0.02       0.0      0.00      0.0
"print"                0.02       0.0      0.00      0.0

$sampling.time
[1] 154.36

I am using R 2.5.1 for mac on a Dual 2 GHz PowerPC G5 with 1GB memory.

Gabor Grothendieck

2007-Aug-23 19:21 UTC

[R] read big text file into R

Another option is to read it into a database and from there into R.
RSQLite can read certain text files directly into an SQLite database
without going through R, and from there one can read the data into R.
Alternatively, this post describes how the devel version of the sqldf
package can do it:

http://www.nabble.com/Re%3A-Memory-Experimentation%3A-Rule-of-Thumb-%3D-10-15-Times-the-Memory-p12078165.html
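As a rough illustration of the database route, here is a minimal sketch using read.csv.sql() from the sqldf package. It assumes sqldf (and its RSQLite dependency) is installed; the small temporary file and its column names are made up for the example and stand in for the real 243440 x 144 file, and the header lines skipped in the original read.delim call are not modelled here.

```r
# Sketch only: runs the example when the sqldf package is available.
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # Hypothetical stand-in for the real data: a tiny tab-delimited file.
  path <- tempfile(fileext = ".txt")
  write.table(data.frame(id = 1:3, value = c(1.5, 2.5, 3.5)),
              path, sep = "\t", quote = FALSE, row.names = FALSE)

  # read.csv.sql() loads the file into a temporary SQLite database and
  # returns the query result as a data frame. Inside the SQL statement
  # the table is referred to as "file".
  dd <- read.csv.sql(path, sql = "select * from file",
                     header = TRUE, sep = "\t")
  print(dd)

  unlink(path)
}
```

Because the query is ordinary SQL, a where clause or a column list could be used to pull only part of the file into R, which may matter on a machine with 1GB of memory.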


Charles C. Berry

2007-Aug-23 19:46 UTC

[R] read big text file into R

On Thu, 23 Aug 2007, Yupu Liang wrote:
> Dear Rs:
>
> Hi, I am trying to read a big text file (nrows=243440, ncols=144). It
> seems the computational time of all the read methods
> (scan, read.table, read.delim) is not linear in the number of rows I
> want to read in: things became really slow once I tried to read in
> 100000 lines compared to 10000 lines.
What did 'top' or gc() tell you about memory use, paging, and swap 
behavior?

Likely you are starting to swap when you get to nrows=100000.

Even if your data were all numeric, over 100MB would be needed just to 
store them. And reading them in requires more memory still.
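The "over 100MB" figure can be checked with a little arithmetic, assuming every cell is stored as an 8-byte double and ignoring any per-object overhead. The helper name numeric_mb is made up for this sketch.

```r
# Approximate storage for an all-numeric table: rows * cols * 8 bytes,
# reported in megabytes (1 MB = 1e6 bytes). Overhead is ignored.
numeric_mb <- function(rows, cols, bytes_per_value = 8) {
  rows * cols * bytes_per_value / 1e6
}

partial_mb <- numeric_mb(100000, 144)  # the nrows=100000 read
full_mb    <- numeric_mb(243440, 144)  # the whole file

print(partial_mb)
print(full_mb)
```

With 144 columns, the 100000-row read works out to about 115 MB and the full file to about 280 MB, before counting the temporary copies made while reading. On a 1GB machine that is consistent with the suggestion that swapping may start around nrows=100000.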

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
