Mark,
Thanks for your suggestions.
That's a good idea about the NULL columns; I didn't think of that.
Surprisingly, it didn't have any effect on the time.
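For reference, this is roughly what I tried with the NULL columns (the
4 / 3696 split is just illustrative, borrowing the numbers from your
example):

    system.time(
        dat <- read.table('C:/test.txt', sep='\t', header=TRUE,
                          colClasses=c(rep("character", 4), rep("NULL", 3696)))
    )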
This problem is now mostly a curiosity; I already did the import using
Excel and VBA. I was going to use it to illustrate the power and
simplicity of R, but ironically it has been much slower and harder in R...
The VBA was painful and messy, and took me over an hour to write; but at
least it worked quickly and reliably.
The R code was clean and only took me about 5 minutes to write, but the run
time was prohibitively slow!
I profiled the code, but the results don't offer me much insight.
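In case it helps, the profiles below were generated with something like:

    Rprof("C:/Users/gene.leynes/Desktop/test.out")
    dat <- read.table('C:/test.txt', nrows=-1, sep='\t', header=TRUE)
    Rprof(NULL)
    summaryRprof("C:/Users/gene.leynes/Desktop/test.out")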
Profile results with 10 line file:
> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self
             self.time self.pct total.time total.pct
scan             12.24    53.50      12.24     53.50
read.table       10.58    46.24      22.88    100.00
type.convert      0.04     0.17       0.04      0.17
make.names        0.02     0.09       0.02      0.09

$by.total
             total.time total.pct self.time self.pct
read.table        22.88    100.00     10.58     46.24
scan              12.24     53.50     12.24     53.50
type.convert       0.04      0.17      0.04      0.17
make.names         0.02      0.09      0.02      0.09

$sample.interval
[1] 0.02

$sampling.time
[1] 22.88
Profile results with 250 line file:
> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self
             self.time self.pct total.time total.pct
scan             23.88    68.15      23.88     68.15
read.table       10.78    30.76      35.04    100.00
type.convert      0.30     0.86       0.32      0.91
character         0.02     0.06       0.02      0.06
file              0.02     0.06       0.02      0.06
lapply            0.02     0.06       0.02      0.06
unlist            0.02     0.06       0.02      0.06

$by.total
               total.time total.pct self.time self.pct
read.table          35.04    100.00     10.78     30.76
scan                23.88     68.15     23.88     68.15
type.convert         0.32      0.91      0.30      0.86
sapply               0.04      0.11      0.00      0.00
character            0.02      0.06      0.02      0.06
file                 0.02      0.06      0.02      0.06
lapply               0.02      0.06      0.02      0.06
unlist               0.02      0.06      0.02      0.06
simplify2array       0.02      0.06      0.00      0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 35.04
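If anyone is curious, the usual workaround I've seen for not wanting to
type 3700 column classes by hand is to let read.table guess them from a
few rows and then reuse the result; I haven't verified whether it helps
with this particular file:

    first5  <- read.table('C:/test.txt', sep='\t', header=TRUE, nrows=5)
    classes <- sapply(first5, class)
    system.time(
        dat <- read.table('C:/test.txt', sep='\t', header=TRUE,
                          colClasses=classes)
    )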
On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2@gmail.com> wrote:
> hi gene: maybe someone else will reply with some subtleties that I'm not
> aware of. one other thing that might help: if you know which columns you
> want, you can set the others to "NULL" through colClasses and this should
> speed things up also. For example, say you knew you only wanted the first
> four columns and they were character. then you could do,
>
> read.table(whatever, as.is=TRUE,
>            colClasses = c(rep("character", 4), rep("NULL", 3696)))
>
> hopefully someone else will say something that does the trick. it seems
> odd to me as far as the difference in timings goes. good luck.
>
> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes@gmail.com> wrote:
>
>> Mark,
>>
>> Thank you for the reply
>>
>> I neglected to mention that I had already set
>> options(stringsAsFactors=FALSE)
>>
>> I agree, skipping the factor determination can help performance.
>>
>> The main reason that I wanted to use read.table is because it will
>> correctly determine the column classes for me. I don't really want to
>> specify 3700 column classes! (I'm not sure what they are anyway).
>>
>>
>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2@gmail.com> wrote:
>>
>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>>> If you know what your variables are ahead of time and what you want
>>> them to be, this allows you to be specific by specifying character or
>>> numeric, etc., and often it makes things faster. others will have more
>>> to say.
>>>
>>> also, if most of your variables are characters, R will try to convert
>>> them into factors by default. If you use as.is = TRUE it won't do this
>>> and that might speed things up also.
>>>
>>>
>>> Rejoinder: the above tidbits are just from experience. I don't know if
>>> it's set in stone or a hard and fast rule.
>>>
>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes@gmail.com> wrote:
>>>
>>>> ** Disclaimer: I'm looking for general suggestions **
>>>> I'm sorry, but I can't send out the file I'm using, so there is no
>>>> reproducible example.
>>>>
>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>>>> file. The strange thing is that it takes roughly the same amount of
>>>> time if the file is 100 times larger.
>>>>
>>>> After re-reviewing the R Data Import/Export manual I think the best
>>>> approach would be to use Python, or perhaps the readLines function,
>>>> but I was hoping to understand why the simple read.table approach
>>>> wasn't working as expected.
>>>>
>>>> Some relevant facts:
>>>>
>>>> 1. There are about 3700 columns. Maybe this is the problem? Still,
>>>>    the file size is not very large.
>>>> 2. The file encoding is ANSI, but I'm not specifying that in the
>>>>    function. Setting fileEncoding="ANSI" produces an "unsupported
>>>>    conversion" error.
>>>> 3. readLines imports the lines quickly.
>>>> 4. scan imports the file quickly also.
>>>>
>>>>
>>>> Obviously, scan and readLines would require more coding to identify
>>>> columns, etc.
>>>>
>>>> my code:
>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>>>>                               header=TRUE))
>>>>
>>>> It's taking 33.4 seconds and the file size is only 315 KB!
>>>>
>>>> Thanks
>>>>
>>>> Gene
>>>>