Dear list,

I am running some simulations in R involving reading in several hundred datasets, performing some statistics, and outputting those statistics to file. I have noticed that the time it takes to process a dataset (or, say, a set of 100 datasets) seems to increase as the simulation progresses. Has anyone else noticed this? I am curious to know whether this has to do with how R processes code in loops or whether it might be due to memory usage issues (e.g., repeatedly reading data into the same matrix).

Thanks in advance

Barth

PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance.
> I am running some simulations in R involving reading in several
> hundred datasets, performing some statistics and outputting those
> statistics to file. I have noticed that it seems that the time it
> takes to process a dataset (or, say, a set of 100 datasets) seems
> to take longer as the simulation progresses.

Reading data, e.g. with read.table, can be slow because it does a fair bit of checking of content, guessing data types, etc. So I guess the question is: how is your data stored (files, in what format, a database?) and how do you read it into R? Once we know this, there may be tricks to speed up the data import.

> I am curious to know if this has to do with how R processes
> code in loops or if it might be due to memory usage issues (e.g.,
> repeatedly reading data into the same matrix).

Probably not - I would guess it's the parsing of the input data that is slow.

cu
Philipp

--
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
Maximus-von-Imhof-Forum 3
85354 Freising, Germany
http://webclu.bio.wzw.tum.de/~pagel/
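[Editor's note: a minimal sketch of how one might check whether the import itself is the bottleneck, as Philipp suspects. The file, its dimensions, and the variable names are invented for illustration; they are not taken from Barth's actual simulation.]

```r
# Hypothetical illustration: measure how long the data import takes.
# A temporary file stands in for one of the per-iteration input files.
tmp <- tempfile(fileext = ".txt")
write.table(matrix(rnorm(100 * 1000), nrow = 100), tmp,
            row.names = FALSE, col.names = FALSE)

# Time the import; elapsed time grows with file size and with the
# amount of type-guessing read.table has to do on each column.
t_guess <- system.time(d <- read.table(tmp))
print(t_guess)
```

Wrapping the per-iteration read in system.time() like this, and comparing against the time for the statistics step, shows where the seconds actually go.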
Thank you Philipp for your post. I am reading in:

1. a 3 x 100 item parameter file (floating point and integer data)
2. a 100 x 1000 item response file (integer data)
3. a 6 x 1000 person parameter file (contains simulation condition information, person measures)

4. I am then computing several statistics used in subsequent ROC analyses, the AUCs being stored in a 6000 x 15 matrix of floating point numbers

I am using read.table for #1-#3 and write.table for #4. The process of reading files (#1-#3) and writing to file is done over 6,000 iterations.

Barth
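[Editor's note: a hypothetical sketch of the loop structure Barth describes - a 6000 x 15 result matrix filled one row per iteration, with a single write at the end. The file-naming scheme and the placeholder statistic are assumptions; only the matrix dimensions come from the post.]

```r
# Pre-allocate the full 6000 x 15 AUC matrix once, then fill row by row.
n_iter <- 6000
auc <- matrix(NA_real_, nrow = n_iter, ncol = 15)

for (i in seq_len(n_iter)) {
  # responses <- read.table(sprintf("responses_%04d.txt", i))  # hypothetical per-iteration input
  auc[i, ] <- runif(15)  # placeholder for the 15 AUC statistics
}

# One write.table call at the end instead of appending on every iteration.
out <- tempfile(fileext = ".txt")
write.table(auc, out, row.names = FALSE, col.names = FALSE)
```

Filling a pre-sized matrix this way keeps each iteration's cost constant, whereas growing the result structure inside the loop can make later iterations slower.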
On Thu, Apr 14, 2011 at 06:50:56AM -0500, Barth B. Riley wrote:
> Thank you Philipp for your post. I am reading in:
>
> 1. a 3 x 100 item parameter file (floating point and integer data)
> 2. a 100 x 1000 item response file (integer data)
> 3. a 6 x 1000 person parameter file (contains simulation condition
> information, person measures)
>
> 4. I am then computing several statistics used in subsequent ROC
> analyses, the AUCs being stored in a 6000 x 15 matrix of floating
> point numbers
>
> I am using read.table for #1-#3 and write.table for #4. The process
> of reading files (#1-#3) and writing to file is done over 6,000
> iterations.

A few ideas:

1) Try to use the colClasses argument to read.table. That way R will not have to guess the data type of the columns.

2) When you say 6000 iterations - do you mean you are reading/writing the SAME files over and over again? Or do you have 6000 sets of files? In the former case the obvious advice would be to only read them once.

3) If the input files were generated in R, another option would be to save()/load() them rather than using write.table()/read.table().

4) If they came from some other application, storing everything in a database may speed things up.

5) Is your data on a file server? If yes: try moving it to the local disc temporarily to see if network I/O is limiting your speed.

6) Whatever you try to improve performance - measure the effects rather than rely on your impression (system.time, Rprof, ...) in order to find out which part of the program is actually eating up the most time.

cu
Philipp

--
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
Maximus-von-Imhof-Forum 3
85354 Freising, Germany
http://webclu.bio.wzw.tum.de/~pagel/
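[Editor's note: a minimal sketch of ideas 1) and 3) above. The file name, column names, and column types are invented for the example.]

```r
# Hypothetical input file with a known column layout.
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:1000, score = rnorm(1000)), tmp,
            row.names = FALSE)

# 1) Declare the column types up front so read.table skips type-guessing.
d1 <- read.table(tmp, header = TRUE,
                 colClasses = c("integer", "numeric"))

# 3) For data that only R needs to read back, a binary round-trip
# (here saveRDS()/readRDS(), a close cousin of save()/load()) avoids
# text parsing entirely.
rds <- tempfile(fileext = ".rds")
saveRDS(d1, rds)
d2 <- readRDS(rds)
```

The binary round-trip also preserves column types exactly, so there is nothing to guess on re-import.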
On 04/13/2011 02:55 PM, Barth B. Riley wrote:
> Dear list
>
> I am running some simulations in R involving reading in several
> hundred datasets, performing some statistics and outputting those
> statistics to file. I have noticed that it seems that the time it
> takes to process a dataset (or, say, a set of 100 datasets) seems
> to take longer as the simulation progresses. Has anyone else noticed
> this? I am curious to know if this has to do with how R processes
> code in loops or if it might be due to memory usage issues (e.g.,
> repeatedly reading data into the same matrix).

Hi Barth

The 'it gets slower' symptom is often due to repeatedly 'growing by 1' a list or other data structure, e.g.,

  m = matrix(100000, 100)
  n = 20000
  result = list()
  system.time(for (i in seq_len(n)) result[[i]] = m)

versus 'pre-allocate and fill'

  result = vector("list", n)
  system.time(for (i in seq_len(n)) result[[i]] = m)

The former causes 'result' to be copied on each new assignment, and the size of the copy gets larger each time.

> Thanks in advance
>
> Barth
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024
Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
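[Editor's note: a self-contained, runnable version of Martin's grow-versus-preallocate comparison, with sizes reduced so it finishes quickly. Timings vary by machine and R version, so the example only verifies that both approaches build the same result.]

```r
# A small matrix stored repeatedly into a list, once by growing the
# list inside the loop and once into a pre-allocated list.
m <- matrix(0, 100, 100)
n <- 5000

grow <- function() {
  result <- list()
  for (i in seq_len(n)) result[[i]] <- m   # list extended on each iteration
  result
}

prealloc <- function() {
  result <- vector("list", n)              # full length allocated up front
  for (i in seq_len(n)) result[[i]] <- m
  result
}

t_grow <- system.time(r1 <- grow())["elapsed"]
t_pre  <- system.time(r2 <- prealloc())["elapsed"]
```

Comparing t_grow and t_pre on your own machine (and at your own n) is the reliable way to see how much the copying costs in practice.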
Thanks Martin, this is very helpful.

Barth

-----Original Message-----
From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
Sent: Thursday, April 14, 2011 11:04 AM
To: Barth B. Riley
Subject: Re: [R] for loop performance

On 04/14/2011 07:12 AM, Barth B. Riley wrote:
> Hi Martin
>
> Question--when variables are defined within a function (and not
> returned by the function) the memory they used is deallocated when the
> function returns, correct? That would seem to make sense but since my
> R code is not compiled, I'm not sure how R handles local variables
> within a function.

The answer depends on what the function returns, and perhaps other things. For instance

  f = function() { a = 1; b = 2; a }
  x = f()

'a' and 'b' created inside f() will be garbage collected; there is no way to access their value after f() returns. But in

  g = function() { a = 1; function() {} }
  y = g()

(i.e., g is a function that returns a function) y is now a function, functions have environments (the one in which they were created), and the environment can be accessed, so 'a' is still accessible

  > environment(y)[["a"]]
  [1] 1

and is not available for garbage collection until y is removed.

Martin
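[Editor's note: a runnable restatement of Martin's closure example, making the captured variable visible from the returned function itself rather than only through the console transcript above.]

```r
# A closure keeps its defining environment alive, so locals captured
# there are not garbage collected while the closure exists.
g <- function() {
  a <- 1
  function() a   # the returned closure captures g's environment
}
y <- g()

environment(y)[["a"]]  # 'a' is still reachable through y's environment
y()                    # and the closure can read it: returns 1
```

Only when y itself is removed (and nothing else references that environment) does 'a' become eligible for collection.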