Hi:
Are you running 32-bit or 64-bit R? For memory-intensive processes like
these, 64-bit R is almost a necessity. You might also look into more
efficient ways to invert the matrix, especially if it has special properties
that can be exploited (e.g., symmetry). More to the point, you want to
compute the nonparametric MLE as efficiently as you can, since it affects
everything downstream. In addition, if you're trying to do all of this in a
single function, it may be better to break the job up into several
functions, one for each task, with a wrapper function to put them together
(i.e., modularize).
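For example, if the matrix in your Newton step happens to be symmetric
positive definite (just an assumption here, since we haven't seen your code),
you can avoid forming an explicit inverse altogether. A rough sketch:

```r
set.seed(1)
n <- 500
A <- crossprod(matrix(rnorm(n * n), n))  # a symmetric positive definite matrix
b <- rnorm(n)

x1 <- solve(A) %*% b   # explicit inverse, then multiply: slowest, least stable
x2 <- solve(A, b)      # solve the linear system directly: no inverse formed
R  <- chol(A)          # exploit symmetry via the Cholesky factorization
x3 <- backsolve(R, forwardsolve(t(R), b))

all.equal(as.vector(x1), x2)        # the three agree up to numerical error
all.equal(x2, as.vector(x3))
```

If your Newton update only ever needs inverse-times-vector products, the
second or third form saves both time and memory on every iteration.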
Memory problems in R often arise from repeatedly copying objects in memory, or
from accumulating promises in a loop that are not evaluated until the end.
Forcing evaluation or running garbage collection (gc()) at judicious points
can improve efficiency. Pre-allocating memory for result objects is more
efficient than appending a new element to an output vector or matrix on every
iteration. Vectorizing where you can is critical.
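As a small illustration of the copying issue, tracemem() (available when R is
built with memory profiling, as the CRAN binaries are) reports every time an
object is duplicated:

```r
x <- matrix(0, 1000, 1000)
tracemem(x)     # start reporting duplications of x
y <- x          # no copy yet: R shares the data (copy-on-modify)
y[1, 1] <- 1    # modifying y forces a full copy of the 1000 x 1000 matrix
untracemem(x)   # stop tracing
```

Each such duplication inside a loop costs both time and memory, which is why
modifying large objects in place (or avoiding the copies entirely) matters.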
Since you didn't provide any code, one can only speculate about where the
bottleneck(s) in your code lie, but here's a little example I did for someone
recently that shows how much difference vectorization and pre-allocation of
memory can make:
# Problem: Simulate 1000 U(0, 1) random numbers, discretize them
# into a factor and generate a table.

# vectorized version using cut()
f <- function() {
  x <- runif(1000)
  z <- cut(x, breaks = c(-0.1, 0.1, 0.2, 0.4, 0.7, 0.9, 1), labels = 1:6)
  table(z)
}
# use ifelse(), a vectorized function, to divide into groups
g <- function() {
  x <- runif(1000)
  z <- ifelse(x <= 0.1, '1',
       ifelse(x > 0.1 & x <= 0.2, '2',
       ifelse(x > 0.2 & x <= 0.4, '3',
       ifelse(x > 0.4 & x <= 0.7, '4',
       ifelse(x > 0.7 & x <= 0.9, '5', '6')))))
  table(z)
}
# Elementwise loop with preallocation of memory
h <- function() {
  x <- runif(1000)
  z <- character(1000)   # preallocate the result vector
  for(i in 1:1000) {
    z[i] <- if(x[i] <= 0.1) '1' else
            if(x[i] > 0.1 && x[i] <= 0.2) '2' else
            if(x[i] > 0.2 && x[i] <= 0.4) '3' else
            if(x[i] > 0.4 && x[i] <= 0.7) '4' else
            if(x[i] > 0.7 && x[i] <= 0.9) '5' else '6'
  }
  table(z)
}
# Same as h(), but without memory preallocation
# (note: z is not initialized inside the function, so this only runs if a z
# already exists in the calling environment)
h2 <- function() {
  x <- runif(1000)
  for(i in 1:1000) {
    z[i] <- if(x[i] <= 0.1) '1' else
            if(x[i] > 0.1 && x[i] <= 0.2) '2' else
            if(x[i] > 0.2 && x[i] <= 0.4) '3' else
            if(x[i] > 0.4 && x[i] <= 0.7) '4' else
            if(x[i] > 0.7 && x[i] <= 0.9) '5' else '6'
  }
  table(z)
}
# Same as h(), but initialize with an empty vector
h3 <- function() {
  x <- runif(1000)
  z <- character(0)   # empty vector, grown on each iteration
  for(i in 1:1000) {
    z[i] <- if(x[i] <= 0.1) '1' else
            if(x[i] > 0.1 && x[i] <= 0.2) '2' else
            if(x[i] > 0.2 && x[i] <= 0.4) '3' else
            if(x[i] > 0.4 && x[i] <= 0.7) '4' else
            if(x[i] > 0.7 && x[i] <= 0.9) '5' else '6'
  }
  table(z)
}
########## Timings using the function replicate():
> system.time(replicate(1000, f()))
   user  system elapsed
   1.14    0.04    1.20
> system.time(replicate(1000, g()))
   user  system elapsed
   3.90    0.00    3.92
> system.time(replicate(1000, h()))
   user  system elapsed
   9.24    0.00    9.26
> system.time(replicate(1000, h2()))
   user  system elapsed
  15.49    0.00   15.55
> system.time(replicate(1000, h3()))
   user  system elapsed
  15.60    0.03   15.68
The cut()-based version f() is over three times as fast as the vectorized
ifelse() approach g(), and the vectorized ifelse() is in turn well over twice
as fast as the non-vectorized loop with preallocated memory, h(). The h*
functions are all non-vectorized but differ in how they initialize the output
object: full preallocation (h) takes about 60% as long as the versions without
preallocation (h2, h3), and initializing an empty vector is about as fast as
no initialization at all. The effects of vectorization and of pre-allocating
memory for result objects filled in a loop are clear.
If you're carrying around copies of a large n x n matrix in memory over many
iterations of a loop, you are certainly going to gobble up available memory,
no matter how much you have; you can see the result in a much simpler problem
above. I'd recommend that you invest some time improving the efficiency of the
MLE function. Profiling tools like Rprof() are one place to start; you can
find tutorial material on the topic in various places on the web (try Googling
'profiling R functions'), as well as some past discussion in this forum. Use
RSiteSearch() and/or search the mailing list archives for more information.
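A minimal Rprof() session looks something like this (the file name is
arbitrary, and the loop body here is just a stand-in for your MLE step):

```r
Rprof("mle-profile.out")   # start the profiler, writing samples to a file
for (i in 1:20) {
  A <- crossprod(matrix(rnorm(200 * 200), 200))  # stand-in for the real work
  Ainv <- solve(A)
}
Rprof(NULL)                # stop profiling
summaryRprof("mle-profile.out")  # time spent in each function, self and total
```

The summary will tell you which functions dominate the run time, so you know
where optimization effort will actually pay off.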
HTH,
Dennis
On Mon, Aug 23, 2010 at 2:44 PM, Cuckovic Paik
<cuckovic.paik@gmail.com> wrote:
>
> Dear All,
>
> I have an issue on memory use in R programming.
>
> Here is the brief story: I want to simulate the power of a nonparametric
> test and compare it with the existing tests. The basic steps are
>
> 1. I need to use Newton's method to obtain the nonparametric MLE, which
> involves the inversion of a large n-by-n matrix (n = sample size); it
> takes less than 3 seconds on average to get the MLE.
>
>
> 2. Since the test statistic has an unknown sampling distribution, the
> p-value is simulated using Monte Carlo (1000 runs). It takes about 3-4
> minutes to get a p-value.
>
>
> 3. I need to simulate 1000 random samples and repeat steps 1 and 2 to get
> the p-value for each of the simulated samples to get the power of the test.
>
>
> Here is the question:
>
> It initially completes 5-6 simulations per hour; after that, the time
> needed to complete a single simulation increases exponentially. After 24
> hours of running, I only get about 15-20 simulations completed. My computer
> is a PC (Pentium Dual Core CPU 2.5 GHz, RAM 6.00 GB, 64-bit). Apparently,
> the memory is the problem.
>
> I also tried various memory re-allocation procedures; they didn't work.
> Can anybody help with this? Thanks in advance.
>
>