Dear Jim,
Thanks for telling me about gc() - that may help.
Here's some more detail on my function. It implements a methodology for
classifying plant communities into functional groups according to the values
of species traits, and compares the usefulness of alternative classifications
by using them to summarise species abundance data and then correlating
site differences based on these abundance data with site differences based on
environmental variables. The method is described in Pillar et al. (2009), J.
Vegetation Science 20: 334-348.
First the function produces a set of classifications of species by applying
the function agnes() from the package "cluster" to all possible combinations
of the variables of a species-by-traits dataframe Q. It also prepares a
distance matrix dR based on environmental variables for the same sites. Then
a loop takes each classification i in turn and summarises a raw-data
dataframe of species abundances into a shorter dataframe Xi, by grouping
clusters of its rows according to the classification. It then calculates a
distance matrix dXi based on this summary of abundances, and another distance
matrix dQi based on the corresponding variables of the matrix Q directly.
Finally in the loop, mantel.partial() from the package "vegan" is used to run
a partial Mantel test between dXi and dR, conditioned on dQi. The argument
"permutations" is set to zero, and only the Mantel statistic is stored.
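To make the flow concrete, here is a minimal, dependency-free sketch of one
iteration of that loop. Everything in it is an assumption for illustration:
the toy data and names are invented, hclust()/cutree() stand in for agnes(),
the community-weighted-mean construction of dQi is my reading of "based on
the variables of Q", and the hand-rolled partial correlation stands in for
vegan::mantel.partial() with permutations suppressed.

```r
## Toy data: all names and sizes are made up for illustration
set.seed(1)
X  <- matrix(rpois(60, 5), nrow = 10)                  # 10 species x 6 sites (abundances)
Q  <- data.frame(height = rnorm(10), sla = rnorm(10))  # species-by-traits dataframe
dR <- dist(matrix(rnorm(12), nrow = 6))                # site distances from 2 env. variables

## One classification of the species on the chosen trait combination
## (hclust() + cutree() stand in for cluster::agnes())
cl <- cutree(hclust(dist(Q)), k = 4)

## Summarise abundances by summing species rows within clusters
Xi  <- rowsum(X, cl)
dXi <- dist(t(Xi))                  # site distances from the summarised abundances

## One reading of "dQi based on the variables of Q": site distances from
## community-weighted trait means (an assumption about the method's detail)
cwm <- t(X) %*% as.matrix(Q) / colSums(X)
dQi <- dist(cwm)

## Partial Mantel statistic as a partial correlation of the distance vectors
## (partial_mantel is a hypothetical helper, not vegan's implementation)
partial_mantel <- function(dxy, dxz, dyz) {
  x <- as.vector(dxy); y <- as.vector(dxz); z <- as.vector(dyz)
  rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
  (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
}
r <- partial_mantel(dXi, dR, dQi)
```

In the real function this whole body sits inside the loop over candidate
classifications, with only r and a record of the variables used being kept.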
The loop also contains a forward stepwise selection procedure so that not all
classifications are actually used. After all classifications using (e.g.) a
single variable have been used, the variable(s) involved in the best
classification(s) are specified for inclusion in the next cycle of the loop. I
wonder how lucid that all was...
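In outline, the forward-selection cycle I mean looks something like this;
score() here is a hypothetical, trivial stand-in for the whole
classify-summarise-Mantel step, and the trait names are made up:

```r
## Generic forward-stepwise skeleton (a sketch, not the actual code):
## score() is a hypothetical stand-in for evaluating one variable subset
vars  <- c("height", "sla", "ldmc")                           # made-up trait names
score <- function(s) -length(setdiff(c("height", "sla"), s))  # toy criterion

chosen      <- character(0)
best_so_far <- -Inf
repeat {
  candidates <- setdiff(vars, chosen)
  if (length(candidates) == 0) break
  scores <- vapply(candidates, function(v) score(c(chosen, v)), numeric(1))
  if (max(scores) <= best_so_far) break   # stop when no candidate improves
  best_so_far <- max(scores)
  chosen <- c(chosen, candidates[which.max(scores)])
}
```

Each pass only evaluates subsets extending the current best, which is why not
all possible classifications get used.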
I began putting together the main parts of the code, but I fear it takes so much
explanation (not to mention editing for transparency) that it may not be worth
the effort unless someone is really committed to following it through - it's
about 130 lines in total. I could still do this if it's likely to be
worthwhile...
However, the stepwise procedure was only fully implemented after I sent my
first email. Now that none of the iterative output is stored except the final
Mantel statistic and essential records of which classifications were used,
the memory demand has decreased. The problem now is simply that the function
still takes a very long time to run (e.g. 10 hours to work through 7
variables stepwise, with distance matrices of dimension 180 or so).
Two parts of the code that feel clumsy to me already are:

    unstack(stack(by(X, f, colSums)))

(to reduce a dataframe X to a dataframe with fewer rows by summing within
sets of rows defined by the factor f) and:

    V <- list()
    for (i in 1:n) V[[i]] <- grep(pattern[i], x)
    names(V) <- 1:q
    V <- stack(V)
    V[, 1]

(to get the indices of multiple matches from x, which is a vector of variable
names some of which may be repeated, for each of several character strings in
pattern).
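For comparison, both tasks have shorter vectorised equivalents that I believe
do the same job; the toy inputs below are invented just to show the shapes:

```r
## Toy inputs, invented for illustration
X <- data.frame(a = 1:4, b = 5:8)
f <- factor(c("g1", "g1", "g2", "g2"))

## rowsum() sums within sets of rows defined by f in one call,
## in place of unstack(stack(by(X, f, colSums)))
Xsum <- rowsum(X, f)

## All match indices at once, in place of the explicit loop over pattern
x       <- c("ht", "sla", "ht.sla")
pattern <- c("ht", "sla")
idx <- unlist(lapply(pattern, grep, x = x), use.names = FALSE)
```

rowsum() is compiled internally, so it should also be noticeably faster than
the by()/stack()/unstack() round trip on a large dataframe.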
The code also involves relating columns of dataframes to each other using
character matching - e.g. naming columns with paste()-ed strings of all the
variables used to create them, and then strsplit()-ing these names so that
columns can be selected that contain any of the specified variable names.
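As a small illustration of that matching (the names and the "." separator
are my invention here, not taken from the actual code):

```r
## Hypothetical paste()-ed column names; "." as separator is an assumption
nms    <- c("ht", "sla", "ht.sla", "ht.ldmc")
wanted <- c("sla", "ldmc")

parts <- strsplit(nms, ".", fixed = TRUE)
keep  <- vapply(parts, function(p) any(p %in% wanted), logical(1))
nms[keep]   # names containing any wanted variable
```

The same keep vector can then index the columns of the dataframe directly.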
I'm grateful for any advice!
Thanks, Richard.
________________________________
From: jim holtman <jholtman@gmail.com>
To: Richard Gunton <r.gunton@talk21.com>
Sent: Tuesday, 8 September, 2009 2:08:51 PM
Subject: Re: [R] How to reduce memory demands in a function?
Can you at least post what the function is doing and, better yet,
provide commented, minimal, self-contained, reproducible code. You
can put in calls to memory.size() to see how large things are growing,
delete temporary objects when not needed, make calls to gc(), etc.,
but it is hard to tell without an example.
On Mon, Sep 7, 2009 at 4:16 AM, Richard Gunton <r.gunton@talk21.com> wrote:
I've written a function that regularly throws the "cannot allocate
vector of size X Kb" error, since it contains a loop that creates large
numbers of big distance matrices. I'd be very grateful for any simple advice
on how to reduce the memory demands of my function. Besides increasing
memory.size to the maximum available, I've tried reducing my
"dist" objects to 3 sig. fig.s (not sure if that made any difference),
I've tried the distance function daisy() from package "cluster"
instead of dist(), and I've avoided storing unnecessary intermediary objects
as far as possible by nesting functions in the same command. I've even
tried writing each of my dist() objects to a text file, one line for each, and
reading them in again one at a time as and when required, using scan() - and
although this seemed to avoid the memory problem, it ran so slowly that it
wasn't much use for someone with deadlines to meet...
I don't have formal training in programming, so if there's something
handy I should read, do let me know.
Thanks,
Richard Gunton.
Postdoctoral researcher in arable weed ecology, INRA Dijon.