Dear Rexperts,
First of all let me say that R is a wonderful and useful piece of
software.
The only thing is that sometimes it takes me a long time to find out how
something can be done, especially when aiming to write compact (and
efficient) code.
For instance, I have the following function (very rudimentary) which
takes a (very specific) data frame as input and for certain subsets
calculates the rank correlation between two corresponding columns.
The aim is to add all the rank correlations.
<code>
add.fun <- function(perf.data) {
  ss <- 0
  for (i in 0:29) {
    ss <- ss + cor(subset(perf.data, dataset == i)[3],
                   subset(perf.data, dataset == i)[7], method = "kendall")
  }
  ss
}
</code>
As one can see this function uses a for-loop. Now chapter 9 of 'An
introduction to R' tells us that we should avoid for-loops as much as
possible.
Is there an obvious way to avoid this for-loop in this case?
I would like to see something along the lines of
(Maple style)
<code>
add( seq(FUN(i), i = 0..29) )
</code>
Greetings
Stijn.
--
=========================================================================
Dept. of Applied Mathematics and Computer Science, University of Ghent
Krijgslaan 281 - S9, B - 9000 Ghent, Belgium
Phone: +32-9-264.48.91, Fax: +32-9-264.49.95
E-mail: Stijn.Lievens at ugent.be, URL: http://allserv.ugent.be/~slievens/
Stijn Lievens wrote:
> For instance, I have the following function (very rudimentary) which
> takes a (very specific) data frame as input and for certain subsets
> calculates the rank correlation between two corresponding columns.
> The aim is to add all the rank correlations.
>
> <code>
> add.fun <- function(perf.data) {
>   ss <- 0
>   for (i in 0:29) {
>     ss <- ss + cor(subset(perf.data, dataset == i)[3],
>                    subset(perf.data, dataset == i)[7], method = "kendall")
>   }
>   ss
> }
> </code>
>
> As one can see this function uses a for-loop. Now chapter 9 of 'An
> introduction to R' tells us that we should avoid for-loops as much as
> possible.
>
> Is there an obvious way to avoid this for-loop in this case?
>

Using the lapply function from James's e-mail, I came up with the
following.
<code>
sum(as.numeric(
  lapply(split(perf.data, perf.data$dataset),
         function(x) cor(x[3], x[7], method = "kendall"))
))
</code>
So, first I split the data frame into a list of data frames using split;
then, using lapply, I get a list of correlations, which I convert to
numeric and finally sum up.
I definitely avoided the for-loop this way, although I am not sure
whether it is more efficient.
Cheers,
Stijn.
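
For anyone wanting to check that the split-based version gives the same
answer as the loop, here is a minimal sketch on made-up data. The data
frame below is a hypothetical stand-in for perf.data (its column names
are invented; only the `dataset` column and column positions 3 and 7
matter), and vapply is swapped in for as.numeric(lapply(...)) so that
each piece is checked to yield exactly one number.

```r
# Hypothetical stand-in for perf.data: 30 subsets of 10 rows, 7 columns.
set.seed(1)
perf.data <- data.frame(
  dataset = rep(0:29, each = 10),
  a = rnorm(300), x = rnorm(300), b = rnorm(300),
  c = rnorm(300), d = rnorm(300), y = rnorm(300)
)

# Loop version, as in the question, but calling subset() once per pass.
add.loop <- function(perf.data) {
  ss <- 0
  for (i in 0:29) {
    piece <- subset(perf.data, dataset == i)
    ss <- ss + cor(piece[[3]], piece[[7]], method = "kendall")
  }
  ss
}

# split() cuts the data frame into one piece per value of `dataset`;
# vapply() applies cor() to each piece and insists on a single numeric.
add.split <- function(perf.data) {
  pieces <- split(perf.data, perf.data$dataset)
  sum(vapply(pieces,
             function(p) cor(p[[3]], p[[7]], method = "kendall"),
             numeric(1)))
}

loop.result  <- add.loop(perf.data)
split.result <- add.split(perf.data)
```

Comparing the two with all.equal(loop.result, split.result) should
return TRUE. Whether either version is faster on real data would have to
be measured, for example with system.time().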
On Thu, 18 Nov 2004, Stijn Lievens wrote:
>
> <code>
> add.fun <- function(perf.data) {
>   ss <- 0
>   for (i in 0:29) {
>     ss <- ss + cor(subset(perf.data, dataset == i)[3],
>                    subset(perf.data, dataset == i)[7], method = "kendall")
>   }
>   ss
> }
> </code>
>
> As one can see this function uses a for-loop. Now chapter 9 of 'An
> introduction to R' tells us that we should avoid for-loops as much as
> possible.

You don't say whether `dataset' is the name of a column in `perf.data'.
Assuming it is, and assuming that 0:29 are all the values of `dataset',

  sum(by(perf.data, list(perf.data$dataset),
         function(d) cor(d[,3], d[,7], method="kendall")))

would work.

If this is faster it will be because you don't call subset() twice per
iteration, rather than because you are avoiding a loop.

However it has other benefits: it doesn't have the variable `i', it
doesn't have to change the value of `ss', and it doesn't have the range
of `dataset' hard-coded into it. These are all clarity optimisations.

	-thomas
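
To illustrate the point about not hard-coding the range, here is a small
sketch on invented data (the data frame, its column names and the "run"
labels are all hypothetical). It puts the by() call next to what is
probably the closest R analogue of the Maple-style
add( seq(FUN(i), i = 0..29) ), namely sum(sapply(0:29, FUN)), and then
relabels the groups to show that the by() version needs no changes.

```r
# Hypothetical stand-in for perf.data: 30 subsets of 8 rows, 7 columns.
set.seed(2)
perf.data <- data.frame(
  dataset = rep(0:29, each = 8),
  p1 = runif(240), score1 = runif(240), p2 = runif(240),
  p3 = runif(240), p4 = runif(240), score2 = runif(240)
)

# by() version from the reply above, wrapped in a function.
add.by <- function(perf.data) {
  sum(by(perf.data, list(perf.data$dataset),
         function(d) cor(d[, 3], d[, 7], method = "kendall")))
}

# Index-based version: FUN(i) is the correlation within subset i, and
# sum(sapply(0:29, FUN)) mirrors add( seq(FUN(i), i = 0..29) ).
FUN <- function(i) {
  d <- perf.data[perf.data$dataset == i, ]
  cor(d[, 3], d[, 7], method = "kendall")
}
index.result <- sum(sapply(0:29, FUN))

# Relabel the groups: the index-based version would need rewriting,
# while add.by() takes whatever values `dataset` happens to contain.
relabelled <- transform(perf.data, dataset = paste0("run", dataset))

by.result         <- add.by(perf.data)
relabelled.result <- add.by(relabelled)
```

All three results should agree up to floating-point rounding (the
relabelled groups are summed in a different order), which can be checked
with all.equal().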