Dear Rexperts,

First of all, let me say that R is a wonderful and useful piece of software.

The only thing is that it sometimes takes me a long time to find out how something can be done, especially when aiming to write compact (and efficient) code.

For instance, I have the following (very rudimentary) function, which takes a (very specific) data frame as input and, for certain subsets, calculates the rank correlation between two corresponding columns. The aim is to add up all the rank correlations.

<code>
add.fun <- function(perf.data) {
  ss <- 0
  for (i in 0:29) {
    ss <- ss + cor(subset(perf.data, dataset == i)[3],
                   subset(perf.data, dataset == i)[7],
                   method = "kendall")
  }
  ss
}
</code>

As one can see, this function uses a for-loop. Chapter 9 of 'An Introduction to R' tells us that we should avoid for-loops as much as possible.

Is there an obvious way to avoid the for-loop in this case?

I would like to see something along the lines of (Maple style):

<code>
add( seq(FUN(i), i = 0..29) )
</code>

Greetings,

Stijn.

--
=========================================================================
Dept. of Applied Mathematics and Computer Science, University of Ghent
Krijgslaan 281 - S9, B - 9000 Ghent, Belgium
Phone: +32-9-264.48.91, Fax: +32-9-264.49.95
E-mail: Stijn.Lievens at ugent.be, URL: http://allserv.ugent.be/~slievens/
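For reference, the Maple-style expression in the question has a fairly direct R counterpart in `sum(sapply(0:29, FUN))`. The following is only a minimal sketch under stated assumptions: the toy data frame and the helper name `tau.for` are made up for illustration, and only the `dataset` column and the column positions 3 and 7 are taken from the post.

```r
# Toy stand-in for perf.data (hypothetical): a `dataset` id column plus
# numeric columns, laid out so that columns 3 and 7 are the ones correlated.
set.seed(1)
n <- 30 * 10
perf.data <- data.frame(
  dataset = rep(0:29, each = 10),
  id = seq_len(n),
  a = rnorm(n),                                # column 3
  b = rnorm(n), c = rnorm(n), d = rnorm(n),
  e = rnorm(n)                                 # column 7
)

# Kendall correlation for one value of `dataset`, subsetting only once.
# `tau.for` is a hypothetical helper name, not from the original post.
tau.for <- function(i) {
  d <- perf.data[perf.data$dataset == i, ]
  cor(d[[3]], d[[7]], method = "kendall")
}

# R analogue of Maple's add( seq(FUN(i), i = 0..29) ).
total <- sum(sapply(0:29, tau.for))
```

Note that `sapply` still iterates over `0:29` internally, so this changes the shape of the code rather than removing the iteration.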
Stijn Lievens wrote:
> For instance, I have the following function (very rudimentary) which
> takes a (very specific) data frame as input and for certain subsets
> calculates the rank correlation between two corresponding columns.
> The aim is to add all the rank correlations.
>
> <code>
> add.fun <- function(perf.data) {
>   ss <- 0
>   for (i in 0:29) {
>     ss <- ss + cor(subset(perf.data, dataset == i)[3],
>                    subset(perf.data, dataset == i)[7],
>                    method = "kendall")
>   }
>   ss
> }
> </code>
>
> Is there an obvious way to avoid this for-loop in this case?

Using the lapply function from James's e-mail, I came up with the following:

<code>
sum(as.numeric(lapply(split(perf.data, perf.data$dataset),
                      function(x) cor(x[3], x[7], method = "kendall"))))
</code>

So, first I split the data frame into a list of data frames using split. Then, using lapply, I get a list of correlations, which I convert to numeric and finally sum up.

I definitely avoided the for-loop this way, although I am not sure whether it is more efficient.

Cheers,

Stijn.
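As a possible variation on the split/lapply approach above, `vapply` can return a plain numeric vector directly, so no separate conversion step is needed. This is a hedged sketch rather than the method from the thread: the data frame `toy` is invented, and `[[` is used instead of `[` so that `cor` receives vectors rather than one-column data frames.

```r
# Hypothetical toy data: `dataset` in column 1, six numeric columns after
# it, so that columns 3 and 7 exist as in the thread.
set.seed(42)
toy <- data.frame(dataset = rep(0:4, each = 8),
                  matrix(rnorm(40 * 6), ncol = 6))

# One Kendall correlation per group; numeric(1) makes vapply check that
# every group yields exactly one number.
taus <- vapply(
  split(toy, toy$dataset),
  function(x) cor(x[[3]], x[[7]], method = "kendall"),
  numeric(1)
)

total <- sum(taus)
```

Here `taus` is named by the values of `dataset`, which makes it easy to inspect the individual correlations before summing.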
On Thu, 18 Nov 2004, Stijn Lievens wrote:
> <code>
> add.fun <- function(perf.data) {
>   ss <- 0
>   for (i in 0:29) {
>     ss <- ss + cor(subset(perf.data, dataset == i)[3],
>                    subset(perf.data, dataset == i)[7],
>                    method = "kendall")
>   }
>   ss
> }
> </code>
>
> As one can see this function uses a for-loop. Now chapter 9 of 'An
> introduction to R' tells us that we should avoid for-loops as much as
> possible.

You don't say whether `dataset' is the name of a column in `perf.data'. Assuming it is, and assuming that 0:29 are all the values of `dataset'

    sum(by(perf.data, list(perf.data$dataset),
           function(d) cor(d[,3], d[,7], method="kendall")))

would work.

If this is faster it will be because you don't call subset() twice per iteration, rather than because you are avoiding a loop. However it has other benefits: it doesn't have the variable `i', it doesn't have to change the value of `ss', and it doesn't have the range of `dataset' hard-coded into it. These are all clarity optimisations.

    -thomas
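The point about not hard-coding the range of `dataset' can be illustrated with a small sketch. Everything here except the by() call itself is an assumption: the data frame `perf` and the ids 3, 11 and 20 are invented to show that by() picks up whatever group values are present, and the loop is included only as a cross-check.

```r
# Hypothetical data whose `dataset` values are deliberately not 0:29.
set.seed(7)
ids <- c(3, 11, 20)
perf <- data.frame(dataset = rep(ids, each = 12),
                   matrix(rnorm(36 * 6), ncol = 6))

# by() discovers the groups from the data; no range appears in the code.
by.total <- sum(by(perf, list(perf$dataset),
                   function(d) cor(d[, 3], d[, 7], method = "kendall")))

# Cross-check with an explicit loop that calls subset() once per group.
loop.total <- 0
for (i in unique(perf$dataset)) {
  d <- subset(perf, dataset == i)
  loop.total <- loop.total + cor(d[, 3], d[, 7], method = "kendall")
}
```

Both totals should agree, and the by() version needs no changes if the set of `dataset' values changes.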