I have been lurking in this list a while and searching in the archives to find out how one learns to write fast R code. One solution seems to be to write part of the code not in R but in C. However, after finding a benchmark article (http://www.sciviews.org/other/benchmark.htm) I have been more interested in making the R code itself more efficient. I would like to find more info about this. I have tried to mail the contact person for the benchmark, but I have so far received no reply.

I am not an R programmer (or statistician), so I do not know R well. I am looking for some advice about writing fast R code. What about the different data types, for example? Is there some good place to start to look for more info about this?

Thanks for any pointers
Lennart
`S Programming' (see the FAQ) has a whole chapter with case studies. Beware that what is efficient under one version of S is not necessarily so under another, and that applies to R today vs R in 1999 (when those examples were done). However, the general principles are good for all time.

On Tue, 17 Feb 2004 Lennart.Borgman at astrazeneca.com wrote:
> I have been lurking in this list a while and searching in the archives to
> find out how one learns to write fast R code. One solution seems to be to
> write part of the code not in R but in C. However after finding a benchmark
> article (http://www.sciviews.org/other/benchmark.htm) I have been more
> interested in making the R code itself more efficient. I would like to find
> more info about this. I have tried to mail the contact person for the
> benchmark, but I have so far received no reply.
>
> I am not an R programmer (or statistician) so I do not know R well. I am
> looking for some advice about writing fast R code. What about the different
> data types for example? Is there some good place to start to look for more
> info about this?

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
Lennart -

My two rules are:

1. Be straightforward. Don't try to be too fancy. Don't worry about execution time until you have the WHOLE thing programmed and DOING everything you want it to. Then profile it, if it's really going to be run more than 1000 times. Execution time is NOT the issue. Code maintainability IS.

2. Use vector operations wherever possible. Avoid explicit loops. However, the admonition to avoid loops is probably much less important now than it was with the Splus of 10 or 15 years ago.

(Not that I succeed in obeying these rules myself, all the time.)

Remember: execution time is not the issue. Memory size may be. Clear, maintainable code definitely is.

In my opinion, the occasional questions you will see on this list about incorporating C code, or trying to specify one data type over another, come up only in very unusual, special cases. Almost everything can be done without loops in straight R, if you think about it first.

- tom blackwell - u michigan medical school - ann arbor -
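A minimal sketch of rule 2, with made-up data: the same sum of squares written once as an explicit loop and once as a vector operation. The function names `loop_ss` and `vec_ss` are hypothetical, and the timings printed at the end will differ by machine and R version, so treat them as something to measure rather than a result.

```r
# Hypothetical example data: one million standard-normal draws.
set.seed(1)
x <- rnorm(1e6)

# Explicit loop: accumulate the sum of squares one element at a time.
loop_ss <- function(v) {
  total <- 0
  for (i in seq_along(v)) total <- total + v[i]^2
  total
}

# Vectorised version: square the whole vector, then sum it.
vec_ss <- function(v) sum(v^2)

# Both give the same answer up to floating-point rounding.
stopifnot(isTRUE(all.equal(loop_ss(x), vec_ss(x))))

# Measure rather than guess.
print(system.time(loop_ss(x)))
print(system.time(vec_ss(x)))
```

The vectorised form is also the shorter and easier one to read, which fits rule 1 as much as rule 2.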
When it comes to code optimization, what I've learned from Profs. Lumley & Bates (and also V&R's S Programming) is: Measure it. Write the code in several ways, and test to see how long each one takes. Use Rprof() to see where the code is taking the most time and concentrate on those parts.

This strategy works for time-efficiency, but not necessarily memory-efficiency. For that, I still do not know how to `measure', other than monitoring memory used by the R process via `top' on Linux/Unix or the task manager on Windoze.

HTH,
Andy
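A rough sketch of the "measure it" advice, assuming a deliberately slow, made-up function `slow_fun()` that grows a vector inside a loop. Rprof() and summaryRprof() are the standard profiling calls; the last two lines show object.size() and gc(), which in current versions of R give some view of memory use from inside the session. What the profile reports depends on the machine, so no output is shown.

```r
# Hypothetical slow function: grows a vector inside a loop,
# which forces a copy on every iteration.
slow_fun <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, sqrt(i))
  out
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)            # start the sampling profiler
res <- slow_fun(30000)
Rprof(NULL)                 # stop profiling

# by.self lists the functions in which time was actually spent.
# summaryRprof() fails if no samples were collected, hence the tryCatch.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))

# Memory, seen from inside R: size of one object, and overall usage.
print(object.size(res), units = "Kb")
print(gc())
```

If the profile points at c(), preallocating with numeric(n) or using sqrt(seq_len(n)) directly would be the obvious next things to time.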
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
> Lennart.Borgman at astrazeneca.com
> Sent: Wednesday, February 18, 2004 3:36 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] How to write efficient R code
>
> I have been lurking in this list a while and searching in the archives to
> find out how one learns to write fast R code. One solution seems to be to
> write part of the code not in R but in C. However after finding a benchmark
> article (http://www.sciviews.org/other/benchmark.htm) I have been more
> interested in making the R code itself more efficient. I would like to find
> more info about this. I have tried to mail the contact person for the
> benchmark, but I have so far received no reply.

One way to make your code more efficient is to use "vectorisation" -- vectorise your code. I'm not sure where you can find more information about it, but an example would be to use the apply() function on a data frame instead of using a loop. Avoid loops if you can.

Kevin

--------------------------------------------
Ko-Kang Kevin Wang, MSc(Hon)
SLC Stats Workshops Co-ordinator
The University of Auckland
New Zealand
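To make the apply() suggestion concrete, here is a small sketch on a made-up data frame `d`, computing column means three ways. One caveat worth stating: apply() is itself implemented as a loop in R code, so it is mainly tidier than a hand-written loop, while a dedicated function such as colMeans() does the work in compiled code.

```r
# Hypothetical data: 1000 rows, 5 numeric columns.
set.seed(2)
d <- data.frame(matrix(rnorm(5000), 1000, 5))

# 1. Explicit loop over the columns.
loop_means <- numeric(ncol(d))
for (j in seq_len(ncol(d))) loop_means[j] <- mean(d[[j]])
names(loop_means) <- names(d)

# 2. apply() over columns (MARGIN = 2): tidier, but still a loop inside.
apply_means <- apply(d, 2, mean)

# 3. colMeans(): a single call into compiled code.
col_means <- colMeans(d)

# All three agree.
stopifnot(isTRUE(all.equal(loop_means, apply_means)),
          isTRUE(all.equal(apply_means, col_means)))
```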
You may also be interested in reading the latest article on artima.com (http://www.artima.com/intv/abstreffi.html) where Bjarne Stroustrup (the creator of C++) discusses some of the benefits and costs of abstraction, as well as premature vs. prudent optimisation.

It is important to remember that the key to improving execution speeds is profiling your running code - we're not good at anticipating what parts of a program will be slow. It's much better to run the program and see.

Hadley
Sebastian -

For successive differences within a single column 'x'

    differences <- c(NA, diff(x))

same as

    differences <- c(NA, x[-1] - x[-length(x)])

See help("diff"), help("Subscript"). The second version also works when x is a matrix or a data frame, except now the result is a matrix or data frame of the same size.

    x <- data.frame(matrix(rnorm(1e+5), 1e+4))
    dim(x)             # 10000 10
    differences <- rbind(rep(NA, 10), x[-1, ] - x[-dim(x)[1], ])
    dim(differences)   # 10000 10

However, you write "I need to do this for all the subsets of data created by the numbers in one of the columns of the data frame ..." and I'm not sure I understand how an 'id' column would create many subsets of the data. So the simple examples above may not answer the question you are asking.

- tom blackwell - u michigan medical school - ann arbor -

On Tue, 17 Feb 2004, Sebastian Luque wrote:
> Hi,
>
> In fact, I've been trying to get rid of loops in my code for more
> than a week now, but nothing I try seems to work. It sounds as if
> you have lots of experience with loops, so would appreciate any
> pointers you may have on the following.
>
> I want to create a column showing the difference between the ith
> row and i-1. Of course, the first row won't have any value in it,
> because there is nothing above it to subtract to. This is fairly
> easy to do with a simple loop, but I need to do this for all the
> subsets of data created by the numbers in one of the columns of
> the data frame (say, an id column). I would greatly appreciate
> any idea you may have on this.
>
> Thanks in advance.
>
> Best regards,
> Sebastian
> --
> Sebastian Luque
>
> sluque at mun.ca
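A tiny worked check of the two forms above on a made-up four-element vector, together with the loop they replace. The values in `x` are arbitrary.

```r
# Hypothetical input vector.
x <- c(3, 7, 4, 10)

# The two vectorised forms from the message above.
d1 <- c(NA, diff(x))
d2 <- c(NA, x[-1] - x[-length(x)])

# The explicit loop they replace: row i minus row i-1, NA in row 1.
d3 <- rep(NA_real_, length(x))
for (i in seq_along(x)[-1]) d3[i] <- x[i] - x[i - 1]

stopifnot(identical(d1, d2), identical(d1, d3))
print(d1)   # NA 4 -3 6
```

x[-1] drops the first element and x[-length(x)] drops the last, so subtracting them lines up each value with its predecessor.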
Lennart

To learn about "data types" take a look at the early chapters of An Introduction To R, available at http://cran.r-project.org/manuals.html

Richard

--
Richard E. Remington III
Statistician
KERN Statistical Services, Inc.
PO Box 1046
Boise, ID 83701
Tel: 208.426.0113
KernStat.com
I'm guessing what Sebastian wants is to do the differencing by a stratifying variable such as ID; e.g., the data may look like:

    df <- as.data.frame(cbind(ID=rep(1:5, each=3), x=matrix(rnorm(45), 15, 3)))

So using Tom's solution, one would do something like:

    mdiff <- function(x) x[-1,] - x[nrow(x),]
    sapply(split(df[,-1], df[,1]), mdiff)

There could well be more efficient ways!

Andy
Sorry about the typo. There should be a "-" in front of nrow(x); i.e.,

    mdiff <- function(x) x[-1,] - x[-nrow(x),]

... and sapply() won't work, but lapply() will. So the whole thing looks like:

> do.call("rbind", lapply(split(df[,-1], df$ID), function(x) x[-1,] - x[-nrow(x),]))
             V2         V3         V4
1.2  -0.1250197  0.6446575 -1.0504143
1.3  -0.4104924  0.5638618  2.4117082
2.5  -3.1917997 -1.8687987 -0.9026947
2.6   2.2405199  3.5321711  1.0417581
3.8   1.7029947  0.3666408  0.8117269
3.9  -1.6701011 -0.8246094 -0.9099002
4.11  0.5183960  1.1066630  1.0484818
4.12  0.3563826 -1.9202869 -3.5635572
5.14  2.2746317  2.9820733 -2.4086057
5.15 -2.5767889 -2.5492538 -0.3083154

However, looking at this, I can't imagine this being the most efficient way to go about it. If the IDs are contiguous (i.e., data for the same ID are in consecutive rows), then you can operate on the entire data and then throw out the unwanted row of each ID:

> df.diff <- df[-1, -1] - df[-nrow(df), -1]
> del <- which(diff(as.numeric(df$ID)) != 0)
> del
[1]  3  6  9 12
> df.diff[-del,]
           V2         V3         V4
2  -0.1250197  0.6446575 -1.0504143
3  -0.4104924  0.5638618  2.4117082
5  -3.1917997 -1.8687987 -0.9026947
6   2.2405199  3.5321711  1.0417581
8   1.7029947  0.3666408  0.8117269
9  -1.6701011 -0.8246094 -0.9099002
11  0.5183960  1.1066630  1.0484818
12  0.3563826 -1.9202869 -3.5635572
14  2.2746317  2.9820733 -2.4086057
15 -2.5767889 -2.5492538 -0.3083154

HTH,
Andy

> From: Sebastian Luque [mailto:sluque at mun.ca]
>
> Hi,
>
> This is exactly what I meant Andy, the stratifying variable can be
> thought of as a factor. However, I tried your code and I get the error:
> "Error in Ops.data.frame......- only defined for equally-sized data
> frames". What may be happening?
> The result of 'apply' functions, or 'split' or 'by' and the like give
> lists as results, with a names attribute that, in my case, would have
> the levels of the factor. How can one get the results back to a
> data.frame object, with the newly calculated variables? The indexing for
> lists is not as straight forward as for data frames.
>
> Thanks to both of you for showing me the power of indexing in R functions!
>
> Sebastian
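On the question of getting per-ID results back into the data frame: one possible route, not mentioned in the thread, is ave(), which applies a function within each group and returns a vector aligned with the original rows. The sketch below uses a made-up data frame with a single value column `V2`; the column name `V2.diff` is only illustrative.

```r
# Hypothetical data: 5 IDs with 3 rows each.
set.seed(3)
df <- data.frame(ID = rep(1:5, each = 3), V2 = rnorm(15))

# Within each ID, difference from the previous row; NA for the first row.
# ave() puts each group's result back in the rows it came from,
# so the IDs do not need to be contiguous.
df$V2.diff <- ave(df$V2, df$ID, FUN = function(v) c(NA, diff(v)))

print(head(df, 6))
```

Because the result is an ordinary column, there is no list to unpack afterwards. For several value columns the same call could be repeated per column, for instance with lapply() over the column names.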
Tom Blackwell
2004-Feb-18 13:40 UTC
[R] perhaps 'aggregate()' (was: How to write efficient R code)
Sebastian and Andy -

Yes, Andy has read the question correctly. A similar task that I do quite often is to subtract the mean of a class from all of the members of the class, and do this within every column of a (numeric) data frame. Kurt Hornik's function aggregate() is the one to use. Here's an example using the same data set that he uses in the example on the help page. (Only the commands are shown here. You'll have to try them to see the output, because I cannot cut and paste easily into my email.)

    data(state)
    ls()   # This data set puts individual columns into your workspace,
           # rather than making a data frame of them.
    example <- data.frame(state.abb, state.name, state.region, state.x77)
    str(example)
    means <- aggregate(example[ ,3+seq(8)], list(example[ ,3]), mean)
    str(means)
    residuals <- example[ ,3+seq(8)] - means[as.numeric(example[ ,3]), -1]
    result <- cbind(example[ ,seq(3)], residuals)
    str(result)

-- Ah, I think this example would be easier to read if I had used the columns from the workspace directly, rather than packaging them into a data frame 'example' first, then using numeric subscripts on the data frame. But, at least this illustrates some common ways of subscripting subsets of columns from a data frame.

Again, see help("aggregate"), help("Subscript") to see what I am doing here.

- best - tom blackwell - u michigan medical school - ann arbor -

(Ah, I see that Andy has just replied this morning as well. I'll see what his reply was as soon as I send off this one.)
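A shorter variant of the same idea, using the workspace columns directly and one column only. It follows the aggregate() route for a single variable and, as an alternative not used in the message, compares it with ave(), whose default function is the group mean. The names `income`, `resid_agg` and `resid_ave` are made up for the sketch.

```r
# Built-in data: state.x77 is a numeric matrix, state.region a factor.
income <- state.x77[, "Income"]

# aggregate() route: one mean per region, then index by region code
# to expand the four means to one value per state.
means <- aggregate(income, list(region = state.region), mean)
resid_agg <- income - means$x[as.numeric(state.region)]

# ave() route: the group means come back already expanded per state.
resid_ave <- income - ave(income, state.region)

# Both give the same residuals, and they sum to zero within each region.
stopifnot(isTRUE(all.equal(resid_agg, resid_ave)))
print(round(tapply(resid_ave, state.region, sum), 8))
```

The indexing step means$x[as.numeric(state.region)] relies on aggregate() returning its rows in the order of the factor levels, which is worth checking with str(means) before trusting it on other data.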