Dennis Fisher
2010-Jan-09 15:59 UTC
[R] Reducing the size of a large script to speed onset of execution
Colleagues,

(R 2.10 on all platforms)

I have a lengthy script (18000 lines) that runs within a graphical interface. The script consists of hundreds of functions followed by a single command that calls these functions (execution depends on a number of environment variables passed to the script). As a result, nothing is executed until the final line of code is read. It takes 15-20 seconds to load the code - I would like to speed that process. Two questions:

1. The code contains numerous large blocks that are executed under only one set of conditions (which are known when the code is called). For example, there might be code such as:

    if (CONDITION)
        {
        ... (hundreds of lines of code, including embedded curly brackets)
        } else invisible()
    if (!CONDITION)
        {
        ... (hundreds of lines of code, including embedded curly brackets)
        }

I assume that I could speed loading appreciably if I set up two scripts, each of which excluded "irrelevant" code depending on the CONDITION. For example, if I knew that CONDITION was false, I would exclude the first block of code above; conversely, if I knew that CONDITION was true, I would exclude the second block.

I would like to write code in R (or in sed [UNIX stream editor]) to create these two new scripts. However, the regular expressions that would be needed are beyond me and I would appreciate help from this forum. Specifically, I would like to search for:

    if (CONDITION
or
    if (!CONDITION

as the start of the block and

    }   - the matching curly bracket

at the end of the block, then remove those lines from the code. These text entries are always on a line by themselves. Finding the "if (CONDITION" line should be relatively easy. The difficulty for me is identifying the matching curly bracket - there are often paired brackets within the block of code:

    if (CONDITION)
        {
        ...
        if (SOMETHINGELSE) { }
        if (YETANOTHER)
            {
            }
        }    <- this is the bracket that I need to match

There are also instances in which the entire block occurs on one line:

    if (CONDITION) { ...} else invisible()
or
    if (CONDITION ... else invisible()

Of note, I can remove the "else invisible()" statements if they are problematic to a solution.

2. A related issue regards loading in the graphical interface vs. loading at the command line (OS X). The graphical interface loads in 15-20 seconds - the graphical interface is sending code as rapidly as it can. In contrast, at the command line, the code is source()'d and it takes 30-40 seconds. I would have expected the latter approach to be as fast or faster because R would accept code as fast as it could.

Does anyone have an explanation for this behavior? Also, any ideas as to how to speed the process at the command line would be appreciated. Thanks for any suggestions.

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
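For question 1, the matching bracket can be found without regular expressions for the block itself, by counting braces line by line. The following is only a minimal sketch in R under the assumptions stated in the post: the "if (CONDITION" text starts its own line, and the opening brace is on that line or the one right after it. The function names strip_blocks and count_char, and the demo lines, are hypothetical. Braces inside strings or comments are not handled and would confuse the count.

```r
# Count occurrences of a single literal character in one string.
count_char <- function(x, ch) {
  sum(strsplit(x, "", fixed = TRUE)[[1]] == ch)
}

# Drop every block that starts on a line matching `start_regex` and ends
# on the line where its curly brackets balance again.
strip_blocks <- function(lines, start_regex) {
  keep <- rep(TRUE, length(lines))
  i <- 1
  while (i <= length(lines)) {
    if (!grepl(start_regex, lines[i])) {
      i <- i + 1
      next
    }
    brace_here <- grepl("{", lines[i], fixed = TRUE)
    brace_next <- i < length(lines) && grepl("^[[:space:]]*\\{", lines[i + 1])
    if (!brace_here && !brace_next) {
      # Braceless one-liner: drop just this line.
      keep[i] <- FALSE
      i <- i + 1
      next
    }
    depth <- 0
    opened <- FALSE
    j <- i
    while (j <= length(lines)) {
      n_open <- count_char(lines[j], "{")
      if (n_open > 0) opened <- TRUE
      depth <- depth + n_open - count_char(lines[j], "}")
      if (opened && depth == 0) break  # matching bracket is on line j
      j <- j + 1
    }
    if (j > length(lines)) stop("unbalanced braces after line ", i)
    keep[i:j] <- FALSE
    i <- j + 1
  }
  lines[keep]
}

# Hypothetical miniature of the script layout described in the post.
demo <- c(
  "f <- function() 1",
  "if (CONDITION)",
  "{",
  "  if (SOMETHINGELSE) { }",
  "  if (YETANOTHER)",
  "  {",
  "  }",
  "} else invisible()",
  "if (!CONDITION)",
  "{",
  "  g <- function() 2",
  "}",
  "h <- function() 3"
)

re_true  <- "^[[:space:]]*if[[:space:]]*\\(CONDITION"
re_false <- "^[[:space:]]*if[[:space:]]*\\(!CONDITION"

when_false <- strip_blocks(demo, re_true)   # script for CONDITION == FALSE
when_true  <- strip_blocks(demo, re_false)  # script for CONDITION == TRUE
one_line   <- strip_blocks(
  c("if (CONDITION) { x <- 1 } else invisible()", "y <- 2"), re_true)
```

On a real file, the same function could be applied to the result of readLines() and the output written back with writeLines(). A trailing "else invisible()" on the closing-bracket line is removed together with the block.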
Prof Brian Ripley
2010-Jan-09 16:53 UTC
[R] Reducing the size of a large script to speed onset of execution
Please just make a package; then all the effort of parsing the code is done at install time, and you can use lazy-loading ....

Or, if you are for some reason averse to that, source the code into an environment, save that, and simply attach() its save file next time.

Packages of that size load in a few milliseconds (as you see each time you start R: stats is 27000 lines). source() is doing more work: it allows for guessing encodings, keeps references to the original sources, and backs out code if the whole script does not parse ....

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
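The second suggestion in the reply (source into an environment, save it, attach the save file) might look roughly like the sketch below. This is an illustration under assumptions, not the poster's actual setup: the two-function temporary file stands in for the 18000-line script, and the names script, cache and "script_cache" are hypothetical. It uses sys.source() with keep.source = FALSE so that references to the original source text are not kept.

```r
# Hypothetical stand-in for the large script: a temp file with two functions.
script <- tempfile(fileext = ".R")
writeLines(c("add_one <- function(x) x + 1",
             "double_it <- function(x) 2 * x"), script)

# One-time step: parse and evaluate the script into its own environment ...
env <- new.env()
sys.source(script, envir = env, keep.source = FALSE)

# ... and save every object in that environment to a binary file.
cache <- tempfile(fileext = ".RData")
save(list = ls(env, all.names = TRUE), envir = env, file = cache)

# In later sessions: attach the saved file instead of re-parsing the source.
attach(cache, name = "script_cache")

# The functions are now found on the search path.
result <- add_one(1)
```

In practice the one-time step would be rerun whenever the script changes, and the final command that launches the analysis would be issued after attach().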