Cyclic Group Z_1
2019-Aug-25 04:08 UTC
[Rd] Conventions: Use of globals and main functions
In R scripts (as opposed to packages), even in reproducible scripts, it seems fairly conventional to use the global workspace as a sort of main function, and thus R scripts often populate the global environment with many variables, which may be mutated. Although this makes sense given R has historically been used interactively and this practice is common for scripting languages, this appears to disagree with the software-engineering principle of avoiding a mutating global state. Although this is just a rule of thumb, in R scripts, the frequent use of global variables is much more pronounced than in other languages.

On the other hand, in Python, it is common to use a main function (through the `def main():` and `if __name__ == "__main__":` idioms). This is mentioned both in the documentation as well as in the writing of Python's main creator. Although this is more beneficial in Python than in R because Python code is structured into modules, which serve as both scripts and packages, whereas R separates these conceptually, a similar practice of creating a main function would help avoid the issues from mutating global state common to other languages and facilitate maintainability, especially for longer scripts.

Although many great R texts (Advanced R, Art of R Programming, etc.) caution against assignment in a parent enclosure (e.g., using `<<-`, or `assign`), I have not seen many promote the use of a main function and avoiding mutating global variables from top level.

Would it be a good idea to promote use of main functions and limiting global-state mutation for longer scripts and dedicated applications (not one-off scripts)? Should these practices be mentioned in the standard documentation?

This question was motivated largely by this discussion on Reddit: https://www.reddit.com/r/rstats/comments/cp3kva/is_mutating_global_state_acceptable_in_r/ . Apologies beforehand if any of these (partially subjective) assessments are in error.

Best,
CG
This is what I usually put in scripts:

    if (is.null(sys.calls())) {
      main()
    }

This is mostly equivalent to the Python idiom. If the script runs from Rscript, then it will run main(). It also lets you source() the script, and debug its functions, test them, etc. It works best if all the code in the script is organized into functions.

Gabor

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
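A minimal sketch of how a whole script might look when organized around the guard described above. The helper `summarise_values()` and the sample data are hypothetical and only serve as illustration; the guard itself is the one quoted in the message.

```r
# Hypothetical helper: operates only on its argument, no global state.
summarise_values <- function(x) {
  c(n = length(x), mean = mean(x))
}

# All "application" work happens in main(); its variables stay local.
main <- function() {
  values <- c(2, 4, 6)              # illustrative data, not from the thread
  result <- summarise_values(values)
  print(result)
  invisible(result)
}

# Run main() only at top level (e.g. `Rscript script.R`).
# Under source(), sys.calls() is non-NULL, so main() is skipped and the
# functions can be inspected, debugged, or tested individually.
if (is.null(sys.calls())) {
  main()
}
```

With this layout, sourcing the file only defines `summarise_values()` and `main()`; nothing else is left behind in the workspace.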
On 25/08/2019 12:08 a.m., Cyclic Group Z_1 via R-devel wrote:
> Would it be a good idea to promote use of main functions and limiting global-state mutation for longer scripts and dedicated applications (not one-off scripts)? Should these practices be mentioned in the standard documentation?

Lexical scoping means that all of the problems of global variables are available to writers who use main(). You could treat the evaluation frame of your main function exactly like the global workspace: define functions within it, read and modify local variables from those functions, etc.

The benefit of using main() if you avoid defining all the other functions within it is that other functions normally operate on their arguments with few side effects. You achieve this in R by putting those other functions in packages, and running those functions in short scripts. That's how I've always recommended large projects be organized. You don't want a long script for anything, and you don't want multiple source files unless they're in a package.

Duncan Murdoch
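The point about lexical scoping can be sketched as follows. Both function names are hypothetical. The first variant treats the frame of main() like a global workspace; the second keeps helpers outside main() so that they work only on their arguments.

```r
# Variant 1: a helper defined inside main() mutates main()'s local
# variable through `<<-`, reproducing the global-state problem one level down.
main_shared_state <- function() {
  total <- 0
  add <- function(x) {
    total <<- total + x   # modifies `total` in the enclosing main frame
    invisible(NULL)
  }
  for (v in 1:3) add(v)
  total
}

# Variant 2: the helper lives outside main() and has no side effects;
# state changes are visible as explicit assignments in main().
add_pure <- function(total, x) total + x

main_pure <- function() {
  total <- 0
  for (v in 1:3) total <- add_pure(total, v)
  total
}
```

Both variants compute the same sum; the difference is only in where the mutation is visible to a reader.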
Cyclic Group Z_1
2019-Aug-25 21:46 UTC
[Rd] Conventions: Use of globals and main functions
This seems like a nice idiom; I've seen others use

    if (!interactive()) {
      main()
    }

to a similar effect.

Best,
CG

On Sunday, August 25, 2019, 01:16:06 AM CDT, Gábor Csárdi <csardi.gabor at gmail.com> wrote:

> This is what I usually put in scripts:
>
>     if (is.null(sys.calls())) {
>       main()
>     }
>
> This is mostly equivalent to the Python idiom. If the script runs from Rscript, then it will run main(). It also lets you source() the script, and debug its functions, test them, etc. It works best if all the code in the script is organized into functions.
>
> Gabor
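The two guards are not identical when a script is source()d from a non-interactive session. The following sketch, with a hypothetical helper `main_runs_when_sourced()`, writes a throwaway script using a given guard, sources it, and reports whether main() ran. It is an illustration under these assumptions, not a statement about every way of running R.

```r
# Hypothetical helper: build a tiny script around `guard` (given as text),
# source() it into a fresh environment, and return whether main() was called.
main_runs_when_sourced <- function(guard) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  writeLines(c(
    "main <- function() TRUE",
    "ran <- FALSE",
    sprintf("if (%s) ran <- main()", guard)
  ), script)
  env <- new.env()
  source(script, local = env)
  env$ran
}

# Under source(), sys.calls() is never NULL, so main() is skipped.
sys_calls_guard <- main_runs_when_sourced("is.null(sys.calls())")

# This guard depends only on the session type, so main() runs under
# source() whenever the session is non-interactive (e.g. inside Rscript).
interactive_guard <- main_runs_when_sourced("!interactive()")
```

In other words, `!interactive()` distinguishes batch from interactive sessions, while `is.null(sys.calls())` distinguishes top-level execution from being sourced.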
Cyclic Group Z_1
2019-Aug-25 23:09 UTC
[Rd] Conventions: Use of globals and main functions
This is a fair point; structuring functions into packages is probably ultimately the gold standard for code organization in R. However, lexical scoping in R is really not much different than in other languages, such as Python, in which use of main functions and defining other named functions outside of main are encouraged. For example, in Scheme, from which R derives its scoping rules, the community generally organizes code with almost exclusively functions and few non-function global variables at top level. The common use of globals in R seems to be mostly a consequence of historical interactive use and, relatedly, an inherited practice from S.

It is true, though, that since anonymous functions (such as in lapply) play a large part in idiomatic R code, as you put it, "[l]exical scoping means that all of the problems of global variables are available to writers who use main()." Nevertheless, using a main function with other functions defined outside it seems like a good quick alternative that offers similar advantages to making a package when functions are tightly coupled to the script and the project may not be large or generalizable enough to warrant making a package.

Best,
CG
Hey,

I always found it a strength of R compared to many other languages that simple things (running a script, doing something interactive, writing a function, using lambdas, installing packages, getting help, ...) are very, very simple. R is a command-line statistics program that happens to be a very elegant, simple and consistent programming language too.

That being said, I think the main task of scripts is to get things done by running them end to end in a fresh session.

Now, it very well may happen that a lot of stuff has to be done. Then splitting up scripts into subscripts and sourcing them from a meta script is a straightforward solution.

It might also be that some functionality is put into functions to be reused in other places. This can be done by putting those function definitions into separate files. Then one can use source() wherever those functions are needed.

Now, putting stuff that runs code and stuff that defines/provides functions into the same script is a bad idea. Using the main() idioms described might prevent the problems stemming from mixing function definitions and function execution. But it would also encourage this mixing, which is, I think, a bad idea anyway.

Therefore, I am against fostering a main() idiom: it adds complexity and encourages bad code structuring (putting application code and function definition code into one file).

If one needs code to behave differently in interactive sessions than in non-interactive sessions, if (interactive()) { } is one way to solve this. If more solid software development is needed, packages are the way to go.

Best,
Peter
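The layout Peter describes, with function definitions in one file and application code in another, might look like the sketch below. The temporary file stands in for a hypothetical `functions.R`, and `standardize()` is an illustrative function, not something from the thread.

```r
# Stand-in for a separate file that only defines functions
# (sourcing it executes no application code).
functions_file <- tempfile(fileext = ".R")
writeLines(
  "standardize <- function(x) (x - mean(x)) / sd(x)",
  functions_file
)

# Application script: load the definitions, then run end to end.
source(functions_file)
z <- standardize(c(1, 2, 3))
print(z)

unlink(functions_file)
```

Here the application script still creates top-level variables such as `z`; the separation only concerns where functions are defined versus where they are run.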
> this appears to disagree with the software-engineering principle of avoiding a mutating global state

I disagree.
In embedded systems engineering, for example, it's customary to use global variables to represent ports.

Also, I note that the use of global variables is similar to using pen and paper to do mathematics and statistics (which is good). Whether that's consistent with software engineering principles or not, I don't know.

However, I partly agree with you. Given that there's interest from various parties in running R in various ways, it may be good to document some of the options available.

"Running R" (in "R Installation and Administration") links to "Appendix B Invoking R" (in "An Introduction to R"). However, these sections do not cover the topics in this thread.
> "Running R" (in "R Installation and Administration") links to
> "Appendix B Invoking R" (in "An Introduction to R").
> However, these sections do not cover the topics in this thread.

Sorry, I made a mistake. It is in the documentation (B.4 Scripting with R), e.g. (excerpts only):

    R CMD BATCH "--args arg1 arg2" foo.R &

    args <- commandArgs(TRUE)

    Rscript foo.R arg1 arg2
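The command-line excerpts above combine naturally with a main function. A minimal sketch, assuming a script invoked as `Rscript foo.R arg1 arg2`; taking the arguments as a parameter of main() is a design choice made here so the function can be called directly in tests, not something prescribed by the documentation.

```r
# main() receives the trailing command-line arguments as a parameter,
# defaulting to what Rscript passed after the script name.
main <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) == 0) {
    message("usage: Rscript foo.R arg1 arg2 ...")
    return(invisible(character(0)))
  }
  for (i in seq_along(args)) {
    cat(sprintf("arg %d: %s\n", i, args[[i]]))
  }
  invisible(args)
}

# Only run automatically at top level, as in the idiom from earlier in the thread.
if (is.null(sys.calls())) {
  main()
}
```

From an interactive session, the same logic can be exercised with `main(c("arg1", "arg2"))` after sourcing the file.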
Henrik Bengtsson
2019-Aug-27 22:39 UTC
[Rd] Conventions: Use of globals and main functions
FWIW, one could imagine introducing a helper function global():

    global <- function(expr) {
      eval(substitute(expr), envir = globalenv(), enclos = baseenv())
    }

to make it explicit that any assignments (and evaluation in general) take place in the global environment, e.g.

    > local({ global(a <- 2) })
    > a
    [1] 2

That "looks" nicer than assign("a", 2, envir = globalenv()) and it's safer than assuming a <<- 2 will "reach" the global environment.

/Henrik

On Tue, Aug 27, 2019 at 3:19 PM Abby Spurdle <spurdle.a at gmail.com> wrote:
> > this appears to disagree with the software-engineering principle of avoiding a mutating global state
>
> I disagree.
> In embedded systems engineering, for example, it's customary to use
> global variables to represent ports.
>
> Also, I note that the use of global variables, is similar to using pen
> and paper, to do mathematics and statistics.
> (Which is good).
> Whether that's consistent with software engineering principles or not,
> I don't know.
>
> However, I partly agree with you.
> Given that there's interest from various parties in running R in
> various ways, it may be good to document some of the options
> available.
>
> "Running R" (in "R Installation and Administration") links to
> "Appendix B Invoking R" (in "An Introduction to R").
> However, these sections do not cover the topics in this thread.
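Why `<<-` may not "reach" the global environment can be sketched as follows. The `global()` helper is the one proposed in the message; `outer_fn()`, `n_calls`, and `flag_set` are hypothetical names used only for this illustration.

```r
# Helper as proposed above: evaluate an expression in the global environment.
global <- function(expr) {
  eval(substitute(expr), envir = globalenv(), enclos = baseenv())
}

outer_fn <- function() {
  n_calls <- 0                   # local variable in outer_fn's frame
  inner_fn <- function() {
    n_calls <<- n_calls + 1      # finds and modifies outer_fn's `n_calls`,
                                 # so nothing is assigned globally
    global(flag_set <- TRUE)     # explicitly assigned in the global environment
    n_calls
  }
  inner_fn()
}

result <- outer_fn()
```

After this runs, `flag_set` exists in the global environment while `n_calls` does not. One caveat of this sketch: because the expression is evaluated in the global environment, it cannot refer to local variables of the calling function.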
Cyclic Group Z_1
2019-Aug-28 03:16 UTC
[Rd] Conventions: Use of globals and main functions
Definitely, I agree that global variables have a place in programming. They play an especially important role in low-level software, such as embedded programming, as you mentioned, and systems programming. I generally would disagree with anyone that says global variables should never be used, and they may be the best implementation option when something is "truly global."

However, in R scripting conventions, they are the default. I don't think it is controversial to say that in software engineering culture, there is a generally held principle that global variables should be minimized because they can be dangerous (granted, the original "Globals considered harmful" article is quite old, and many of the criticisms not applicable to modern languages). I do think it is equally important, though, to understand when to break this rule.

I like your suggestion of documenting this as an alternative option, though it seems the general sentiment is against this, which I respect.

Best,
CG
Cyclic Group Z_1
2019-Aug-28 03:56 UTC
[Rd] Conventions: Use of globals and main functions
> That being said, I think the main task of scripts is to get things done by running them end to end in a fresh session. Now, it very well may happen that a lot of stuff has to be done. Then splitting up scripts into subscripts and sourcing them from a meta script is a straightforward solution. It might also be that some functionality is put into functions to be reused in other places. This can be done by putting those function definitions into separate files. Then one can use source() wherever those functions are needed. Now, putting stuff that runs code and stuff that defines/provides functions into the same script is a bad idea. Using the main() idioms described might prevent the problems stemming from mixing function definitions and function execution. But it would also encourage this mixing, which is, I think, a bad idea anyway.

I actually would agree entirely that files should not serve as both source files for re-used functions as well as application code. The suggestion for a main() idiom is merely to reduce variable scope and bring R practices more in line with generally recommended programming practices, not so that they can act as packages/modules/libraries. When I compared R scripts containing main functions to packages, I only meant in the sense that they help manage scope (the latter through package namespaces). Any other named functions besides main would be functions specifically tied to the script.

I do see your point, though, that this could result in bad practice, namely the usage mixing you described.

Best,
CG