Antonin Klima
2017-May-08 12:08 UTC
[Rd] A few suggestions and perspectives from a PhD student
Thanks for the answers,

I'm aware of the '.' option, just wanted to give a very simple example.

But the use of the '...' parameter of lapply had eluded me, and thanks for enlightening me.

What do you mean by messing up the call stack? As far as I understand it, piping should translate into the same code as deep nesting. So then I only see a tiny downside for debugging here. No loss of time/space efficiency or anything. Your example, by contrast, carries a chance of inadvertent error, since a variable is being reused and no one checks for me whether it is actually being passed between the lines. It also means having to specify the variable every single time. For me, that solution is clearly inferior.

Too bad you didn't find my other comments interesting though.

> Why do you think being implemented in a contributed package restricts
> the usefulness of a feature?

I guess it depends on your philosophy. It may not restrict it per se, although it would make a lot of sense to me to reuse the bash-style '|' and have a shorter, more readable version. One takes on an extra dependency on a package for an item that fits the language so well that it should be part of it. It is without doubt my most used operator at least. Going through some of my folders I found 101 uses in 750 lines, and 132 uses in 3303 lines. I would compare it to a computer game that is really good with a fan-created mod, but lacking otherwise. :)

So to me, it makes sense that if there is no doubt that a feature improves the language, and especially if people already use it extensively through a package, it should be part of the "standard". The question is whether it is indeed very popular, and whether you share my view. But that's now up to you, I just wanted to point it out I guess.
Best Regards,
Antonin

> On 05 May 2017, at 22:33, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
>
> Regarding the anonymous-function-in-a-pipeline point one can already
> do this which does use brackets but even so it involves fewer
> characters than the example shown. Here { . * 2 } is basically a
> lambda whose argument is dot. Would this be sufficient?
>
> library(magrittr)
>
> 1.5 %>% { . * 2 }
> ## [1] 3
>
> Regarding currying note that with magrittr Ista's code could be written as:
>
> 1:5 %>% lapply(foo, y = 3)
>
> or at the expense of slightly more verbosity:
>
> 1:5 %>% Map(f = . %>% foo(y = 3))
>
>
> On Fri, May 5, 2017 at 1:00 PM, Antonin Klima <antonink at idi.ntnu.no> wrote:
>> Dear Sir or Madam,
>>
>> I am in the 2nd year of my PhD in bioinformatics, after taking my Master's in computer science, and have been using R heavily during my PhD. As such, I have put together a list of certain features in R that, in my opinion, would be beneficial to add, or could be improved. The first two are already implemented in packages, but the fact that they are implemented as user-defined operators greatly restricts their usefulness. I hope you will find my suggestions interesting. If you find time, I will welcome any feedback as to whether you find the suggestions useful, or why you do not think they should be implemented. I would also welcome it if you enlightened me about any features I might be unaware of that might solve the issues I have pointed out below.
>>
>> 1) piping
>> Currently available in package magrittr, piping makes the code more readable by having the line start at its natural starting point, followed by the functions that are applied - in order. The readability of several nested calls with a number of parameters each is almost zero; it's almost as if one would need to come up with the solution oneself. A pipeline in comparison is very straightforward, especially together with point (2).
>>
>> The package here works rather well nevertheless; the shortcomings of piping not being native are not quite as severe as in point (2). Still, an intuitive symbol such as | would be helpful, and it sometimes bothers me that I have to parenthesize an anonymous function, which would probably not be required with a native pipe operator, much like it is not required in e.g. lapply. That is,
>> 1:5 %>% function(x) x+2
>> should be totally fine
>>
>> 2) currying
>> Currently available in package Curry. The idea is that, having a function such as foo = function(x, y) x+y, one would like to write for example lapply(1:5, foo(3)), and have the interpreter figure out: ok, foo(3) does not produce a value result, but it can still give a function result - a function of y. This would indeed be most useful for the various apply functions, rather than writing function(x) foo(3,x).
>>
>> I suggest that currying would make the code easier to write, and more readable, especially when using apply functions. One might imagine that there could be some confusion with such a feature, especially among people unfamiliar with functional programming, although R already accepts functions as ordinary arguments, so it could be just fine. But one could address it with special syntax, such as $foo(3) [$foo(x=3)] for partial application. The current currying package has very limited usefulness, as, being limited by the user-defined operator framework, it only rarely can contribute to less code/more readability. Compare for yourself:
>> $foo(x=3) vs foo %<% 3
>> goo = function(a,b,c)
>> $goo(b=3) vs goo %><% list(b=3)
>>
>> Moreover, one would often like currying to have the highest priority. For example, when piping:
>> data %>% foo %>% foo1 %<% 3
>> if one wants to do data %>% foo %>% $foo(x=3)
>>
>> 3) Code executable only when running the script itself
>> Whereas the first two suggestions are somewhat stealing from Haskell and the like, this suggestion would be stealing from Python.
>> I'm building quite a complicated pipeline, using S4 classes. After defining the class and its methods, I also define how to build the class to my liking, based on my input data, using the various now-defined methods. So I end up having a list of command line arguments to process, and the way to create the class instance based on them. If I write it into the class file, however, I end up running the code when it is sourced from the next step in the pipeline, which needs the previous class definitions.
>>
>> A feature such as the pythonic "if __name__ == '__main__'" would thus be useful. As it is, I had to create run scripts as separate files. Which is actually not so terrible, given the class and its methods often span a few hundred lines, but still.
>>
>> 4) non-exported global variables
>> I also find it lacking that I seem to be unable to create constants that would not get passed to files that source the class definition. That is, if class1 features the global constant CONSTANT=3, then if class2 sources class1, it will also include the constant. This 1) clutters the namespace when running the code interactively, 2) potentially overwrites the constants in case of a name clash. Some kind of export/nonexport variable syntax, or symbolic import, or namespace would be useful. I know if I converted it to a package I would get at least something like a namespace, but still.
>>
>> I understand that the variable cannot simply be left unimported, in general, as the functions will generally rely on it (otherwise it wouldn't have to be there). But one could consider hiding it in an implicit namespace for the file, for example.
>>
>> 5) S4 methods with same name, for different classes
>> Say I have an S4 class called datasetSingle, and another S4 class called datasetMulti, which gathers up a number of datasetSingle objects, and adds some extra functionality on top.
>> The datasetSingle class may have a method replicates, that returns a named vector assigning replicate numbers to the experiment names of the dataset. But I would also like to have a function with the same name for the datasetMulti class, that returns a data frame, or list, covering the replicate numbers for all the datasets included.
>>
>> But then, I need to call setGeneric for the method. But if I set the generic before both implementations, I will reset the generic in the second call, losing the definition of "replicates" for datasetSingle. Skipping this in the code for datasetMulti means that 1) I have to remember that I had the function defined for datasetSingle, 2) if I remove the function or change its name in datasetSingle, I now have to change the datasetMulti class file too. Moreover, if I would like to have a different generic for the datasetMulti version, I have to change it not in the datasetMulti class file, but in the datasetSingle file, where it might not make much sense. In this case, I wanted to have another argument "datasets", which would return the replicates only for the datasets specified, rather than for all.
>>
>> I made a wrapper that could circumvent the first issue, but the second issue is not easy to circumvent.
>>
>> 6) Many parameters freeze S4 method calls
>> If I specify more than about 6 parameters for an S4 method, I would often get a "freeze" on the method call. The process would eat up a lot of memory before going into the call, upon which it would execute the call as normal (if it didn't run out of memory or I didn't run out of patience). Subsequent calls of the method would not include this overhead. The amount of memory this could take could be in gigabytes, and the time in minutes. I suspect this might be due to generating an entry in the call table for each accepted signature. It can be circumvented, but it sure isn't a behaviour one would expect.
>>
>> 7) Default values for S4 methods
>> It would seem that it is not possible to set up default parameters for an S4 method in the usual way of definition = function (x, y=5). I resorted to making class unions with "missing" for signatures on the call, with the call starting with if(missing(param)) param=DEFAULT_VALUE, but it certainly does not improve readability or ease of coding.
>>
>>
>> Thank you for your time if you have finished reading thus far. :) Looking forward to any answer.
>>
>> Yours Sincerely,
>> Antonin Klima
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
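
The partial-application point discussed in this message can be sketched in base R without extra operators. This is only a minimal illustration, assuming the two-argument foo from the original message; the helper partial is hypothetical, defined here purely for the example, and is not a base R function.

```r
foo <- function(x, y) x + y

# lapply() forwards extra named arguments to FUN through `...`.
via_dots <- lapply(1:5, foo, y = 3)

# Hypothetical helper defined only for this sketch:
# fix some arguments now, supply the remaining ones later.
partial <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(list(...), fixed))
}

via_partial <- lapply(1:5, partial(foo, y = 3))

# An explicit anonymous function gives the same result.
via_lambda <- lapply(1:5, function(x) foo(x, y = 3))
```

All three forms should produce the same list; the closure-based helper trades a little indirection for not having to spell out function(x) each time.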
Ista Zahn
2017-May-08 14:37 UTC
[Rd] A few suggestions and perspectives from a PhD student
On Mon, May 8, 2017 at 8:08 AM, Antonin Klima <antonink at idi.ntnu.no> wrote:
> Thanks for the answers,
>
> I'm aware of the '.' option, just wanted to give a very simple example.
>
> But the lapply '...' parameter use has eluded me and thanks for enlightening me.
>
> What do you mean by messing up the call stack. As far as I understand it, piping should translate into same code as deep nesting.

Perhaps, but then magrittr is not really a pipe. Here is a simple example:

    library(magrittr)
    data.frame(x = 1) %>% subset(y == 1)
    ## Error in eval(e, x, parent.frame()) : object 'y' not found
    traceback()
    ## 12: eval(e, x, parent.frame())
    ## 11: eval(e, x, parent.frame())
    ## 10: subset.data.frame(., y == 1)
    ## 9: subset(., y == 1)
    ## 8: function_list[[k]](value)
    ## 7: withVisible(function_list[[k]](value))
    ## 6: freduce(value, `_function_list`)
    ## 5: `_fseq`(`_lhs`)
    ## 4: eval(quote(`_fseq`(`_lhs`)), env, env)
    ## 3: eval(quote(`_fseq`(`_lhs`)), env, env)
    ## 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
    ## 1: data.frame(x = 1) %>% subset(y == 1)

    subset(data.frame(x = 1), y == 1)
    ## Error in eval(e, x, parent.frame()) : object 'y' not found
    traceback()
    ## 4: eval(e, x, parent.frame())
    ## 3: eval(e, x, parent.frame())
    ## 2: subset.data.frame(data.frame(x = 1), y == 1)
    ## 1: subset(data.frame(x = 1), y == 1)

It does pollute the call stack, making debugging harder.

> So then I only see a tiny downside for debugging here. No loss of time/space efficiency or anything. With a chance of inadvertent error in your example, coming from the fact that a variable is being reused and no one now checks for me whether it is being passed between the lines. And with having to specify the variable every single time. For me, that solution is clearly inferior.

There are tradeoffs. As demonstrated above, the pipe is clearly inferior in that it is doing a lot of complicated stuff under the hood, and when you try to traceback() through the call stack you have to sift through all that complicated stuff. That's a pretty big drawback in my opinion.

> Too bad you didn't find my other comments interesting though.

I did not say that.

>> Why do you think being implemented in a contributed package restricts
>> the usefulness of a feature?
>
> I guess it depends on your philosophy. It may not restrict it per se, although it would make a lot of sense to me reusing the bash-style '|' and have a shorter, more readable version. One has extra dependence on a package for an item that fits the language so well that it should be its part. It is without doubt my most used operator at least. Going to some of my folders I found 101 uses in 750 lines, and 132 uses in 3303 lines. I would compare it to having a computer game being really good with a fan-created mod, but lacking otherwise. :)

One of the key strengths of R is that packages are not akin to "fan created mods". They are a central and necessary part of the R system.

> So to me, it makes sense that if there is no doubt that a feature improves the language, and especially if people extensively use it through a package already, it should be part of the "standard". Question is whether it is indeed very popular, and whether you share my view. But that's now up to you, I just wanted to point it out I guess.
>
> Best Regards,
> Antonin
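
The call-stack point above can be illustrated without any package. The following is a rough sketch under stated assumptions: toy_pipe is a hypothetical stand-in written for this example and is not how magrittr is implemented; it only shows that routing a call through a wrapper adds frames between the caller and the function that eventually runs.

```r
# Reports how many function frames are active when it is called.
depth <- function(x) sys.nframe()

# Toy pipe written only for this sketch (not magrittr's implementation):
# it applies each supplied function in turn from inside a loop.
toy_pipe <- function(value, ...) {
  for (f in list(...)) value <- f(value)
  value
}

direct <- depth(1)            # called directly
piped  <- toy_pipe(1, depth)  # called through the toy pipe

# Number of frames the wrapper adds between the caller and depth().
extra_frames <- piped - direct
```

Here the wrapper contributes a single frame; an implementation with more internal layers contributes correspondingly more, which is what the longer traceback() listing reflects.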
Hadley Wickham
2017-May-08 22:29 UTC
[Rd] A few suggestions and perspectives from a PhD student
> There are tradeoffs. As demonstrated above, the pipe is clearly
> inferior in that it is doing a lot of complicated stuff under the
> hood, and when you try to traceback() through the call stack you have
> to sift through all that complicated stuff. That's a pretty big
> drawback in my opinion.

To be precise, that is a problem with the current implementation of the pipe. It's not a limitation of the pipe per se.

Hadley

--
http://hadley.nz
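
One way to read the distinction between the pipe and its implementation: a pipe can be written as a plain rewrite of `lhs OP f(args)` into the nested call `f(lhs, args)`, so that the nested call is what actually gets evaluated. The sketch below assumes a hypothetical operator `%|%` invented for this example; it is not magrittr's code and handles only the simple first-argument case.

```r
# Hypothetical operator for this sketch: `lhs %|% f(args)` is rewritten
# to `f(lhs, args)`, and the rewritten call is what gets evaluated.
`%|%` <- function(lhs, rhs) {
  lhs_expr <- substitute(lhs)
  rhs_expr <- substitute(rhs)
  if (is.call(rhs_expr)) {
    # Insert lhs as the first argument of the call on the right.
    new_call <- as.call(c(list(rhs_expr[[1]], lhs_expr), as.list(rhs_expr)[-1]))
  } else {
    # A bare function name: call it with lhs as its only argument.
    new_call <- as.call(list(rhs_expr, lhs_expr))
  }
  # Evaluate the nested call in the caller's environment.
  eval(new_call, parent.frame())
}

piped   <- data.frame(x = 1:3) %|% subset(x > 1)
nested  <- subset(data.frame(x = 1:3), x > 1)
chained <- 1:5 %|% rev %|% head(2)
```

The operator itself and the eval() call still appear on the stack in this version; a rewrite performed by the parser before evaluation would avoid even those frames, which is one way a pipe could avoid the traceback overhead.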
Hilmar Berger
2017-May-09 07:47 UTC
[Rd] A few suggestions and perspectives from a PhD student
Hi,

On 08/05/17 16:37, Ista Zahn wrote:
> One of the key strengths of R is that packages are not akin to "fan
> created mods". They are a central and necessary part of the R system.

I would tend to disagree here. R packages are in their majority not maintained by the core R developers. Concepts, features and lifetime depend mainly on the maintainers of the package (even though in theory the GPL allows somebody to take over at any time). Several packages that are critical for processing big data and providing "modern" visualizations introduce concepts quite different from the legacy S/R language. I do feel that, in a way, current core R strongly shows its origin in S, while modern concepts (e.g. data.table, dplyr, ggplot, ...) are often only available via extension packages. This is fine if one considers R to be a statistical toolkit; as a programming language, however, it introduces inconsistencies and uncertainties which could be avoided if some of the "modern" parts (including language concepts) were more integrated in core R.

Best regards,
Hilmar

--
Dr. Hilmar Berger, MD
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin
GERMANY

Phone: + 49 30 28460 430
Fax: + 49 30 28460 401
E-Mail: berger at mpiib-berlin.mpg.de
Web : www.mpiib-berlin.mpg.de