Sebastian Martin Krantz
2019-Feb-26  13:25 UTC
[Rd] Improved Data Aggregation and Summary Statistics in R
Dear Developers,

Having spent time developing and thinking about how data aggregation and summary statistics can be enhanced in R, I would like to present my ideas/efforts in the form of two commands:

The first, which for now I have called 'collap', is an upgrade of aggregate that accommodates and extends the functionality of aggregate in various respects, most importantly to work with multilevel and multi-type data, multiple function calls and highly customized aggregation tasks, with much greater flexibility in the passing of inputs, and tidy output.

The second function, 'qsu', is an advanced and flexible summary command for cross-sectional and multilevel (panel) data (i.e. it can provide overall, between-entity and within-entity statistics, and allows for grouping, custom functions and transformations). It also provides a quick method to compute and output within-transformed data.

Both commands are efficiently built from core R, but provide for optional integration with data.table, which renders them extremely fast on large datasets. An explanation of the syntax, a demonstration and benchmark results are provided in the attached vignette.

Since both commands accommodate existing functionality while adding significant basic functionality, I thought that their addition to the stats package would be a worthwhile consideration. I am happy for your feedback.

Best regards,

Sebastian Krantz

-------------- next part --------------
A non-text attachment was scrubbed...
Name: collap & qsu vignette.pdf
Type: application/pdf
Size: 569278 bytes
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20190226/eb4dd92d/attachment.pdf>
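For readers without the vignette at hand, the following is a minimal base-R sketch of the kind of task the post describes: aggregating columns of different types with different functions, and splitting a variable into between-entity and within-entity components. It does not use 'collap' or 'qsu' and makes no claim about their syntax; the toy data and all object names (d, agg, x_between, x_within) are made up for illustration, and re-centring the within component on the overall mean is one common convention, assumed here.

```r
# Toy panel: 3 entities observed over 4 periods (made-up data).
set.seed(1)
d <- data.frame(
  id     = rep(c("a", "b", "c"), each = 4),
  region = rep(c("north", "north", "south"), each = 4),
  x      = rnorm(12, mean = rep(c(0, 5, 10), each = 4))
)

# Multi-type aggregation with aggregate(): one call per column type,
# mean for the numeric column, first value for the categorical column.
num <- aggregate(d["x"], by = d["id"], FUN = mean)
cat_cols <- aggregate(d["region"], by = d["id"], FUN = function(v) v[1])
agg <- merge(num, cat_cols, by = "id")

# Between and within components via ave() (ave() defaults to the group mean).
d$x_between <- ave(d$x, d$id)
# Within-transformed data: demeaned by entity, re-centred on the overall mean.
d$x_within <- d$x - d$x_between + mean(d$x)

# Overall, between and within summary statistics.
stats <- rbind(
  overall = c(mean = mean(d$x),        sd = sd(d$x)),
  between = c(mean = mean(agg$x),      sd = sd(agg$x)),
  within  = c(mean = mean(d$x_within), sd = sd(d$x_within))
)

print(agg)
print(round(stats, 3))
```

Doing this by hand takes two aggregate() calls, a merge() and an ave() step, which is the repetition the proposed commands are meant to remove.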
Iñaki Ucar
2019-Feb-27  09:23 UTC
[Rd] Improved Data Aggregation and Summary Statistics in R
On Wed, 27 Feb 2019 at 09:02, Sebastian Martin Krantz <sebastian.krantz at graduateinstitute.ch> wrote:
> [...]
> Both commands are efficiently built from core R, but provide for optional
> integration with data.table, which renders them extremely fast on large
> datasets. An explanation of the syntax, a demonstration and benchmark
> results are provided in the attached vignette.

Looks interesting. Sorry if it's there and I didn't find it: is there any package implementing these functions so that we can try them?

Iñaki
Joris Meys
2019-Feb-27  10:17 UTC
[Rd] Improved Data Aggregation and Summary Statistics in R
Dear Sebastian,

Initially I was a bit hesitant to think about yet another way to summarize data, but your illustrations convinced me this is actually a great addition to the toolset currently available in different R packages. Many of us have written custom functions to get the required tables for specific data sets, but this would reduce that effort to simply using the right collap() call.

Like Iñaki, I'm very interested in trying it out if you have the code available somewhere.

Cheers
Joris

--
Joris Meys
Statistical consultant
Department of Data Analysis and Mathematical Modelling
Ghent University
Sebastian Martin Krantz
2019-Feb-27  10:44 UTC
[Rd] Improved Data Aggregation and Summary Statistics in R
Dear Iñaki and Joris,

thank you for the positive feedback! I had attached a code file to the post, but apparently it was removed. I will attach it again to this e-mail; alternatively, both the vignette and the code can be downloaded from the following link:

https://www.dropbox.com/sh/s0k1tiz7el55g1q/AACpri-nruXjcMwUnNcHoycKa?dl=0

Best,
Sebastian
Duncan Murdoch
2019-Feb-27  11:48 UTC
[Rd] Improved Data Aggregation and Summary Statistics in R
On 26/02/2019 8:25 a.m., Sebastian Martin Krantz wrote:
> [...]
> Since both commands accommodate existing functionality while adding
> significant basic functionality, I thought that their addition to the stats
> package would be a worthwhile consideration.

Generally the R Core group is reluctant to incorporate new functions into the base packages. Each function that is added adds to their work, and they already have too much to do. (I am no longer a member of R Core, but I don't think things have changed since I retired.)

It is much easier for them if volunteers publish functions themselves, via contributed packages.

Nowadays GitHub provides a very convenient platform on which you can develop a package containing your functions. If other users find bugs or have suggested improvements, it's very easy for them to send those to you, and you can make the fixes available immediately. Once you are satisfied that it is stable, you can submit it to CRAN, and anyone using R can easily install it.

If you find the prospect of writing a package daunting, you shouldn't. It's actually quite easy, especially if you are using RStudio or ESS (or some other helpful front end). Hadley Wickham's book <http://r-pkgs.had.co.nz/> is a pretty accessible description of a development strategy. (It's not the only strategy, but lots of people use it.)

Duncan Murdoch
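As a rough illustration of how little is needed to get started, here is a minimal sketch that uses only utils::package.skeleton() from base R. This is not the workflow described in the book mentioned above, just one of the other strategies. The function group_means and the package name "mypkg" are hypothetical placeholders, and the skeleton is written to a temporary directory so that nothing in the working directory is touched.

```r
# Hypothetical helper standing in for the functions to be packaged.
group_means <- function(x, g) tapply(x, g, mean)

# Put the function in its own environment so package.skeleton() can find it
# regardless of where this script is run.
src <- new.env()
assign("group_means", group_means, envir = src)

# Write the skeleton under a scratch directory.
pkg_parent <- file.path(tempdir(), "pkg-sketch")
dir.create(pkg_parent, showWarnings = FALSE)

# Generates DESCRIPTION, NAMESPACE and template files in R/ and man/.
utils::package.skeleton(
  name        = "mypkg",        # placeholder package name
  list        = "group_means",  # names of the objects to include
  environment = src,
  path        = pkg_parent,
  force       = TRUE            # overwrite if the sketch is re-run
)

pkg_dir <- file.path(pkg_parent, "mypkg")
print(list.files(pkg_dir))
```

The generated DESCRIPTION and man/ files are templates that still need to be filled in before the package can be built and checked with R CMD build and R CMD check, and before the directory is pushed to a GitHub repository.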
Sebastian Martin Krantz
2019-Feb-28  09:06 UTC
[Rd] Improved Data Aggregation and Summary Statistics in R
Thanks to all who gave feedback so far. There is now a version of the
package on GitHub, which can be installed with

remotes::install_github("SebKrantz/collapse")

Further feedback is still very welcome!