Hi everyone,
Ivan and I had a few discussions several months ago regarding `dendrapply`, but
now that I've had the chance to work on it more specifically and discuss it
Martin Maechler at the R Project Sprint, I figured it would be a good idea to
start a new email chain.
I've refactored `dendrapply`, and the current implementation is available in
my bugzilla report (https://bugs.r-project.org/show_bug.cgi?id=18480). This
project started due to the help page for `dendrapply`, which specifically
requested users to contribute improvements to the function. I'm including a
lot of writeup here because I'm very aware that this is a relatively large
contribution for someone that doesn't have a history of contributing a lot
of code to base, and I'd like to justify the inclusion.
Feel free to skip everything I've written and instead use the following
links. A thorough discussion follows.
- Bugzilla with patch: https://bugs.r-project.org/show_bug.cgi?id=18480
- R Checks: https://github.com/r-devel/r-svn/pull/111
- Discussion at R Project Sprint:
https://github.com/r-devel/r-project-sprint-2023/discussions/6
- Original blog post about this (long, code out of date, but has a simpler
explanation of the implementation):
https://www.ahl27.com/posts/2023/02/dendrapply/
Responses to common questions:
- Why does this project need to be done?
The current implementation in `stats::dendrapply` is recursive, and thus has
issues with deeply nested dendrogram objects. As of 2015, users experienced
issues with recursive operations on dendrograms causing stack overflow issues
(see https://bugs.r-project.org/show_bug.cgi?id=15215). This has been alleviated
by better computers and short-term workarounds, but many users have limited
resources and/or need for large trees. Even with sufficient memory, a recursive
implementation incurs a nontrivial amount of unneccessary computational
overhead. I'll also add that this is a feature that was requested in R
itself (see Note section in `?dendrapply`), and Martin Maechler has been
supportive of the work thus far on it.
- What does this implementation do?
There are a few improvements in this implementation. The first is that function
application to dendrogram objects is no longer recursive. This implementation is
also based in C, providing a performance boost of at least 4x (see later
question for details). Finally, iterative application of functions in C allows
for flexibility in *how* the dendrogram is traversed, which gives end-users a
significant amount of power to work with tree-like structures in an intuitive
way. An easy example is subsetting based on leaf values--if a user wanted to
subset a dendrogram to only certain leaves, there isn't a good way to do
this in base R (even with dendrapply). `how='post.order'` fixes this
problem. I'm happy to provide additional examples if needed.
- Why C? This is harder to maintain than R.
This is a valid point. I did my best to include as much documentation as
possible, and I'm also volunteering myself to help maintain this function. C
provides a lot of power in working with dendrograms, since R's toolkit for
tree-like structures is relatively lacking. This refactor is theoretically
doable in R, but the implementation would involve an immense amount of memory
overhead to ensure we can preserve the node states as we traverse the tree.
There is precedence for a C implementation of dendrapply (see other `*apply`
functions). Additionally, this decreases function application overhead, and
allows future extensions to be written purely in R in a much simpler way. I
think this tradeoff is worth it, but I am happy to discuss implementation
specifics with anyone that is interested.
- Ivan previously mentioned issues with user specific `[[.dendrogram`
implementations, and it doesn't seem that you've fixed that.
This is correct. I discovered during the R project sprint that
`stats::dendrapply` does not respect user-specific implementations of
`[[.dendrogram`. stats::`[[.dendrogram` has its own issues; if the user defines
multiple classes for a dendrogram object, double bracket subsetting will remove
the class (a bug report will be filed for this shortly). My implementation
exactly replicates the performance of stats::`[[.dendrogram`, and if users are
in need of a function that can respect custom subset overloading, I can address
those feature requests if/when they are submitted.
- Backwards compatibility?>From current testing, this is a drop-in replacement for `stats::dendrapply`.
R builds successfully, and all >400 tests in the CRAN package that uses
`dendrapply` the most (dendextend) pass with no changes from the original. The
additional argument `how=c('pre.order', 'post.order')` is the
same syntax as `rapply`, and additional documentation has been added to the
`dendrapply.Rd` to reflect this change. This is still an unfinished TODO; the
internal R testing for `dendrapply` is very sparse. I haven't been able to
find any differences between stats::dendrapply and this implementation, but I am
planning to run a full check against all CRAN packages that use `dendrapply`.
I'm also planning to add additional regression testing to R either as part
of this patch in a separate patch.
- You mentioned there was more to the listed '4x improvement'
Yes. I haven't yet put together a comprehensive benchmark for highly
unbalanced trees, and in truth there are so many possible tree structures that
it would be challenging to test them all. However, on trees with 5 leaves the
performance is roughly identical to that of `stats`, and benchmarking with
`microbenchmark` demonstrates performance gains of roughly 5x on fully balanced
trees with 10-5000 leaves. This should be a lower bound for performance, since
fully balanced trees minimize internal nodes and thus have less recursion...so
on reasonably sized trees of arbitrary structure we should have at least around
a 4x improvement. I'll also stress that the focus of this patch is not a
runtime improvement--it's nice that we get a speedup, but the added value
here is the removal of recursive calls.
- Why not just put this in a package?
I think there's value in fleshing out the structures included in base R.
Dendrograms are a general tree structure, and few programming languages provide
support for these out of the box. Dendrogram objects are currently rarely used,
but with a little bit of additional functionality, they could be a very powerful
tool for users. These have applications in a variety of fields that are not just
phylogenetics; implementations of domain-specific tools (e.g. `ape`, `DECIPHER`)
are better suited to 3rd party packages. However, `dendrograms` already exist in
base and have poor support, which is even admitted in their help files.
`dendrapply`. While it's currently limited to acting on dendrogram objects,
a solid implementation would open the door to generalizing dendrapply to work on
any nested list. This is my personal opinion, and there is certainly an argument
to be made that `dendrapply` (and even `dendrogram` as a whole) could live
outside of base.
- Why/how were the included traversal strategies chosen?
The default, pre.order, was chosen because it replicates existing functionality.
post.order was included because it lends itself well to a lot of applications.
Between these two methods, we have a way to apply a function to trees ensuring
that parents are evaluated before children, and ensuring children are evaluated
before parents. Future extensions to support an in.order or BFS/level.order
traversal is definitely an option, but I don't think the added
implementation effort and complexity adds a lot of functionality over the two
that have been included.
There's also been previous comments regarding the structure of dendrograms.
While hclust will return a bifurcating tree, dendrogram objects (and dendrapply)
support arbitrary multifurcations and edge weights.
Apologies for the lengthy email. If you've read even half, thanks for your
time. There?s likely some optimizations that could be made in the C code dealing
with R structures; I?m still learning the intricacies of some of the more
specific points of this. One that comes to mind is converting the repeated
`lang2` calls to instead initialize a `lang2` call and then change the symbol in
the call, as is done in `lapply` (and was mentioned by Ivan in my last
submission). I?ll test that as well.
Further feedback is welcome and much appreciated.
-Aidan
-----------------------
Aidan Lakshman (he/him)<https://www.ahl27.com/>
Doctoral Fellow, Wright Lab<https://www.wrightlabscience.com/>
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ahl27 at pitt.edu
(724) 612-9940
[[alternative HTML version deleted]]