When running R CMD check on a package, we are encountering an error on
R-devel (as of 72457) on Linux. Unfortunately, it is very hard to
reproduce, and almost any change to the code makes the error go away.
I believe that this is due to a bug in R-devel, which has been present
since at least commit 72128 (on 2017-02-06).
The test error occurs when R CMD check is run on Travis CI (on Ubuntu
12.04.05), and locally on R-devel (on Debian testing). If we make any
of the following changes, the error does not occur, and the test
passes:
- Run the test scripts from an interactive R session
- Run on a Mac instead of Linux
- Use R 3.3.3 or 3.2.5
- Add code right before the tests. For example, just running gc() or
format(1) right before the test causes it to not fail. However, adding
very simple code like 1+1 does not affect the test (it still fails).
- Run R CMD check with -d gdb or -d valgrind
Because the test is so sensitive to any changes, we've been unable to
track down the source of the problem. We can't be sure, but the error
message suggests that a promise is getting evaluated with the wrong
environment.
Here's the relevant output from the tests:
1. Error: pool scheduler: schedules things in the right order
(@test-scheduling.R#13)
could not find function "task"
1: naiveScheduler$protect({
scheduleTask(1000, function() {
results <<- c(results, 3L)
})
scheduleTask(100, function() {
results <<- c(results, 2L)
})
scheduleTask(10, function() {
results <<- c(results, 1L)
})
}) at testthat/test-scheduling.R:13
2: private$refCount$release() at testthat/test-scheduling.R:13
3: private$callback()
4: self$protect({
task()
})
5: private$refCount$release()
6: private$callback()
7: self$protect({
task()
})
8: force(expr)
You can see complete logs on Travis here:
https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771
In the test, a re-entrant function is called (self$protect, at levels
4 and 7 on the call stack), and in the inner call (7), a variable,
`task`, does not have the correct value. It should be a function, but
it isn't; it's either NULL or some other non-function value.
That code is here:
https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81
Note that there's a `force(task)` a couple lines above, which is
commented out. If that is uncommented, the test passes, even though
calling force() there shouldn't make a difference. Also, calling
`format(1)` at that location also causes the test to pass.
I've written instructions to reproduce the problem in a Docker container
here:
https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f
If you want to try reproducing the error without using Docker, I
believe if you download pool_0.1.0.tar.gz file from the
gist.github.com URL above, and run R CMD check on it, that will do it.
Some upstream dependencies will likely need to be installed; they are
listed in the code in the gist.
We also used git bisect to try to find the first commit in R where the
test would fail. The instructions for doing this are also in the gist
linked to above. It found that the first commit where the test failed
was 72128:
https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82226644953d
Taking a look at the changes in that commit, however, it seems very
unlikely that it caused the error. Instead, it probably just made a
change which allowed an already-existing bug to manifest.
We'd appreciate any help finding the cause of this bug!
-Winston
Small correction to my previous email: the error can also be
replicated by installing the package and then sourcing the tests. This
is slightly faster than running R CMD check. It is also how we ran the
tests with gdb. This will run the tests in a Docker container and
raise the same error. (To run outside of a Docker container, simply
leave out the first line).
docker run --rm -ti pooltest /bin/bash
RD CMD INSTALL pool_0.1.0.tar.gz --install-tests
cd /usr/local/lib/R/site-library/pool/tests
RD -e 'source("testthat.R")'
At the suggestion of Tomas Kalibera, I ran the tests with
gctorture(TRUE). When I do this, I get three errors instead of just
one:
=====================# RD -e 'gctorture(TRUE);
source("testthat.R")'
R Under development (unstable) (2017-03-23 r72389) -- "Unsuffered
Consequences"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> gctorture(TRUE); source("testthat.R")
1. Failure: pool: checks for leaks (named) (@test-leak-detection.R#21) ---------
gc() showed 0 warnings
2. Failure: pool scheduler: schedules things in the right order
(@test-scheduling.R#24)
`results` not identical to 1:3.
2/3 mismatches (average diff: 1)
[1] 2 - 1 == 1
[2] 1 - 2 == -1
3. Failure: pool scheduler: works with pool (@test-scheduling.R#31) ------------
pool$counters$free not identical to 1.
1/1 mismatches
[1] 0 - 1 == -1
testthat results
===============================================================OK: 199 SKIPPED:
0 FAILED: 3
1. Failure: pool: checks for leaks (named) (@test-leak-detection.R#21)
2. Failure: pool scheduler: schedules things in the right order
(@test-scheduling.R#24)
3. Failure: pool scheduler: works with pool (@test-scheduling.R#31)
Error: testthat unit tests failed
Execution halted
=====================
On Mon, Apr 3, 2017 at 10:34 AM, Winston Chang <winstonchang1 at
gmail.com> wrote:> When running R CMD check on a package, we are encountering an error on
> R-devel (as of 72457) on Linux. Unfortunately, it is very hard to
> reproduce, and almost any change to the code makes the error go away.
> I believe that this is due to a bug in R-devel, which has been present
> since at least commit 72128 (on 2017-02-06).
>
> The test error occurs when R CMD check is run on Travis CI (on Ubuntu
> 12.04.05), and locally on R-devel (on Debian testing). If we make any
> of the following changes, the error does not occur, and the test
> passes:
> - Run the test scripts from an interactive R session
> - Run on a Mac instead of Linux
> - Use R 3.3.3 or 3.2.5
> - Add code right before the tests. For example, just running gc() or
> format(1) right before the test causes it to not fail. However, adding
> very simple code like 1+1 does not affect the test (it still fails).
> - Run R CMD check with -d gdb or -d valgrind
>
> Because the test is so sensitive to any changes, we've been unable to
> track down the source of the problem. We can't be sure, but the error
> message suggests that a promise is getting evaluated with the wrong
> environment.
>
> Here's the relevant output from the tests:
> 1. Error: pool scheduler: schedules things in the right order
> (@test-scheduling.R#13)
> could not find function "task"
> 1: naiveScheduler$protect({
> scheduleTask(1000, function() {
> results <<- c(results, 3L)
> })
> scheduleTask(100, function() {
> results <<- c(results, 2L)
> })
> scheduleTask(10, function() {
> results <<- c(results, 1L)
> })
> }) at testthat/test-scheduling.R:13
> 2: private$refCount$release() at testthat/test-scheduling.R:13
> 3: private$callback()
> 4: self$protect({
> task()
> })
> 5: private$refCount$release()
> 6: private$callback()
> 7: self$protect({
> task()
> })
> 8: force(expr)
>
> You can see complete logs on Travis here:
> https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771
>
> In the test, a re-entrant function is called (self$protect, at levels
> 4 and 7 on the call stack), and in the inner call (7), a variable,
> `task`, does not have the correct value. It should be a function, but
> it isn't; it's either NULL or some other non-function value.
>
> That code is here:
> https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81
> Note that there's a `force(task)` a couple lines above, which is
> commented out. If that is uncommented, the test passes, even though
> calling force() there shouldn't make a difference. Also, calling
> `format(1)` at that location also causes the test to pass.
>
>
> I've written instructions to reproduce the problem in a Docker
container here:
> https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f
>
> If you want to try reproducing the error without using Docker, I
> believe if you download pool_0.1.0.tar.gz file from the
> gist.github.com URL above, and run R CMD check on it, that will do it.
> Some upstream dependencies will likely need to be installed; they are
> listed in the code in the gist.
>
>
> We also used git bisect to try to find the first commit in R where the
> test would fail. The instructions for doing this are also in the gist
> linked to above. It found that the first commit where the test failed
> was 72128:
>
https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82226644953d
>
> Taking a look at the changes in that commit, however, it seems very
> unlikely that it caused the error. Instead, it probably just made a
> change which allowed an already-existing bug to manifest.
>
>
> We'd appreciate any help finding the cause of this bug!
> -Winston
> > >>I've done some more investigation into the problem, and it is very difficult to pin down. What it looks like is happening is roughly like this: - `p` is an environment and `p$e` is also an environment. - There is a loop. In each iteration, it looks for one item in `p$e`, saves it in a variable `x`, then removes that item from `p$e`. Then it invokes `x()`. The loop runs again, until there are no more items in `p$e`. The problem is that `ls(p$e)` sometimes returns the wrong values -- it returns the values that it had in previous iterations of the loop. The behavior is very touchy. Almost any change to the code will slightly change the behavior; sometimes the `ls()` returns values from a different iteration of the loop, and sometimes the problem doesn't happen at all. I've put a Dockerfile and instructions for reproducing the problem here: https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88 I think that I've gotten about as far with this as I can, though I'd be happy to provide more information if anyone wants to take look at the problem. -Winston [[alternative HTML version deleted]]
With Luke Tierney's assistance, we found that the root cause of the problem
was not R, but a race condition in the pool package that occurred when a
finalizer ran.
Below is a very contrived and simplified example to illustrate:
gc()
n <- 1
increment_n <- function() {
cat(n, "-", sep = "")
# Do some stuff that allocates memory. Sometimes this triggers a GC.
lapply(rnorm(2e4), mean)
cat(n, "\n")
n <<- n + 1
}
# Create an object, and add a finalizer that simply calls increment_n
e <- new.env()
reg.finalizer(e, function(e) { increment_n() })
rm(e)
for (i in 1:5) {
increment_n()
}
On the surface, it looks like this should print out "1-1",
"2-2", and so
on, on separate lines. But when I run it (on my Mac), it outputs this:
1-1
2-2
3-3-3
4
5-5
6-6
On the positive side, there are some lessons that can be taken from this.
If you encounter a seemingly random and hard-to-diagnose problem, it could
be due to a finalizer. Most of us think about our R code as straight line
code that runs on a single thread. We don't even consider things like race
conditions, since they are problems usually encountered in multithreaded
environments. But finalizers violate this mental model. Because they can
unexpectedly interrupt your code and execute other code, you can end up
with some of the same problems that you would face in a multithreaded
environment.
Thanks to everyone who spent time helping to diagnose the problem!
-Winston
On Mon, Apr 3, 2017 at 10:34 AM, Winston Chang <winstonchang1 at
gmail.com>
wrote:
> When running R CMD check on a package, we are encountering an error on
> R-devel (as of 72457) on Linux. Unfortunately, it is very hard to
> reproduce, and almost any change to the code makes the error go away.
> I believe that this is due to a bug in R-devel, which has been present
> since at least commit 72128 (on 2017-02-06).
>
> The test error occurs when R CMD check is run on Travis CI (on Ubuntu
> 12.04.05), and locally on R-devel (on Debian testing). If we make any
> of the following changes, the error does not occur, and the test
> passes:
> - Run the test scripts from an interactive R session
> - Run on a Mac instead of Linux
> - Use R 3.3.3 or 3.2.5
> - Add code right before the tests. For example, just running gc() or
> format(1) right before the test causes it to not fail. However, adding
> very simple code like 1+1 does not affect the test (it still fails).
> - Run R CMD check with -d gdb or -d valgrind
>
> Because the test is so sensitive to any changes, we've been unable to
> track down the source of the problem. We can't be sure, but the error
> message suggests that a promise is getting evaluated with the wrong
> environment.
>
> Here's the relevant output from the tests:
> 1. Error: pool scheduler: schedules things in the right order
> (@test-scheduling.R#13)
> could not find function "task"
> 1: naiveScheduler$protect({
> scheduleTask(1000, function() {
> results <<- c(results, 3L)
> })
> scheduleTask(100, function() {
> results <<- c(results, 2L)
> })
> scheduleTask(10, function() {
> results <<- c(results, 1L)
> })
> }) at testthat/test-scheduling.R:13
> 2: private$refCount$release() at testthat/test-scheduling.R:13
> 3: private$callback()
> 4: self$protect({
> task()
> })
> 5: private$refCount$release()
> 6: private$callback()
> 7: self$protect({
> task()
> })
> 8: force(expr)
>
> You can see complete logs on Travis here:
> https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771
>
> In the test, a re-entrant function is called (self$protect, at levels
> 4 and 7 on the call stack), and in the inner call (7), a variable,
> `task`, does not have the correct value. It should be a function, but
> it isn't; it's either NULL or some other non-function value.
>
> That code is here:
> https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81
> Note that there's a `force(task)` a couple lines above, which is
> commented out. If that is uncommented, the test passes, even though
> calling force() there shouldn't make a difference. Also, calling
> `format(1)` at that location also causes the test to pass.
>
>
> I've written instructions to reproduce the problem in a Docker
container
> here:
> https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f
>
> If you want to try reproducing the error without using Docker, I
> believe if you download pool_0.1.0.tar.gz file from the
> gist.github.com URL above, and run R CMD check on it, that will do it.
> Some upstream dependencies will likely need to be installed; they are
> listed in the code in the gist.
>
>
> We also used git bisect to try to find the first commit in R where the
> test would fail. The instructions for doing this are also in the gist
> linked to above. It found that the first commit where the test failed
> was 72128:
> https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82
> 226644953d
>
> Taking a look at the changes in that commit, however, it seems very
> unlikely that it caused the error. Instead, it probably just made a
> change which allowed an already-existing bug to manifest.
>
>
> We'd appreciate any help finding the cause of this bug!
> -Winston
>
[[alternative HTML version deleted]]