When running R CMD check on a package, we are encountering an error on R-devel (as of 72457) on Linux. Unfortunately, it is very hard to reproduce, and almost any change to the code makes the error go away. I believe that this is due to a bug in R-devel, which has been present since at least commit 72128 (on 2017-02-06). The test error occurs when R CMD check is run on Travis CI (on Ubuntu 12.04.05), and locally on R-devel (on Debian testing). If we make any of the following changes, the error does not occur, and the test passes: - Run the test scripts from an interactive R session - Run on a Mac instead of Linux - Use R 3.3.3 or 3.2.5 - Add code right before the tests. For example, just running gc() or format(1) right before the test causes it to not fail. However, adding very simple code like 1+1 does not affect the test (it still fails). - Run R CMD check with -d gdb or -d valgrind Because the test is so sensitive to any changes, we've been unable to track down the source of the problem. We can't be sure, but the error message suggests that a promise is getting evaluated with the wrong environment. Here's the relevant output from the tests: 1. Error: pool scheduler: schedules things in the right order (@test-scheduling.R#13) could not find function "task" 1: naiveScheduler$protect({ scheduleTask(1000, function() { results <<- c(results, 3L) }) scheduleTask(100, function() { results <<- c(results, 2L) }) scheduleTask(10, function() { results <<- c(results, 1L) }) }) at testthat/test-scheduling.R:13 2: private$refCount$release() at testthat/test-scheduling.R:13 3: private$callback() 4: self$protect({ task() }) 5: private$refCount$release() 6: private$callback() 7: self$protect({ task() }) 8: force(expr) You can see complete logs on Travis here: https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771 In the test, a re-entrant function is called (self$protect, at levels 4 and 7 on the call stack), and in the inner call (7), a variable, `task`, does not have the correct value. It should be a function, but it isn't; it's either NULL or some other non-function value. That code is here: https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81 Note that there's a `force(task)` a couple lines above, which is commented out. If that is uncommented, the test passes, even though calling force() there shouldn't make a difference. Also, calling `format(1)` at that location also causes the test to pass. I've written instructions to reproduce the problem in a Docker container here: https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f If you want to try reproducing the error without using Docker, I believe if you download pool_0.1.0.tar.gz file from the gist.github.com URL above, and run R CMD check on it, that will do it. Some upstream dependencies will likely need to be installed; they are listed in the code in the gist. We also used git bisect to try to find the first commit in R where the test would fail. The instructions for doing this are also in the gist linked to above. It found that the first commit where the test failed was 72128: https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82226644953d Taking a look at the changes in that commit, however, it seems very unlikely that it caused the error. Instead, it probably just made a change which allowed an already-existing bug to manifest. We'd appreciate any help finding the cause of this bug! -Winston
Small correction to my previous email: the error can also be replicated by installing the package and then sourcing the tests. This is slightly faster than running R CMD check. It is also how we ran the tests with gdb. This will run the tests in a Docker container and raise the same error. (To run outside of a Docker container, simply leave out the first line). docker run --rm -ti pooltest /bin/bash RD CMD INSTALL pool_0.1.0.tar.gz --install-tests cd /usr/local/lib/R/site-library/pool/tests RD -e 'source("testthat.R")' At the suggestion of Tomas Kalibera, I ran the tests with gctorture(TRUE). When I do this, I get three errors instead of just one: =====================# RD -e 'gctorture(TRUE); source("testthat.R")' R Under development (unstable) (2017-03-23 r72389) -- "Unsuffered Consequences" Copyright (C) 2017 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.> gctorture(TRUE); source("testthat.R")1. Failure: pool: checks for leaks (named) (@test-leak-detection.R#21) --------- gc() showed 0 warnings 2. Failure: pool scheduler: schedules things in the right order (@test-scheduling.R#24) `results` not identical to 1:3. 2/3 mismatches (average diff: 1) [1] 2 - 1 == 1 [2] 1 - 2 == -1 3. Failure: pool scheduler: works with pool (@test-scheduling.R#31) ------------ pool$counters$free not identical to 1. 1/1 mismatches [1] 0 - 1 == -1 testthat results ===============================================================OK: 199 SKIPPED: 0 FAILED: 3 1. Failure: pool: checks for leaks (named) (@test-leak-detection.R#21) 2. Failure: pool scheduler: schedules things in the right order (@test-scheduling.R#24) 3. Failure: pool scheduler: works with pool (@test-scheduling.R#31) Error: testthat unit tests failed Execution halted ===================== On Mon, Apr 3, 2017 at 10:34 AM, Winston Chang <winstonchang1 at gmail.com> wrote:> When running R CMD check on a package, we are encountering an error on > R-devel (as of 72457) on Linux. Unfortunately, it is very hard to > reproduce, and almost any change to the code makes the error go away. > I believe that this is due to a bug in R-devel, which has been present > since at least commit 72128 (on 2017-02-06). > > The test error occurs when R CMD check is run on Travis CI (on Ubuntu > 12.04.05), and locally on R-devel (on Debian testing). If we make any > of the following changes, the error does not occur, and the test > passes: > - Run the test scripts from an interactive R session > - Run on a Mac instead of Linux > - Use R 3.3.3 or 3.2.5 > - Add code right before the tests. For example, just running gc() or > format(1) right before the test causes it to not fail. However, adding > very simple code like 1+1 does not affect the test (it still fails). > - Run R CMD check with -d gdb or -d valgrind > > Because the test is so sensitive to any changes, we've been unable to > track down the source of the problem. We can't be sure, but the error > message suggests that a promise is getting evaluated with the wrong > environment. > > Here's the relevant output from the tests: > 1. Error: pool scheduler: schedules things in the right order > (@test-scheduling.R#13) > could not find function "task" > 1: naiveScheduler$protect({ > scheduleTask(1000, function() { > results <<- c(results, 3L) > }) > scheduleTask(100, function() { > results <<- c(results, 2L) > }) > scheduleTask(10, function() { > results <<- c(results, 1L) > }) > }) at testthat/test-scheduling.R:13 > 2: private$refCount$release() at testthat/test-scheduling.R:13 > 3: private$callback() > 4: self$protect({ > task() > }) > 5: private$refCount$release() > 6: private$callback() > 7: self$protect({ > task() > }) > 8: force(expr) > > You can see complete logs on Travis here: > https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771 > > In the test, a re-entrant function is called (self$protect, at levels > 4 and 7 on the call stack), and in the inner call (7), a variable, > `task`, does not have the correct value. It should be a function, but > it isn't; it's either NULL or some other non-function value. > > That code is here: > https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81 > Note that there's a `force(task)` a couple lines above, which is > commented out. If that is uncommented, the test passes, even though > calling force() there shouldn't make a difference. Also, calling > `format(1)` at that location also causes the test to pass. > > > I've written instructions to reproduce the problem in a Docker container here: > https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f > > If you want to try reproducing the error without using Docker, I > believe if you download pool_0.1.0.tar.gz file from the > gist.github.com URL above, and run R CMD check on it, that will do it. > Some upstream dependencies will likely need to be installed; they are > listed in the code in the gist. > > > We also used git bisect to try to find the first commit in R where the > test would fail. The instructions for doing this are also in the gist > linked to above. It found that the first commit where the test failed > was 72128: > https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82226644953d > > Taking a look at the changes in that commit, however, it seems very > unlikely that it caused the error. Instead, it probably just made a > change which allowed an already-existing bug to manifest. > > > We'd appreciate any help finding the cause of this bug! > -Winston
> > >>I've done some more investigation into the problem, and it is very difficult to pin down. What it looks like is happening is roughly like this: - `p` is an environment and `p$e` is also an environment. - There is a loop. In each iteration, it looks for one item in `p$e`, saves it in a variable `x`, then removes that item from `p$e`. Then it invokes `x()`. The loop runs again, until there are no more items in `p$e`. The problem is that `ls(p$e)` sometimes returns the wrong values -- it returns the values that it had in previous iterations of the loop. The behavior is very touchy. Almost any change to the code will slightly change the behavior; sometimes the `ls()` returns values from a different iteration of the loop, and sometimes the problem doesn't happen at all. I've put a Dockerfile and instructions for reproducing the problem here: https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88 I think that I've gotten about as far with this as I can, though I'd be happy to provide more information if anyone wants to take look at the problem. -Winston [[alternative HTML version deleted]]
With Luke Tierney's assistance, we found that the root cause of the problem was not R, but a race condition in the pool package that occurred when a finalizer ran. Below is a very contrived and simplified example to illustrate: gc() n <- 1 increment_n <- function() { cat(n, "-", sep = "") # Do some stuff that allocates memory. Sometimes this triggers a GC. lapply(rnorm(2e4), mean) cat(n, "\n") n <<- n + 1 } # Create an object, and add a finalizer that simply calls increment_n e <- new.env() reg.finalizer(e, function(e) { increment_n() }) rm(e) for (i in 1:5) { increment_n() } On the surface, it looks like this should print out "1-1", "2-2", and so on, on separate lines. But when I run it (on my Mac), it outputs this: 1-1 2-2 3-3-3 4 5-5 6-6 On the positive side, there are some lessons that can be taken from this. If you encounter a seemingly random and hard-to-diagnose problem, it could be due to a finalizer. Most of us think about our R code as straight line code that runs on a single thread. We don't even consider things like race conditions, since they are problems usually encountered in multithreaded environments. But finalizers violate this mental model. Because they can unexpectedly interrupt your code and execute other code, you can end up with some of the same problems that you would face in a multithreaded environment. Thanks to everyone who spent time helping to diagnose the problem! -Winston On Mon, Apr 3, 2017 at 10:34 AM, Winston Chang <winstonchang1 at gmail.com> wrote:> When running R CMD check on a package, we are encountering an error on > R-devel (as of 72457) on Linux. Unfortunately, it is very hard to > reproduce, and almost any change to the code makes the error go away. > I believe that this is due to a bug in R-devel, which has been present > since at least commit 72128 (on 2017-02-06). > > The test error occurs when R CMD check is run on Travis CI (on Ubuntu > 12.04.05), and locally on R-devel (on Debian testing). If we make any > of the following changes, the error does not occur, and the test > passes: > - Run the test scripts from an interactive R session > - Run on a Mac instead of Linux > - Use R 3.3.3 or 3.2.5 > - Add code right before the tests. For example, just running gc() or > format(1) right before the test causes it to not fail. However, adding > very simple code like 1+1 does not affect the test (it still fails). > - Run R CMD check with -d gdb or -d valgrind > > Because the test is so sensitive to any changes, we've been unable to > track down the source of the problem. We can't be sure, but the error > message suggests that a promise is getting evaluated with the wrong > environment. > > Here's the relevant output from the tests: > 1. Error: pool scheduler: schedules things in the right order > (@test-scheduling.R#13) > could not find function "task" > 1: naiveScheduler$protect({ > scheduleTask(1000, function() { > results <<- c(results, 3L) > }) > scheduleTask(100, function() { > results <<- c(results, 2L) > }) > scheduleTask(10, function() { > results <<- c(results, 1L) > }) > }) at testthat/test-scheduling.R:13 > 2: private$refCount$release() at testthat/test-scheduling.R:13 > 3: private$callback() > 4: self$protect({ > task() > }) > 5: private$refCount$release() > 6: private$callback() > 7: self$protect({ > task() > }) > 8: force(expr) > > You can see complete logs on Travis here: > https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771 > > In the test, a re-entrant function is called (self$protect, at levels > 4 and 7 on the call stack), and in the inner call (7), a variable, > `task`, does not have the correct value. It should be a function, but > it isn't; it's either NULL or some other non-function value. > > That code is here: > https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81 > Note that there's a `force(task)` a couple lines above, which is > commented out. If that is uncommented, the test passes, even though > calling force() there shouldn't make a difference. Also, calling > `format(1)` at that location also causes the test to pass. > > > I've written instructions to reproduce the problem in a Docker container > here: > https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f > > If you want to try reproducing the error without using Docker, I > believe if you download pool_0.1.0.tar.gz file from the > gist.github.com URL above, and run R CMD check on it, that will do it. > Some upstream dependencies will likely need to be installed; they are > listed in the code in the gist. > > > We also used git bisect to try to find the first commit in R where the > test would fail. The instructions for doing this are also in the gist > linked to above. It found that the first commit where the test failed > was 72128: > https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82 > 226644953d > > Taking a look at the changes in that commit, however, it seems very > unlikely that it caused the error. Instead, it probably just made a > change which allowed an already-existing bug to manifest. > > > We'd appreciate any help finding the cause of this bug! > -Winston >[[alternative HTML version deleted]]