René J.V. Bertin
2006-Oct-12 16:08 UTC
[R] multithreading calling from the rpy Python package
Hello,

I don't know if this question ought to go here, or rather on R-devel, so please bear with me.

I'm interfacing to R via RPy (rpy.sf.net) and an embedded Python interpreter. This is really quite convenient.

I use this approach to calculate the correlation coefficient of 1 independent dataset (vector) with 4 dependent vectors. It'd be nice if that could be done in 4 parallel threads, or even two.

As long as I stick to pure Python code (using equivalents to R routines that can be found in NumPy and SciPy), this works fine. (Tested on a single-core machine.) However, when I call R functions through rpy, a crash will occur at some point, with the error

    *** caught segfault ***
    address 0x5164000, cause 'memory not mapped'

(this is on Mac OS X 10.4.8), somewhere in Rf_eval:

    Thread 4 Crashed:
    0     libR.dylib        0x03676af0 Rf_eval + 128
    1     libR.dylib        0x03676e6c Rf_eval + 1020
    2     libR.dylib        0x03677108 Rf_eval + 1688
    3     libR.dylib        0x03676e6c Rf_eval + 1020
    4     libR.dylib        0x03677108 Rf_eval + 1688
    5     libR.dylib        0x03676e6c Rf_eval + 1020
    6     libR.dylib        0x03677108 Rf_eval + 1688
    7     libR.dylib        0x03678144 Rf_evalList + 148
    8     libR.dylib        0x036bb5cc do_internal + 796
    9     libR.dylib        0x03676fbc Rf_eval + 1356
    10    libR.dylib        0x0367ad10 Rf_applyClosure + 1120
    11    libR.dylib        0x03676e3c Rf_eval + 972
    12    libR.dylib        0x0367ad10 Rf_applyClosure + 1120
    13    libR.dylib        0x03676e3c Rf_eval + 972
    14    libR.dylib        0x0367a110 do_if + 48
    15    libR.dylib        0x03676fbc Rf_eval + 1356
    16    libR.dylib        0x0367932c do_begin + 108
    17    libR.dylib        0x03676fbc Rf_eval + 1356
    18    libR.dylib        0x0367ad10 Rf_applyClosure + 1120
    19    libR.dylib        0x03676e3c Rf_eval + 972
    20    libR.dylib        0x0361b7c0 protectedEval + 64
    21    libR.dylib        0x0361c170 R_ToplevelExec + 544
    22    libR.dylib        0x0361c22c R_tryEval + 60
    23    _rpy2031.so       0x032f0b8c do_eval_expr + 108
    >> 24 _rpy2031.so       0x032ef950 Robj_call + 688
    25    Python2.5         0x023c6c08 PyObject_Call + 56
    26    Python2.5         0x024a68ec PyEval_EvalFrameEx + 16844
    27    Python2.5         0x024a8cf8 PyEval_EvalFrameEx + 26072
    28    Python2.5         0x024aaef8 PyEval_EvalCodeEx + 3512
    29    Python2.5         0x024a7ce0 PyEval_EvalFrameEx + 21952
    30    Python2.5         0x024a8cf8 PyEval_EvalFrameEx + 26072
    31    Python2.5         0x024aaef8 PyEval_EvalCodeEx + 3512
    32    Python2.5         0x023fbb88 function_call + 472
    33    Python2.5         0x023c6c08 PyObject_Call + 56
    34    Python2.5         0x023d3294 instancemethod_call + 388
    35    Python2.5         0x023c6c08 PyObject_Call + 56
    36    Python2.5         0x024a0cf4 PyEval_CallObjectWithKeywords + 276
    37    Python2.5         0x024f244c t_bootstrap + 60
    38    libSystem.B.dylib 0x9002b508 _pthread_body + 96

Is this because R itself isn't thread-safe, or maybe the R code I'm calling? I've found discussions on "why should we make R thread-safe and how" on the website, but there appears to be no date on these documents.

The R/Python wrapper functions I'm using:

    # a variance calculator that returns 0 for vectors that have only 1
    # non-NaN element:
    def vvar(a):
        v = rpy.r.var(a, na_rm=True)
        if isnan(v):
            return 0
        return v

    # Calculate the Spearman Rho correlation between a and b and return the
    # result as scipy.stats.stats.spearmanr() does:
    R_spearmanr = rpy.r('function(a,b){ kk<-cor.test(a,b,method="spearman"); c( kk$estimate[[1]], kk$p.value) ; }')

I'm taking care to make copies of the arrays I'm correlating when initialising the threads. (I can post more of the Python code, if required.) I'm using R 2.3.1.

Thanks in advance,
René

(as always, please CC me on replies sent to the list, thanks!)
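[A common first-aid measure for embedding a non-thread-safe interpreter is to serialize every call to it behind one lock. A minimal sketch follows; `fake_r_var` is a hypothetical stand-in for an `rpy.r.var(...)` call so the sketch runs without R installed, and note that even full serialization only helps if R tolerates being entered from different OS threads at all, which later replies in this thread suggest it may not.]

```python
import threading

# Serialize all access to the (non-thread-safe) embedded interpreter
# through a single lock shared by every worker thread.
r_lock = threading.Lock()

def fake_r_var(vec):
    # hypothetical placeholder for rpy.r.var(vec, na_rm=True):
    # sample variance, computed in pure Python
    m = sum(vec) / len(vec)
    return sum((x - m) ** 2 for x in vec) / (len(vec) - 1)

results = [None] * 4

def worker(i, data):
    with r_lock:  # only one thread may touch the interpreter at a time
        results[i] = fake_r_var(data)

threads = [threading.Thread(target=worker, args=(i, [1.0, 2.0, 3.0]))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place the four workers still run, but their "R" calls execute strictly one after another, which removes the concurrency that caused the crash at the cost of any parallel speed-up.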
Duncan Temple Lang
2006-Oct-12 16:43 UTC
[R] multithreading calling from the rpy Python package
[Taken from below]

> Is this because R itself isn't thread-safe, or maybe the R code I'm
> calling? I've found discussions on "why should we make R thread-safe
> and how" on the website, but there appears to be no date on these
> documents.

It is a mixture of two things. Yes, R is not thread-safe, so if two system threads were to access R concurrently, bad things would happen a.s.

It is also an issue when Python is compiled and linked with threaded options and routines from the system, e.g. libpthread, and R is not. When R is dynamically loaded into the Python process, unless R is very carefully compiled, symbols (i.e. routines) that R uses will come from the Python executable, and these may not agree with R's view at compilation. And bad things happen. This depends on your operating system, and it doesn't appear that you have told us what that is. Bad boy :-)

This is an issue with RPy, RSPython, RSPerl, the R Apache module, rJava, ...

I have started down the road of making R thread-safe and threaded on several occasions. I have not committed these extensive changes for a variety of reasons. One is that a lot of R internals would change, and this would have an impact on packages with native code. So we need a way to, at least partially, automate this for package authors. I am making a lot of progress on that front recently with the RGCCTranslationUnit package, which allows us to examine C/C++ code from within R.

[The following is definitely for R-devel, so anyone replying, please remove r-help and cc r-devel at r-project.org]

One of the issues that also makes me hesitate in doing this is whether we shouldn't take the time to introduce additional extensive changes in the architecture of an R-like interpreter, e.g. make it extensible at the native level. For stat. computing to continue to grow, and for all of us to be able to explore newer areas, we probably need to think about building infrastructure for the next 5-10 years and not continue to tweak a model that has been around for 30 years. How we do this requires some serious thought and evaluating trade-offs of building things ourselves with a small community or leveraging other existing or emerging systems, e.g. Python, Perl6/Parrot, etc.

My $.02,
D.

--
Duncan Temple Lang                      duncan at wald.ucdavis.edu
Department of Statistics                work: (530) 752-4782
4210 Mathematical Sciences Building     fax:  (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
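[A stricter workaround consistent with the "only one thread may ever access R" constraint described above is to confine all evaluation to a single dedicated thread and funnel requests to it through a queue. A sketch, with the submitted function standing in for what would be an `rpy.r.*` call in real code:]

```python
import queue
import threading

# All "R" evaluation happens on one dedicated worker thread; other
# threads submit (function, args, reply-queue) tuples and block on
# the reply, so the embedded interpreter is only ever entered from
# a single OS thread.
requests = queue.Queue()

def r_worker():
    while True:
        item = requests.get()
        if item is None:          # sentinel: shut the worker down
            break
        func, args, reply = item
        reply.put(func(*args))    # a real rpy.r.* call would run here

def call_r(func, *args):
    reply = queue.Queue(maxsize=1)
    requests.put((func, args, reply))
    return reply.get()            # block until the worker answers

worker = threading.Thread(target=r_worker, daemon=True)
worker.start()
result = call_r(lambda a, b: a + b, 2, 3)
requests.put(None)
worker.join()
```

This buys safety rather than speed: calls are still sequential, but no symbol-visibility or reentrancy assumption inside R is ever violated by a second thread.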
René J.V. Bertin
2006-Oct-12 17:12 UTC
[Rd] [R] multithreading calling from the rpy Python package
Thanks, Duncan.

> It is a mixture of two things. Yes, R is not thread safe so if
> two system threads were to access R concurrently, bad things would
> happen a.s.

That's clear, yes. :-/ And a pity, but so be it.

> It is also an issue when Python is compiled and linked with
> threaded options and routines from the system, e.g. libpthread
> and R is not. When R is dynamically loaded into the Python
> process, unless R is very carefully compiled, symbols (i.e. routines)

I built Python with --enable-threads, but I don't think R has a build option for this?

> that R uses will come from the Python executable and these may not
> agree with R's view at compilation. And bad things happen.

But that would also happen in single-threaded applications, and it doesn't. Unless I'm understanding you wrong...

> This depends on your operating system, and it doesn't appear that
> you have told us what that is. Bad boy :-)

Indeed it depends on the OS. Read again: it says (somewhere...) that I'm using Mac OS X 10.4.8 :P And under that OS, symbols are not visible by default across shared libraries.

> This is an issue with Rpy, RSPython, RSPerl, R apache module, rJava, ...

RPy only allows the creation of a single R "instance". Supposing it were possible, it probably wouldn't help to create as many instances as there are threads, right? The "memory not mapped" error message suggests one thread tries to access memory that was just freed by another thread. A bit surprising, maybe, that this happens in a function that appears to be intended to be recursive (judging from the traceback). As far as I understand, thread-safe means re-entrant, which means recursive-safe too...

...

> e.g. make it extensible at the native level. For stat. computing
> to continue to grow and for all of us to be able to explore newer
> areas, we probably need to think about building infrastructure for the
> next 5-10 years and not continue to tweak a model that has been around
> for 30 years. How we do this requires some serious thought

I can't agree more, but have no suggestions...

> and evaluating trade-offs of building things ourselves with a small
> community or leveraging other existing or emerging systems, e.g. Python,
> Perl6/Parrot, etc.

Well, Python is great, and NumPy and SciPy allow one to do serious work, but there are things in which R has a clear advantage. Just to name some: handling of missing values is one (and the reason I'm not using NumPy's or SciPy's var function). Slicing is another (somewhat cumbersome in Python); data.frames yet another. I'm not sure how easy it would be to extend Python's syntax to accommodate something as useful as

    a[ is.na(a) ] <- -1

R.B.
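[For what it's worth, NumPy's boolean-mask assignment gets fairly close to that R idiom; a sketch, with the caveat that NaN is only a stand-in here, since NumPy float arrays have no true NA/missing-value type:]

```python
import numpy as np

# NumPy near-equivalent of R's  a[ is.na(a) ] <- -1 :
# build a boolean mask of the NaN positions and assign through it.
a = np.array([1.0, np.nan, 3.0, np.nan])
a[np.isnan(a)] = -1
```

After the assignment, `a` holds `[1.0, -1.0, 3.0, -1.0]`; the awkwardness René mentions is that in NumPy the missing-value semantics (propagation through `var`, etc.) still have to be handled by hand.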
René J.V. Bertin
2006-Oct-20 14:11 UTC
[Rd] [R] multithreading calling from the rpy Python package
Since Python has been mentioned in this context: could not Python's threading model and implementation serve as a guideline?

From a few simple benchmarks I've run, it seems as if the Python interpreter itself is thread-safe but not threadable. That is, when I run something "pure Python" like a recursive function that returns the nth Fibonacci number in parallel, there is no speed-up for 2 threads on a dual-processor machine. However, calling sleep in parallel does scale down with the number of threads, even on a single processor ;)

Real-life code does tend to speed up somewhat, though never as much as one would hope.

Just an idea...

René
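[The behaviour described above is the CPython GIL: bytecode execution is serialized, but blocking calls such as `time.sleep` release the lock, so sleeping threads overlap. A small sketch of the sleep half of that benchmark:]

```python
import threading
import time

# Measure the wall time of running `target` in `nthreads` parallel threads.
def timed(target, nthreads):
    threads = [threading.Thread(target=target) for _ in range(nthreads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# time.sleep releases the GIL, so sleeping threads overlap:
one = timed(lambda: time.sleep(0.2), 1)
four = timed(lambda: time.sleep(0.2), 4)
# four sleeping threads take roughly as long as one, not four times longer
```

The CPU-bound half of the benchmark (e.g. the recursive Fibonacci) shows the opposite: wall time grows roughly linearly with the thread count, because only one thread holds the GIL at a time.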