On Jul 19, 2011, at 11:07 AM, Alireza Mahani wrote:
> Further pursuing my curiosity about the efficiency of the R/C++ interface, I
> conducted a simple matrix-vector multiplication test using the .C and .Call
> functions in R. In each case, I measured the execution time in R as well as
> inside the C++ function. Subtracting the two, I came up with a measure of the
> overhead associated with each call. I assume that this overhead would be
> non-existent if the entire code were implemented in C++. [Please let me know
> if my approach is somehow faulty.] The results of this simple test have
> significant implications for my software development approach. I hope others
> find this information useful as well. I have attached the code files at the
> bottom. Below I provide a summary output and interpret the results. [All
> times in the table below are in microseconds.]
>
>
>    m    n   .Call (R)   .Call (C++)   .Call overhead   .C (R)   .C (C++)   .C overhead
>  100   10        3.23          1.33             1.90    16.89       0.96         15.93
>  200   15        5.38          3.44             1.94    30.77       2.81         27.96
>  500   20       12.48         10.40             2.08    78.12       9.37         68.75
>
Your "overhead" calculations are anything but since they include a
loop, value replacement, arithmetics, unnecessary function calls (you don't
need as.double() for .Call - it's more efficient to use coerceVector() or
throw an error depending on what you desire) - that have nothing to do with
.Call.
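For illustration, a minimal sketch of doing that coercion on the C side (the
entry-point name matvecMultiplyCoerce is made up and error checking is
omitted):

#include <Rinternals.h>

/* hypothetical .Call entry point: coerces its own arguments so the R caller
   does not need as.double(); coerceVector() returns its argument unchanged
   when it is already a double vector, so the common case copies nothing */
SEXP matvecMultiplyCoerce(SEXP A, SEXP x)
{
    PROTECT(A = coerceVector(A, REALSXP));
    PROTECT(x = coerceVector(x, REALSXP));
    int n = length(x), m = length(A) / n;
    SEXP y = PROTECT(allocVector(REALSXP, m));
    double *rA = REAL(A), *rx = REAL(x), *ry = REAL(y);
    for (int i = 0; i < m; i++) {
        ry[i] = 0.0;
        for (int j = 0; j < n; j++)
            ry[i] += rA[j * m + i] * rx[j];
    }
    UNPROTECT(3);
    return y;
}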
In addition, if you really cared about overhead, you would avoid symbol lookup
millions of times:
foo.c:
#include <Rinternals.h>
SEXP foo() { return R_NilValue; }
> system.time(for(i in 1:1e6) .Call("foo"))
user system elapsed
4.943 0.016 4.960
> foo = getNativeSymbolInfo("foo")$address
> system.time(for(i in 1:1e6) .Call(foo))
user system elapsed
0.363 0.001 0.365
This is very close to just calling a constant closure with the same result:
> f=function() NULL
> system.time(for(i in 1:1e6) f())
user system elapsed
0.238 0.001 0.239
So .Call overhead is essentially negligible - and not what you were measuring
;).
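(If the symbol lives in a package, the usual way to avoid the global lookup
once and for all is to register the routine at load time; a rough sketch,
assuming a shared object / package called "mypkg":)

#include <Rinternals.h>
#include <R_ext/Rdynload.h>

extern SEXP foo(void);

/* table of .Call routines: name, function pointer, number of arguments */
static const R_CallMethodDef callMethods[] = {
    {"foo", (DL_FUNC) &foo, 0},
    {NULL, NULL, 0}
};

/* run automatically when the "mypkg" shared object is loaded */
void R_init_mypkg(DllInfo *info)
{
    R_registerRoutines(info, NULL, callMethods, NULL, NULL);
}

With registration in place, .Call("foo", PACKAGE = "mypkg") only has to look
in one table, though passing the looked-up address as above remains the
cheapest option.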
There is only a tiny overhead in forcing the argument promise, but it's
constant:
> x=rnorm(1e6)
> system.time(for(i in 1:1e6) .Call(foo,x))
user system elapsed
0.455 0.001 0.456
> x=rnorm(1e9)
> system.time(for(i in 1:1e6) .Call(foo,x))
user system elapsed
0.454 0.000 0.455
and it's again no different than for a closure:
> f=function(x) NULL
> system.time(for(i in 1:1e6) f())
user system elapsed
0.259 0.001 0.259
> system.time(for(i in 1:1e6) f(x))
user system elapsed
0.329 0.001 0.329
so this is inherent to any call. .Call is the fastest you can get; there is
nothing faster (other than hacking your code into R internals...).
Cheers,
Simon
> Interpretation:
>
> 1- .Call overhead holds nearly constant, i.e. independent of data size. This
> is expected since .Call works by passing pointers. (How can we explain the
> slight increase in overhead?)
> 2- C++ times for .C are somewhat better than .Call. This is likely to be due
> to the overhead associated with unpacking the SEXP pointers in a .Call
> function.
> 3- The overhead for .C dominates the execution time. For a 500x20 matrix,
> the overhead is ~90% of total time. This means that whenever we need to make
> repeated calls to a C/C++ function from R, and when performance is important
> to us, .Call is much preferred to .C, even at modest data sizes.
> 4- Overhead for .C scales sub-linearly with data size. I imagine that this
> overhead can be reduced through registering the functions and using the
> style field in R_CMethodDef to optimize data transfer (per Section 5.4 of
> "Writing R Extensions"), but perhaps not by more than a half.
> 5- Even using .Call, the overhead is still a significant percentage of total
> time, though the contribution decreases with data size (and compute load of
> function). However, in the context of parallelization benefits, this
> overhead puts an unpleasant cap on achievable gains. For example, even if
> the compute time goes down by 100x, if overhead was even 10% of the time,
> our effective gain will saturate at 10x. In other words, effective
> parallelization can only be achieved by moving big chunks of the code to
> C/C++ so as to avoid repeated calls from R to C.
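Re point 4: for reference, a rough sketch of what such a registration could
look like for the .C routine, following Section 5.4 of "Writing R Extensions"
(the types/styles arrays and the R_init_mvMultiply name are illustrative, and
the style field is only a hint - it is not guaranteed to reduce copying):

#include <Rinternals.h>
#include <R_ext/Rdynload.h>

extern void matvecMultiply2(double *A, double *x, double *y,
                            int *m_ptr, int *n_ptr, double *execTime);

/* argument types and (advisory) in/out styles for the .C routine */
static R_NativePrimitiveArgType matvecMultiply2_types[6] =
    {REALSXP, REALSXP, REALSXP, INTSXP, INTSXP, REALSXP};
static R_NativeArgStyle matvecMultiply2_styles[6] =
    {R_ARG_IN, R_ARG_IN, R_ARG_OUT, R_ARG_IN, R_ARG_IN, R_ARG_OUT};

static const R_CMethodDef cMethods[] = {
    {"matvecMultiply2", (DL_FUNC) &matvecMultiply2, 6,
     matvecMultiply2_types, matvecMultiply2_styles},
    {NULL, NULL, 0, NULL, NULL}
};

void R_init_mvMultiply(DllInfo *info)
{
    R_registerRoutines(info, cMethods, NULL, NULL, NULL);
}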
>
> I look forward to your comments.
>
> -Alireza
> --------------------------------
> # code file: mvMultiply.r
> # measuring execution time for a matrix-vector multiplication (A %*% x)
> # using the .C and .Call functions
> # execution time is measured both in R and inside the C++ functions;
> # subtracting the two is used as an indicator of the overhead associated
> # with each call
>
> rm(list = ls())
> system("R CMD SHLIB mvMultiply.cc")
> dyn.load("mvMultiply.so")
>
> m <- 100 #number of rows in matrix A
> n <- 10 #number of columns in matrix A (= number of rows in vector x)
> N <- 1000000
>
> A <- runif(m*n)
> x <- runif(n)
>
> # measuring .Call
> tCall_c <- 0.0
> t1 <- proc.time()[3]
> for (i in 1:N) {
>   tCall_c <- tCall_c + .Call("matvecMultiply", as.double(A), as.double(x))
> }
> tCall_R <- proc.time()[3] - t1
> cat(".Call - Time measured in R:   ", round(tCall_R,2), "sec\n")
> cat(".Call - Time measured in C++: ", round(tCall_c,2), "sec\n")
> cat(".Call - Implied overhead: ", round(tCall_R,2) - round(tCall_c,2),
>     "sec -> per call: ", 1000000*(round(tCall_R,2) - round(tCall_c,2))/N, "usec\n")
>
> # measuring .C
> tC_c <- 0.0
> t1 <- proc.time()[3]
> for (i in 1:N) {
>   tC_c <- tC_c + .C("matvecMultiply2", as.double(A), as.double(x),
>                     double(m), as.integer(m), as.integer(n), t = double(1))$t
> }
> tC_R <- proc.time()[3] - t1
> cat(".C - Time measured in R:   ", round(tC_R,2), "sec\n")
> cat(".C - Time measured in C++: ", round(tC_c,2), "sec\n")
> cat(".C - Implied overhead: ", round(tC_R,2) - round(tC_c,2),
>     "sec -> per call: ", 1000000*(round(tC_R,2) - round(tC_c,2))/N, "usec\n")
>
> dyn.unload("mvMultiply.so")
> --------------------------------
> // code file: mvMultiply.cc
> #include <Rinternals.h>
> #include <R.h>
> #include <sys/time.h>
>
> extern "C" {
>
> SEXP matvecMultiply(SEXP A, SEXP x) {
> timeval time1, time2;
> gettimeofday(&time1, NULL);
> SEXP execTime_sxp;
> PROTECT(execTime_sxp = allocVector(REALSXP, 1));
> double *execTime; execTime = REAL(execTime_sxp);
> double *rA = REAL(A), *rx = REAL(x), *ry;
> int n = length(x);
> int m = length(A) / n;
> SEXP y;
> PROTECT(y = allocVector(REALSXP, m));
> ry = REAL(y);
> for (int i = 0; i < m; i++) {
> ry[i] = 0.0;
> for (int j = 0; j < n; j++) {
> ry[i] += rA[j * m + i] * rx[j];
> }
> }
> UNPROTECT(1);
> gettimeofday(&time2, NULL);
> *execTime = (time2.tv_sec - time1.tv_sec) + (time2.tv_usec - time1.tv_usec)/1000000.0;
> UNPROTECT(1);
> return execTime_sxp;
> }
>
> void matvecMultiply2(double *A, double *x, double *y, int *m_ptr, int *n_ptr, double *execTime) {
> timeval time1, time2;
> gettimeofday(&time1, NULL);
> int m = *m_ptr;
> int n = *n_ptr;
> for (int i = 0; i < m; i++) {
> y[i] = 0.0;
> for (int j = 0; j < n; j++) {
> y[i] += A[j * m + i] * x[j];
> }
> }
> gettimeofday(&time2, NULL);
> *execTime = (time2.tv_sec - time1.tv_sec) + (time2.tv_usec - time1.tv_usec)/1000000.0;
> }
>
> }
> --------------------------------
>
>