Erik van Zijst
2006-Dec-08 14:51 UTC
[Rd] Pre-compilation and server-side parallel execution
Folks,

My company operates a platform that distributes real-time financial data from exchanges to users. To extend our services I want to allow users to write and submit custom R scripts to our platform that operate on our streaming data to do real-time analysis. We have thousands of users deploying scripts, and each script is evaluated repeatedly when certain conditions in the stream apply. For example, a script could compute the NASDAQ100 index value each time one of its 100 constituents trades.

Scripts are typically small and execute quickly. Each script is registered once and then repeatedly evaluated with different parameters (possibly several times per second per script). In this context my biggest concern is scalability.

The evaluation engine is a pure server-side component without display abilities. An R script is invoked with parameters and whatever it returns is sent to the user. Ideally I'd need a C API to interact with the interpreter.

I've looked at projects like R/Apache, RServe and RSJava for inspiration and came to the conclusion that all these projects work by forking multiple instances of the R engine, where each instance evaluates one script at a time. As our service must evaluate many different scripts concurrently (isolated from one another), I have the following concerns:

1. Spawning a pool of engine instances for massively parallel execution is expensive, but might work with lots of memory.
2. R's native C API [http://cran.r-project.org/doc/manuals/R-exts.html#The-R-API] does not separate parsing from evaluation. When the same script is evaluated 10 times, it is also parsed 10 times.

I'm mostly concerned about the second issue. Our scripts are registered once and continuously evaluated. I want to avoid parsing the same script again each time it is evaluated. Does the engine recognize previously parsed scripts (like Oracle does for SQL queries)?
I am interested to hear your thoughts on my concerns and whether you think R would work in this architecture.

kind regards,
Erik van Zijst

--
And on the seventh day, He exited from append mode.
Simon Urbanek
2006-Dec-08 22:28 UTC
[Rd] Pre-compilation and server-side parallel execution
On Dec 8, 2006, at 9:51 AM, Erik van Zijst wrote:

> 2. R's native C-api
> [http://cran.r-project.org/doc/manuals/R-exts.html#The-R-API] does
> not separate parsing from evaluation.

Actually it does - see "R_ParseVector" and "eval". You're free to run the parser once (or even construct the expression directly) and evaluate it many times. (Also note that you can serialize the parsed expression if desired.)

If your worries are really at this level, then you will have to create entirely your own solution, because the overhead of IPC will be way more than the time spent in the parser. Actually I'm wondering whether you checked it at all, because I'd almost certainly expect the evaluation to take way more time than the parsing step. If it does not, I'd be inclined to think that you have rather a design problem.

Cheers,
Simon
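The parse-once, evaluate-many pattern Simon describes can be sketched at the R level with `parse()` and `eval()`, which are the interpreter-level counterparts of the C entry points he names. This is only an illustrative sketch: the script text, the `run_script` helper and the parameter names are made up for the example and are not part of the original poster's platform.

```r
# Parse the script text once, at registration time.
# The script text is a hypothetical stand-in for a user-submitted script.
script_text <- "sum(prices * weights) / divisor"
parsed <- parse(text = script_text)

# Hypothetical helper: evaluate the already-parsed expression against a
# fresh environment that holds only this call's parameters.
run_script <- function(expr, params) {
  # parent = baseenv() keeps base functions visible while hiding the
  # global workspace from the script.
  env <- list2env(params, parent = baseenv())
  eval(expr, envir = env)
}

# Evaluate the same parsed expression repeatedly with different inputs;
# no re-parsing happens between calls.
r1 <- run_script(parsed, list(prices = c(10, 20), weights = c(1, 2), divisor = 5))
r2 <- run_script(parsed, list(prices = c(11, 20), weights = c(1, 2), divisor = 5))
```

Timing `parse()` and `eval()` separately (for instance with `system.time()`) on a representative script would show whether parsing is a measurable share of the cost before designing around it.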
On 12/8/06, Erik van Zijst <r at erik.prutser.cx> wrote:

> 2. R's native C-api
> [http://cran.r-project.org/doc/manuals/R-exts.html#The-R-API] does not
> separate parsing from evaluation. When the same script is evaluated 10
> times, it is also parsed 10 times.
>
> I'm mostly concerned about the second issue. Our scripts are registered
> once and continuously evaluated. I want to avoid parsing the same script
> again each time it is evaluated. Does the engine recognize previously
> parsed scripts (like oracle does for SQL queries)?

A database server is doing rather more than simply parsing a query--it's also running a query planner to optimize execution and quite possibly a number of other things, so it behooves the DBMS to cache that information whenever possible. The closest functional equivalent in R would be wrapping everything in a function and then serializing the resulting function somewhere.
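The suggestion to wrap a script in a function and serialize it might look roughly like the following. It is a sketch under assumptions: the function body and argument names are invented for illustration, and the serialized form is kept in memory here, whereas a real service would presumably store it in a file or database.

```r
# Hypothetical user script, submitted as the text of a function definition.
script_text <- "function(prices, weights, divisor) sum(prices * weights) / divisor"

# Parse and evaluate once to obtain a function object. Evaluating in
# baseenv() means the closure does not capture the caller's workspace.
script_fn <- eval(parse(text = script_text), envir = baseenv())

# Serialize the function to a raw vector that can be stored somewhere.
blob <- serialize(script_fn, connection = NULL)

# Later (possibly in another R process): restore and call it directly,
# without touching the original source text again.
restored <- unserialize(blob)
result <- restored(c(10, 20), c(1, 2), 5)
```

`saveRDS()` and `readRDS()` offer the same round trip through a file when a raw vector is not convenient.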