Timothy Murphy
2010-Oct-29 21:20 UTC
[R] NetWorkSpace from REvolution; Distributed Computing setup questions
***Summary:***

I'm setting up a cluster using netWorkSpace, and I'm having issues with the sleigh initialization. My R function to initialize the sleigh succeeds and the sleigh appears to be ready, but I get apparently conflicting information from "status(s)", "rankCount(s)", and "s", and basic sleigh functions cause the sleigh to hang indefinitely. Also, the log file contains an error indicating that the script is trying to find a file in a nonexistent directory: "/usr/local/lib/R/site-library/nws/bin/RNWSSleighWorker.sh: 37: /Library/Frameworks/R.framework/Resources/bin/R: not found" (see Section 4).

I've spent quite a bit of time trying to debug this, and I've gathered here all the information that I think may be pertinent to solving the problem. The following is therefore a bit lengthy, but I think complete (as far as I'm able to tell from the existing documentation). It's organized into sections roughly by the topic tested. So if you're familiar with the workings of netWorkSpaces, I would be very grateful if you would take a look at my diagnostics below and tell me if you can identify the problem.

***Details:***

***Section 1***

Currently my setup is:

MASTER: MacBook Pro running OS X 10.6.4, R-2.11.1, Python 2.6.1, and NWSserver-2.0.0.
WORKER: Optiplex GX620 running Ubuntu 10.10 (64bit), R-2.11.1, Python 2.6.6, NWSserver-2.0.0, and NWS-2.0.0.3 (client)

R and Python are in the PATH on both machines; I can start them from the worker's command line by typing "R" or "python". The client is able to find the "RNWSSleighWorker.sh" file.

(Note 1: I put the server software on the client because I was getting a message saying "No nws server found" each time I tried to install the client software.
I don't know if this is needed.)
(Note 2: I plan to set up many more machines if I can get this working.)
(Note 3: Originally I was trying this on Windows machines with Cygwin, but I encountered the same error and figured I could at least rule out a possible cause by setting it up on a Linux machine. Ultimately I would like to get this working in Windows/Cygwin.)

***Section 2***

The function I used to start the sleigh is:

    s=sleigh(
    + nwsHost="172.30.xx.xx",
    + nwsPort=8765,
    + launch=sshcmd,
    + nodeList=c("10.85.xxx.xxx"),
    + scriptExec=envcmd,
    + scriptDir="/usr/local/lib/R/site-library/nws/bin",
    + scriptName="RNWSSleighWorker.sh",
    + workingDir='~/tmp/',
    + logDir='~/tmp/',
    + outfile="outfileTest",
    + user="tj")

This function returns the message below and then a clear command prompt:

    Executing command: '/Library/Frameworks/R.framework/Resources/library/nws/bin/SleighWorkerWrapper.sh' 'ssh' '-f' '-x' '-l' 'tj' '10.85.101.109' 'env' 'RSleighName=10.85.101.109' 'RSleighNwsName=sleigh_ride_0450__nwssNGG4LF' 'RSleighUserNwsName=sleigh_user_0452__nwssNGG4LF' 'RSleighID=1' 'RSleighWorkerCount=1' 'RSleighScriptDir=/usr/local/lib/R/site-library/nws/bin' 'RSleighNwsHost=172.30.34.71' 'RSleighNwsPort=8765' 'RSleighWorkingDir=~/tmp/' 'RProg=/Library/Frameworks/R.framework/Resources/bin/R' 'RSleighWorkerOut=sleigh_ride_0450__nwssNGG4LF_0001.txt' 'RSleighLogDir=~/tmp/' '/usr/local/lib/R/site-library/nws/bin/RNWSSleighWorker.sh'

If I type the name of the sleigh, "s", as below, I get information that makes it look like the sleigh is ready to receive commands:

    > s
    NWS Sleigh Object
    NWS Host: 172.30.xx.xx:8765
    Workspace Name: sleigh_ride_0446__nwssNGG4LF
    1 Worker Nodes: 10.85.xxx.xxx

Likewise, if I send a simple ssh command to the worker, I get a response:

    > system('ssh tj@10.85.101.109 date')
    Fri Oct 29 15:15:10 EDT 2010

I can also communicate values between the two machines using the NWS web server and the nwsStore() and nwsFetch() functions.
However, if I check the status of the sleigh using "status(s)" or "rankCount(s)", I get less encouraging information:

    > status(s)
    $numWorkers
    [1] 0

    $closed
    [1] 0

    > rankCount(s6)
    [1] 0

***Section 3***

I can access the NWS server through "localhost:8766" and see that sleighs are being created. There are two entries, a sleigh_ride and a sleigh_user, but the worker count in the sleigh_ride is also zero. If I execute either of the following test sleigh functions, the sleigh will hang indefinitely (though the R terminal will not hang, since I used blocking=FALSE):

    eachWorker(s5, Sys.info, eo=list(blocking=FALSE))
    eachWorker(s, function() library(nws), eo=list(blocking=FALSE))

***Section 4***

***CRUX OF THE ISSUE (probably):***

Finally, three files get created in the "~/tmp" directory that I specified as the logDir and workingDir, named "outfileTest", "RSleighSentinelLog_1000_1", and "sleigh_ride_0450__nwssNGG4LF_0001.txt". All three contain exactly the same information:

    "/usr/local/lib/R/site-library/nws/bin/RNWSSleighWorker.sh: 37: /Library/Frameworks/R.framework/Resources/bin/R: not found"

What's puzzling is the "/Library/Frameworks/R.framework/Resources/bin/R" part. That looks like an OS X-style path rather than an Ubuntu-style one. I didn't specify that path anywhere, but it definitely exists on the MacBook side. I notice that it occurs as the "RProg" value in the message that's returned when I run the sleigh function, but I can't include it in the sleigh function as an option, i.e. the following:

    > s=sleigh(
    + nwsHost="172.30.xx.xx",
    + nwsPort=8765,
    + launch=sshcmd,
    + nodeList=c("10.85.xxx.xxx"),
    + scriptExec=envcmd,
    + scriptDir="/usr/local/lib/R/site-library/nws/bin",
    + scriptName="RNWSSleighWorker.sh",
    + workingDir='~/tmp/',
    + logDir='~/tmp/',
    + outfile="outfileTest",
    + user="tj",
    + RProg="/usr/bin/R",
    + verbose=TRUE)
    Error in initialize(value, ...) : unused argument(s) RProg

(Note that "/usr/bin/R" is the response I get when I type "which R" on either the master or worker machine, so I would expect it to be a valid value for RProg.)

***Section 5***

Further, the first thing that the "RNWSSleighWorker.sh" script does is RProg=${RProg:-'R'}, which, as far as I can tell, just creates an environment variable on the worker machine with the value $RProg=R. The first time $RProg is used is on line 36. If I execute this line on the worker machine, I get a blank command prompt that responds to any command with another blank prompt:

    tj@clusterWorker1:~$ $RProg --vanilla --slave <<'EOF' > ${RSleighLogFile} 2>&1 &
    >

I tried running the sleigh() function from the master after doing this, thinking that it would activate R as a slave on the worker machine and allow it to connect, but no luck.

At this point, I've run out of ideas on things to test. I appreciate it if you've read all this, and I'll appreciate it even more if you can give me some help.

Thanks!
TJ Murphy
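Editorial note on Section 5: in a POSIX shell, the assignment RProg=${RProg:-'R'} falls back to "R" only when RProg is unset or empty. It does not overwrite a value that is already in the environment. Since the launch command shown in Section 2 passes 'RProg=/Library/Frameworks/R.framework/Resources/bin/R' through env, the worker script would keep that Mac path, which is consistent with the "not found" error in Section 4. The sketch below illustrates only the shell expansion, not the nws script itself; the variable names snippet, default_value, and inherited_value are made up for this example.

```shell
#!/bin/sh
# The same kind of assignment the worker script is quoted as starting with,
# followed by a print so the resulting value can be inspected.
snippet='RProg=${RProg:-R}; printf "%s\n" "$RProg"'

# Case 1: RProg is absent from the environment, so the fallback "R" applies.
default_value=$(unset RProg; sh -c "$snippet")
echo "RProg unset:     $default_value"

# Case 2: RProg arrives via env, the way the master's ssh launch line passes it.
# The inherited value is kept and the fallback is ignored.
inherited_value=$(env RProg=/Library/Frameworks/R.framework/Resources/bin/R sh -c "$snippet")
echo "RProg inherited: $inherited_value"
```

In other words, the default only matters when the master sends no RProg at all, so the value has to be corrected on the master side.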
yeoldefortran
2010-Nov-10 21:08 UTC
[R] NetWorkSpace from REvolution; Distributed Computing setup questions
This is very late, but in case you are still looking for the solution to this: Everything you did was right on, except the argument to sleigh is 'rprog' instead of 'RProg'. That should fix the problem.
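Editorial note: going by the reply, the failing call from Section 4 would pass rprog="/usr/bin/R" in place of RProg="/usr/bin/R". Whatever path is passed is executed on the worker (the log in Section 4 shows the worker script trying to run it), so it needs to exist there. A minimal sketch of a check that could be run on the worker before starting the sleigh follows; check_rprog is a hypothetical helper written for this note and is not part of nws.

```shell
#!/bin/sh
# check_rprog PATH
# Hypothetical helper: reports whether PATH is an executable file on this
# machine and returns non-zero if it is not.
check_rprog() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Candidate values from the thread: the worker's "which R" result and the
# Mac-only path that the master sent by default. Results depend on the machine.
check_rprog /usr/bin/R || true
check_rprog /Library/Frameworks/R.framework/Resources/bin/R || true
```

On the Ubuntu worker described in the original post, the first path would be expected to pass and the second to fail.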