SUMMARY.
Thank you to all who responded to my question. As usual this list has proven
to be an incredible source of information. For reference I include all
replies I've received.
In summary, there are four "products" out there (the interpretations are mine,
so please consult the original messages below as necessary):
* MOSIX (www.mosix.org), a Linux kernel patch that "automatically
shuffles processes around a network of Linux machines to maximize
performance & make use of memory".
* Codine batch queuing system.  Codine is freely available from
www.sun.com and works reasonably well on Linux boxes. With a simple
wrapper script, a job can be submitted by just prepending 'q' to it. Poor
documentation is cited as a major drawback.
* GNU Queue. Seems to be very similar to Codine.
* Load Sharing Facility (LSF). No one said where it can be found, but it
is mentioned more than once, along with its pros and cons.
Again, those interested in more details should read the original messages
included below, as they contain a lot of valuable information.
Thanks,
Vadim
From:	Taylor, Z Todd [todd.taylor at pnl.gov]
Sent:	Tuesday, September 18, 2001 7:54 AM
To:	'Vadim Ogranovich'
Cc:	Taylor, Z Todd
Subject:	RE: [R] computational capacity of Linux network
Vadim Ogranovich [mailto:vograno at arbitrade.com] wrote:
> As a part of our work we run a lot of non-interactive computational jobs.
> To increase the throughput we would like to distribute the load over the
> entire network and we are looking at Linux network as a platform. Ideally
> we would like to be able to submit a job to the network, rather than to a
> computer, and let the "network" to figure out the resource to run the job on.
>
> I'd like to hear from those people who have had an experience of working in
> such an environment. Specifically
> * what system (Red Hat, etc.) was used? What kind of software was used for
> load balancing?
I've done a lot of heavy crunching using Linux and the Codine
batch queueing system.  Codine is freely available from
www.sun.com and works reasonably well on Linux boxes.
> * How big is the USER overhead to set such a job comparatively to running a
> command from shell? How reliable and convenient error reporting is if the
> job fails?
Codine uses an extension to the shell language(s) to allow users
to tailor their scripts.  For my purposes, I wrote a simple
wrapper script that lets any user submit any job without special
requirements by simply prepending a 'q' to it.  E.g.,
   q my_big_program -arg1 -arg2 -arg3
Very convenient.  Error reporting is okay.  You can direct stderr
to a file and/or send email to a particular user if something goes
wrong.  Codine has various settings for restarting queues if the
machine goes down in the middle of a job.  Errors that happen in
the queueing system (as opposed to your script) are a little harder
to figure out.
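
A wrapper of the kind described above might look roughly like the
following sketch. It is not the script from this message, which was not
posted. The submit command name "qsub", the assumption that it reads a job
script from standard input, and the function names are all illustrative
and would need adjusting for the local installation.

```python
import shlex
import subprocess

# ASSUMPTION: the batch system provides a submit command (called "qsub"
# here) that accepts a job script on standard input.
SUBMIT_CMD = ["qsub"]


def build_job_script(argv, workdir):
    """Return shell script text that runs argv from workdir."""
    if not argv:
        raise ValueError("no command given")
    # Quote every argument so spaces and shell metacharacters survive.
    command = " ".join(shlex.quote(arg) for arg in argv)
    return "#!/bin/sh\ncd {} || exit 1\nexec {}\n".format(
        shlex.quote(workdir), command
    )


def submit(argv, workdir, submit_cmd=SUBMIT_CMD):
    """Pipe the generated script to the (assumed) submit command."""
    script = build_job_script(argv, workdir)
    return subprocess.run(submit_cmd, input=script, text=True).returncode
```

Installed as 'q', a script like this would call
submit(sys.argv[1:], os.getcwd()) so that "q my_big_program -arg1" queues
the command from the current directory.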
> * how much of sysadmin time is required to maintain such a network?
> * any other comment.
Once things are running, not much.  Codine's installation routines
make that part pretty easy as long as nothing goes wrong.  Mine
were all easy except the RedHat 7.1 boxes, for which I had to make
one minor change to an install script (documented in Sun's FAQ).
My system is a mixture of RH 6.* and 7.1.
Codine's documentation is weak at best.  In fact, I'd say it's quite
poor.  I had been using DQS (Florida State University) before Codine,
and since Codine is a derivative thereof, I've been able to figure
most things out.  But a newcomer might have a little trouble getting
bootstrapped.
All in all I'm reasonably pleased with Codine.
> Please reply directly to me and I'll post the summary.
> 
> Thanks,
> Vadim
--Todd
-- 
Z. Todd Taylor
Pacific Northwest National Laboratory
Todd.Taylor at pnl.gov
Why do you say the alarm went off, when really it came on?
From:	detlef.steuer at UniBw-Hamburg.DE
Sent:	Tuesday, September 18, 2001 1:11 AM
To:	Vadim Ogranovich
Subject:	RE: [R] computational capacity of Linux network
Hi!
We did the same last week.
Using mosix (www.mosix.org) gives a virtual supercomputer without any user
overhead. Perhaps a slow net is a problem.
Distribution doesn't matter. We use RedHat and SuSE in a mix.
Any distribution should do, because mosix is nothing but a kernel patch.
Sysadmin time cannot be estimated from a week of experience. After setting
up we had no problems.
Seems to be a really great tool.
detlef
On 17-Sep-2001 Vadim Ogranovich wrote:
> Hi, This is not an R question per se, but I feel like this is a right
[...]
Detlef Steuer ** Universität der Bw ** 22043 Hamburg
Tel: (0049) (0)40/6541-2819
steuer at unibw-hamburg.de
"Whenever there is a conflict between human rights and 
property rights, human rights must prevail." - Lincoln
From:	Michael Mader [m.mader at gsf.de]
Sent:	Monday, September 17, 2001 10:40 PM
To:	Vadim Ogranovich
Subject:	Re: [R] computational capacity of Linux network
Vadim Ogranovich wrote:
> load balancing?
We are successfully using lsf (Load Sharing Facility) on our Tru64
(workstation) clusters. IRIX is ok as well and Linux is at least
supported (haven't tested lsf on it).
> * How big is the USER overhead to set such a job comparatively to running a
> command from shell?
It depends. Straightforward jobs are as simple as possible. Job
dependencies are quite easy as well.
There seems to be an X GUI around but I've never used it.
> How reliable and convenient error reporting is if the
> job fails?
It reports success; no answer means no success. Network breakdowns are an
obstacle (the queue may become unusable, and in severe cases jobs can
be lost and have to be resubmitted).
> * how much of sysadmin time is required to maintain such a network?
Mainly installation/updating. It saves a lot of time compared to
host-specific job submission systems.
> 
-- 
Michael T. Mader
Institute for Bioinformatics/MIPS, GSF
0049-89-3187-3576
From:	Warnes, Gregory R [gregory_r_warnes at groton.pfizer.com]
Sent:	Monday, September 17, 2001 4:44 PM
To:	'Vadim Ogranovich'
Cc:	'r-help at stat.math.ethz.ch'
Subject:	RE: [R] computational capacity of Linux network
Hi Vadim,
I have set up just such a system using MOSIX (http://www.mosix.org) and
ClusterNFS (http://ClusterNFS.sourceforge.net). MOSIX automatically
shuffles processes around a network of Linux machines to maximize
performance & make use of memory.  I've successfully used R, C, and Java
programs in this environment and have been very happy. 
<blatant advertisement>
I developed ClusterNFS to simplify the setup and maintenance of a group of
Linux machines by allowing all of the machines to share a common root
filesystem.  This is accomplished by providing 'interpreted file names' so
that configuration files can differ as necessary, while maintaining the
common file system.
</blatant advertisement>
I used them both with Debian, but they can be used with any distribution.  
-Greg
From:	Aravind Subramanian [aravind at genome.wi.mit.edu]
Sent:	Monday, September 17, 2001 3:58 PM
To:	Vadim Ogranovich
Subject:	Re: [R] computational capacity of Linux network
A commercial(?) software package that does this and which is used by the
systems group here is Load Sharing Facility (LSF). I haven't used it myself
but have heard that it works quite well.
aravind
From:	Harry Mangalam [mangalam at home.com]
Sent:	Monday, September 17, 2001 5:29 PM
To:	Vadim Ogranovich
Subject:	Re: [R] computational capacity of Linux network
If you use GNU queue, it's very cheap, easy to use, and once set up, trivial
to maintain.  It's not the easiest in the world to set up, about as
difficult as setting up samba or NFS correctly (NFS is
required to make use of it).
It's certainly not as full-featured as LSF or one of the other commercial
load-levelling utils, but it's great value for the $ (even considering the
amount of time it takes to set up).
In terms of submitting the job to the network, it does exactly this, and
queue takes care of figuring out which machine to run it on, etc.  From a
user's POV, all you do is prefix the job with 'q' and
queue takes care of the rest.
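
The idea of letting the queue pick a machine can be illustrated with a
small sketch. This is not GNU Queue's actual algorithm; the host names,
the load figures, and the pick_host function are hypothetical, and a real
system would collect the load averages from the hosts itself.

```python
def pick_host(loads, max_load=None):
    """Return the host with the lowest load average.

    loads maps host name -> load average. Hosts above max_load (if given)
    are skipped; None is returned when no host qualifies.
    """
    candidates = {
        host: load
        for host, load in loads.items()
        if max_load is None or load <= max_load
    }
    if not candidates:
        return None
    # Break ties by host name so the choice is deterministic.
    return min(candidates, key=lambda host: (candidates[host], host))
```

For example, given hypothetical loads of 0.8 for node1, 0.1 for node2, and
2.5 for node3, the job would go to node2.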
hjm
Cheers, Harry
Harry J Mangalam -- (949) 856 2847 (v&f) -- mangalam at home.com 
         [plain text appreciated, if possible]
From:	Dirk Eddelbuettel [edd at debian.org]
Sent:	Monday, September 17, 2001 2:38 PM
To:	Vadim Ogranovich
Subject:	Re: [R] computational capacity of Linux network
On Mon, Sep 17, 2001 at 02:40:43PM -0500, Vadim Ogranovich wrote:
> As a part of our work we run a lot of non-interactive computational jobs.
> To increase the throughput we would like to distribute the load over the
> entire network and we are looking at Linux network as a platform. Ideally
> we would like to be able to submit a job to the network, rather than to a
> computer, and let the "network" to figure out the resource to run the job on.
Mosix does that.
> I'd like to hear from those people who have had an experience of working in
> such an environment. Specifically
> * what system (Red Hat, etc.) was used? What kind of software was used for
> load balancing? 
You can use any. The original beowulf was built on RedHat, but many distros
can be used. See the debian-beowulf mailing list archives for Debian, and in
particular the set of already packaged tools and libraries that are
available for Debian.
> * How big is the USER overhead to set such a job comparatively to running a
> command from shell?
Not really any under, e.g., Mosix.
> How reliable and convenient error reporting is if the job fails?
That is pretty much unchanged from when other batched jobs fail.
> * how much of sysadmin time is required to maintain such a network?
Depends on how many machines and how good an infrastructure (i.e., Debian is
still easy to maintain in large quantities).
> * any other comment.
> 
> Please reply directly to me and I'll post the summary.
Aren't you in Chicago?  I'd *love* to work on such a project.
Cheers, Dirk
(a quant finance PhD)
-- 
Three out of two people have difficulties with fractions.
From:	Huntsinger, Reid [reid_huntsinger at merck.com]
Sent:	Monday, September 17, 2001 1:50 PM
To:	'Vadim Ogranovich'
Subject:	RE: [R] computational capacity of Linux network
If you haven't already, you should look at the Mosix site www.mosix.org and
the mailing list. Mosix is a kernel module/patch for Linux on Intel-like
machines which might do more or less what you want. 
Reid Huntsinger
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._