On Sep 5, 2009, at 2:31 PM, Peter Juhasz wrote:
> Reposting from R-help:
>
> Dear R experts,
>
> please excuse me for writing to the mailing list without subscribing.
> I have a somewhat urgent problem that relates to R.
>
> I have to process large amounts of data with R - I'm in an
> international collaboration and the data processing protocol is fixed,
> that is a specific set of R commands has to be used.
>
> I wrote a perl program that manages creation of data subsets from my
> database and feeds these subsets to an R process via pipes.
>
> This worked all right, however, I wanted to speed things up by
> exploiting the fact that I have a dual-core machine. So I rewrote my
> perl driver program to use two threads, each starting its own R
> instance, getting data off a queue and feeding it to its R process.
>
> This also worked, except that I noticed something very peculiar: the
> processing time was almost exactly the same for both cases. I did some
> tests to look at this, and it seems that R needs twice the time to do
> the exact same thing if there are two instances of it running.
>
> I don't understand how this is possible. Maybe there is an issue of
> thread-safety with the R backend, meaning that the two R *interpreter*
> instances are talking to the same backend that's capable of processing
> only one thing at a time?
>
No, at least not in R itself. There are many possible explanations
(you are accessing the data in some way that is not parallelizable, R
is already using both cores, perl is doing something funny that you
are not anticipating ...), but I see too little evidence. The perl
code is too much of a mess to really tell - why don't you just start
two of your jobs manually in the background and clock them? For
starters, simply use
time .. &
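Concretely, that check might look like the following sketch (with `sleep 1` standing in for the actual R invocation, which is not shown in the thread):

```shell
# Two jobs in the background, each wrapped in `time`; `sleep 1` is a
# stand-in for the real R command (e.g. feeding one data subset to R).
( time sleep 1 ) 2> job1.time &
( time sleep 1 ) 2> job2.time &
wait
# If the jobs really run concurrently, each reports ~1s of real time
# and the whole script finishes in ~1s rather than ~2s.
cat job1.time job2.time
```

If each job's real time roughly doubles when both run at once, whatever is serializing them sits outside R (I/O, memory pressure, the driver itself).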
In perl I wouldn't use threads; it should be as simple as
#!/usr/bin/perl
use strict;
use warnings;

my $children = 0;

# Fork a child that runs the given shell command; the parent returns
# immediately, so the jobs run concurrently.
sub run {
    $children++;
    if (fork() == 0) {
        print "job $children started\n";
        system $_[0];
        print "job $children done\n";
        exit 0;
    }
}

run "sleep 1";
run "sleep 2";
# etc.

# Reap all children before exiting.
while ($children) { wait; $children--; }
print "Jobs done.\n";
Fino:sandbox$ ./tt
job 1 started
job 2 started
job 1 done
job 2 done
Jobs done.
(replace sleep with your R invocation ... it's admittedly very crude,
so use your imagination to improve it, but it helps to track this
down ...)
Cheers,
Simon
> Technical details: OS was Ubuntu 9.04 running on a Core 2 Duo E7300, and
> the R version used was the default one from the Ubuntu repository.
>
> Please see http://www.perlmonks.org/?node_id=792460 for an extended
> discussion of the problem, and especially
> http://www.perlmonks.org/?node_id=793506 for excerpts of output and
> actual code.
>
> I have received several suggestions about R packages that would enable
> parallel processing in some way or other, and I'm thankful for those.
>
> However, at this point I'm interested in having two completely
> unrelated R processes that run simultaneously, not in parallel
> processing from within R.
> I have to admit that I'm an absolute beginner when it comes to R and
> this project will be finished before I could learn everything I'd need
> for a pure R solution. I'm familiar with perl, however, so I'd like
> to stick to that.
>
> Thanks for your answers in advance and please excuse me if this causes
> too much noise:
>
> Péter Juhász
> physicist
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel