Michael Benjamin
2003-Dec-04 00:31 UTC
[R] R performance--referred from Bioconductor listserv
Hi, all--

I wanted to start a (new) thread on R speed/benchmarking. There is a
nice R benchmarking overview at
http://www.sciviews.org/other/benchmark.htm, along with a free script so
you can see how your machine stacks up.

Looks like R is substantially faster than S-plus.

My problem is this: with 512Mb and an overclocked AMD Athlon XP 1800+,
running at 588 SPEC-FP 2000, it still takes me 30 minutes to analyze 4Mb
.cel files x 120 files using affy (expresso). Running svm takes a
mighty long time with more than 500 genes, 150 samples.

Questions:
1) Would adding RAM or processing speed improve performance the most?
2) Is it possible to run R on a cluster without rewriting my high-level
code? In other words,
3) What are we going to do when we start collecting terabytes of array
data to analyze? There will come a "breaking point" at which desktop
systems can't perform these analyses fast enough for large quantities of
data. What then?

Michael Benjamin, MD
Winship Cancer Institute
Emory University,
Atlanta, GA
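[A minimal timing sketch for this kind of run, using system.time(); it
assumes the .CEL files sit in a local "celfiles" directory and that the
affy and e1071 packages are installed. The expresso() method choices and
the class labels are placeholders for illustration, not the settings
described above.]

library(affy)    # Bioconductor package for Affymetrix preprocessing
library(e1071)   # provides svm()

## Read the .CEL files (directory name is a placeholder) and time expresso().
abatch <- ReadAffy(celfile.path = "celfiles")
t.exp <- system.time(
  eset <- expresso(abatch,
                   bgcorrect.method = "rma",        # illustrative choices,
                   normalize.method = "quantiles",  # not the poster's settings
                   pmcorrect.method = "pmonly",
                   summary.method   = "medianpolish")
)
print(t.exp)

## Time an SVM fit on the expression matrix (samples in rows);
## `y` is a hypothetical two-class factor, one label per sample.
x <- t(exprs(eset))
y <- factor(rep(c("A", "B"), length.out = nrow(x)))
t.svm <- system.time(fit <- svm(x, y, kernel = "linear"))
print(t.svm)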
Robert Gentleman
2003-Dec-04 01:07 UTC
[R] R performance--referred from Bioconductor listserv
Hi,
Speed is an issue and large data sets are problematic. But I don't
think that they are the entire problem here. Much more of the problem
is that we don't yet know how to efficiently normalize microarrays
or how to estimate gene expression. We're still trying to get it
right rather than get it fast. There is not a lot of point in
optimizing an algorithm that has a short shelf-life. And I don't
think that anyone yet knows just which one will win.
So, some of the issues are whether algorithms can be improved (and
they probably can; some form of binning would undoubtedly help with a
lot of what is going on in microarray analyses, but that requires that
the technology be somewhat more mature than it is now, at least that
is my view).
Some gains can be made by cleaning up inefficient code (and newer
versions of the affy package have had some of that done). You can
explore this yourself (and I expect it is a bit more interesting than
benchmarking). The commands below profile the example, and the output
(cut short) shows where the time is being spent (interested readers
are referred to the Rprof and summaryRprof man pages).
library(affy)       # load the package whose code we want to profile
Rprof()             # start the profiler (writes to Rprof.out)
example(expresso)   # run the expresso() help-page example
Rprof(NULL)         # stop profiling
summaryRprof()      # summarize time spent by function
$by.self
                 self.time self.pct total.time total.pct
"fft"                 1.45     16.8       1.55      17.9
"read.dcf"            0.42      4.9       0.63       7.3
".C"                  0.35      4.1       0.35       4.1
"*"                   0.30      3.5       0.30       3.5
"ifelse"              0.28      3.3       0.87      10.1
"unique.default"      0.28      3.3       0.30       3.5
":"                   0.26      3.0       0.26       3.0
"names<-"             0.23      2.7       0.28       3.3
"rep.default"         0.23      2.7       0.23       2.7
"structure"           0.23      2.7       1.08      12.5
So the bulk of the time is spent in fft; I think the first f (fast) is
important, so you are unlikely to gain much there. The rest of the
self.time numbers suggest that there are not many gains to be had:
maybe a 20% gain with some serious reworking, perhaps more.
But in other cases profiling is a great help (we recently made pretty
minor changes that resulted in major improvements).
Robert
On Wed, Dec 03, 2003 at 07:31:49PM -0500, Michael Benjamin wrote:
> Hi, all--
>
> I wanted to start a (new) thread on R speed/benchmarking. There is a
> nice R benchmarking overview at
> http://www.sciviews.org/other/benchmark.htm, along with a free script so
> you can see how your machine stacks up.
>
> Looks like R is substantially faster than S-plus.
>
> My problem is this: with 512Mb and an overclocked AMD Athlon XP 1800+,
> running at 588 SPEC-FP 2000, it still takes me 30 minutes to analyze 4Mb
> .cel files x 120 files using affy (expresso). Running svm takes a
> mighty long time with more than 500 genes, 150 samples.
>
> Questions:
> 1) Would adding RAM or processing speed improve performance the most?
> 2) Is it possible to run R on a cluster without rewriting my high-level
> code? In other words,
> 3) What are we going to do when we start collecting terabytes of array
> data to analyze? There will come a "breaking point" at which desktop
> systems can't perform these analyses fast enough for large quantities
> of data. What then?
>
> Michael Benjamin, MD
> Winship Cancer Institute
> Emory University,
> Atlanta, GA
--
+---------------------------------------------------------------------------+
| Robert Gentleman phone : (617) 632-5250 |
| Associate Professor fax: (617) 632-2444 |
| Department of Biostatistics office: M1B20 |
| Harvard School of Public Health email: rgentlem at jimmy.harvard.edu |
+---------------------------------------------------------------------------+
Roger D. Peng
2003-Dec-04 13:57 UTC
[R] R performance--referred from Bioconductor listserv
Please see below.

Michael Benjamin wrote:
> Hi, all--
>
> I wanted to start a (new) thread on R speed/benchmarking. There is a
> nice R benchmarking overview at
> http://www.sciviews.org/other/benchmark.htm, along with a free script so
> you can see how your machine stacks up.
>
> Looks like R is substantially faster than S-plus.
>
> My problem is this: with 512Mb and an overclocked AMD Athlon XP 1800+,
> running at 588 SPEC-FP 2000, it still takes me 30 minutes to analyze 4Mb
> .cel files x 120 files using affy (expresso). Running svm takes a
> mighty long time with more than 500 genes, 150 samples.
>
> Questions:
> 1) Would adding RAM or processing speed improve performance the most?

I usually find adding RAM makes a big difference, especially for
Windows boxes.

> 2) Is it possible to run R on a cluster without rewriting my high-level
> code? In other words,

I think the answer is most likely "no". The `snow' package of
Tierney/Rossini/Li on CRAN has gone a long way in making parallel
computing in R much easier.

> 3) What are we going to do when we start collecting terabytes of array
> data to analyze? There will come a "breaking point" at which desktop
> systems can't perform these analyses fast enough for large quantities of
> data. What then?

Hasn't that "breaking point" always existed in some form or another?
If large datasets can be broken up, then clusters can be useful because
smaller chunks can be parceled out to the cluster nodes and processed.
Another thing to think about is that as R moves into the world of
64-bit processors, we will be able to load much larger datasets into
RAM. I didn't think it was possible, but I recently loaded an 8GB
dataset into R running on a Solaris/Sparc box!

-roger

> Michael Benjamin, MD
> Winship Cancer Institute
> Emory University,
> Atlanta, GA
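[A minimal sketch of the snow-based chunking Roger describes, assuming a
four-node socket cluster and a local "celfiles" directory; the per-chunk
function here is a placeholder chosen for illustration.]

library(snow)

## Start a small socket cluster (four nodes is an arbitrary choice).
cl <- makeCluster(4, type = "SOCK")

## Hypothetical list of .CEL files, split into one chunk per node.
celfiles <- list.files("celfiles", pattern = "\\.cel$", ignore.case = TRUE,
                       full.names = TRUE)
chunks <- split(celfiles, cut(seq_along(celfiles), 4, labels = FALSE))

## Parcel the chunks out to the nodes; the per-chunk function here just
## totals file sizes, where a real analysis would read and preprocess
## each chunk (e.g. with affy) instead.
chunk.bytes <- parLapply(cl, chunks, function(files) sum(file.info(files)$size))

stopCluster(cl)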