List,

Can R handle very large data sets (say, 100 million records) for data
mining applications?  My understanding is that Splus can not, but SAS
can easily.

Thanks,
Tony Fagan
Tony Fagan asks:

> List,

Sir,

> Can R handle very large data sets (say, 100 million records) for data
> mining applications?

The question assumes that the data handling capacity is a property of the
software alone, which is nonsense.  It is partly a property of the
software, partly of what you want to do with the records, but mostly of
the system on which it is run.

> My understanding is that Splus can not, but SAS can easily.

Try handling 100 million records with SAS (or anything else) on a 486 and
see how easily it does it.  More seriously, the consensus is that on the
same modern system SAS is usually better able to handle large, dumb
calculations than S-PLUS, which is (generally) better than R.  Horses for
courses.

Bill Venables.

--
-----------------------------------------------------------------
Bill Venables, Statistician, CMIS Environmetrics Project.

Physical address:                     Postal address:
CSIRO Marine Laboratories,            PO Box 120,
233 Middle St,                        Cleveland, Queensland
Cleveland, Qld, 4163                  AUSTRALIA
AUSTRALIA

Telephone: +61 7 3826 7251            Email: Bill.Venables at cmis.csiro.au
Fax: +61 7 3826 7304
Tony Fagan wrote:

> List, Can R handle very large data sets (say, 100 million records) for
> data mining applications? My understanding is that Splus can not, but
> SAS can easily. Thanks, Tony Fagan

From a theoretical point of view yes, but practically:

1) you'll need plenty of memory
2) even then the computation time will be long

In the past I used SPSS to create a summarized data file (mostly there is
much more data than is really needed).  Now I cut the data down to a few
records, write the syntax, and then run the code overnight.  I think the
plus of R is the flexibility of the analyses, not the data preparation of
very, very large databases.

Merry Xmas & a happy new year,

Peter

--
** To YOU I'm an atheist; to God, I'm the Loyal Opposition.  Woody Allen **
P.Malewski                        Tel.: 0531 500965
Maschplatz 8                      Email: P.Malewski at tu-bs.de
************************ 38114 Braunschweig ********************************
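[A rough R sketch of the "summarise first, analyse later" workflow Peter
describes above, done entirely in R by reading a large delimited file in
chunks so it never has to fit in memory at once.  The file name "big.dat"
and its two columns ("group", "value", no header row) are hypothetical and
do not come from the post.]

chunk.size <- 100000
con <- file("big.dat", open = "r")

sums   <- numeric(0)
counts <- numeric(0)

repeat {
  chunk <- try(read.table(con, nrows = chunk.size, sep = "\t",
                          col.names = c("group", "value")),
               silent = TRUE)
  if (inherits(chunk, "try-error")) break      # connection exhausted
  s <- tapply(chunk$value, chunk$group, sum)   # per-group sums, this chunk
  n <- table(chunk$group)                      # per-group counts, this chunk
  for (g in names(s)) {                        # accumulate running totals
    old.s <- if (g %in% names(sums))   sums[[g]]   else 0
    old.n <- if (g %in% names(counts)) counts[[g]] else 0
    sums[[g]]   <- old.s + s[[g]]
    counts[[g]] <- old.n + as.numeric(n[[g]])
  }
  if (nrow(chunk) < chunk.size) break          # last (partial) chunk
}
close(con)

## a small summary object, ready for ordinary analysis in R
group.means <- sums / counts[names(sums)]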
There are several components to this answer.

I'm not too well versed in R, but I've run across the capacity question
before.  R has a hard limit of 2 GB total memory, as I understand, and its
data model requires holding an entire set in memory.  This is very fast
until it isn't.  This limit applies even on 64 bit systems.

SAS can "process" a practically infinite data stream, one observation at a
time (or more accurately, one read buffer at a time).  You can approach
this ideal using multiple-volume tape input on a number of OSs.  However,
this ability is limited to simple and straightforward processing -- DATA
step and some very simple procedures.

Processing limits for various operations in SAS vary by OS, SAS version,
and operation.  For 32 bit OSs under releases up through 6.8 - 6.12, 2 GB
RAM, 2 GB disk, and 32,767 (2^15 - 1) of many things were hard limits.
For various reasons, the hard limits don't apply in all cases, and
workarounds were provided in several areas.  Under 64 bit OSs, these
limits tend to be lifted, though occasionally 32 bit biases sneak through
and bite you (there was one such bug in Proc SQL).  Traditional limits
such as the number of levels (and significant bytes in character
variables) treated by PROC FREQ have been greatly increased in versions 7
and 8 of SAS.

Other limits are imposed more by the sheer size of problems.  Many SAS
statistical procedures are based on IML and are limited by memory and set
size.  Even when large memory sets are supported, complex problems with
many levels may still exceed the capacity of any system.  Moreover,
complex statistics may make little sense on such large datasets.

When dealing with large datasets outside of SAS, my suggestion would be to
look to tools such as Perl and MySQL to handle the procedural and
relational processing of data, using R as an analytic tool.  Most simple
statistics (subsetting, aggregation, drilldown) can be accommodated
through these sorts of tools.  Think of the relationship to R as the
division between the DATA step and SAS/STAT or SAS/GRAPH.

I would be interested to know of any data cube tools which are freely
available or available as free software.

On Wed, Dec 22, 1999 at 10:38:30PM -0700, Tony Fagan wrote:
> List,
>
> Can R handle very large data sets (say, 100 million records) for data
> mining applications?  My understanding is that Splus can not, but SAS
> can easily.
>
> Thanks,
> Tony Fagan

--
Karsten M. Self (kmself at ix.netcom.com)
    What part of "Gestalt" don't you understand?

SAS for Linux: http://www.netcom.com/~kmself/SAS/SAS4Linux.html
Mailing list:  "subscribe sas-linux" to mailto:majordomo at cranfield.ac.uk
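[A minimal R sketch of the division of labour Karsten suggests: let the
database do the heavy subsetting and aggregation, and hand R only a small
summary.  It assumes the DBI and RMySQL add-on packages are installed and
that a MySQL table "transactions" with columns "region" and "amount"
exists; the table, column, and connection names are all made up for
illustration, not taken from the post.]

library(DBI)       # generic database interface
library(RMySQL)    # MySQL driver for DBI

con <- dbConnect(MySQL(), dbname = "warehouse",
                 user = "analyst", password = "xxxx")

## the aggregation runs inside MySQL; only the per-region summary
## (a handful of rows) ever crosses into R
region.summary <- dbGetQuery(con, "
    SELECT region,
           COUNT(*)    AS n,
           AVG(amount) AS mean_amount
    FROM   transactions
    GROUP  BY region")

dbDisconnect(con)

## from here on it is ordinary, small R data
region.summary[order(region.summary$mean_amount, decreasing = TRUE), ]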
On Wed, 22 Dec 1999, Tony Fagan wrote:

> List,
>
> Can R handle very large data sets (say, 100 million records) for data
> mining applications?  My understanding is that Splus can not, but SAS
> can easily.
>
> Thanks,
> Tony Fagan

There have been a couple of posts about approaching this large-dataset
problem with the MySQL/Python/R combination.  I will simply add some
information (a testimonial) about my experiences with this as a possible
solution.  This combination has worked very, very well for me.

As a former SAS and Windows user, I decided to perform my dissertation
data analyses using FreeBSD, which does not run SAS.  After about a year
of tinkering around with different ways to approach the problem of
analyzing my dissertation data (i.e., moderately large ~1.5 million obs of
psychophysiological data), I have settled on this MySQL/Python/R
combination.  In order to get to this stage, I looked into several other
solutions (e.g., Perl Data Language, PostgreSQL, Ox, APL, Perl, etc.), but
this combination met my needs best.  For my purposes, I find this solution
to be better than any other (including SAS).

MySQL is very, very fast, especially when using an index.  Just last
night, I could not believe how quickly it created an R dataset for me
(only 30 seconds on a slow machine---a 486DX 66MHz---for a complex join of
four tables, each table containing about 500K rows).  For most
data-analytic purposes, I go directly from (1) subsetting the data in
MySQL to (2) performing more sophisticated data analyses in R.  For some
more complex queries, the Python link is needed, but not for most (Python,
of course, is useful for many reasons other than linking MySQL to R).

For my dissertation data, there is no reason for me to analyze all 1.5
million rows at once.  Rather, I need to perform the same statistical
procedures, one or two subjects at a time (i.e., 2400 rows), over and over
again.  I let the SQL backend do the large, number-crunching work and let
R shine for statistics, and it really does shine...

Testimonially yours,
Loren

-------------------------------
Loren Michael McCarter
Graduate Student-UC Berkeley
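[A sketch of the per-subject loop Loren describes: pull a few thousand
rows for one subject at a time from MySQL and re-run the same analysis on
each.  It assumes the DBI and RMySQL packages; the table "psychophys", its
columns "subject", "cond", and "hr", and the model fitted are invented for
illustration and are not from the post.]

library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "dissertation")

## the list of subject identifiers stored in the (hypothetical) table
subjects <- dbGetQuery(con,
    "SELECT DISTINCT subject FROM psychophys")$subject

## fetch roughly 2400 rows per subject and re-fit the same model each time;
## the full 1.5 million rows never enter R at once
results <- lapply(subjects, function(s) {
    d <- dbGetQuery(con, sprintf(
        "SELECT cond, hr FROM psychophys WHERE subject = %s", s))
    coef(lm(hr ~ cond, data = d))
})
names(results) <- subjects

dbDisconnect(con)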