List,

Can R handle very large data sets (say, 100 million records) for data
mining applications?  My understanding is that Splus can not, but SAS
can easily.

Thanks,
Tony Fagan
Tony Fagan asks:

> List,

Sir,

> Can R handle very large data sets (say, 100 million records) for data
> mining applications?

The question assumes that the data handling capacity is a property of the
software alone, which is nonsense.  It is partly a property of the
software, partly of what you want to do with the records, but mostly of
the system on which it is run.

> My understanding is that Splus can not, but SAS can easily.

Try handling 100 million records with SAS (or anything else) on a 486 and
see how easily it does it.  More seriously, the consensus is that on the
same modern system SAS is usually better able to handle large, dumb
calculations than S-PLUS, which is (generally) better than R.  Horses for
courses.

Bill Venables.

--
-----------------------------------------------------------------
Bill Venables, Statistician, CMIS Environmetrics Project.

Physical address:                     Postal address:
CSIRO Marine Laboratories,            PO Box 120,
233 Middle St,                        Cleveland, Queensland
Cleveland, Qld, 4163                  AUSTRALIA
AUSTRALIA

Telephone: +61 7 3826 7251            Email: Bill.Venables at cmis.csiro.au
Fax: +61 7 3826 7304
Tony Fagan wrote:

> List, Can R handle very large data sets (say, 100 million records) for
> data mining applications? My understanding is that Splus can not, but
> SAS can easily. Thanks, Tony Fagan

From a theoretical point of view yes, but practically:

1) you'll need plenty of memory
2) even then the computation time will be long

In the past I used SPSS to create a summarized data file (mostly there is
much more data than is really needed).  Now I cut the data down to a few
records, write the syntax, and then run the code overnight.  I think the
plus of R is the flexibility of the analyses, not the data preparation of
very, very large databases.

Merry Xmas & a happy new year,

Peter

--
** To YOU I'm an atheist; to God, I'm the Loyal Opposition.  Woody Allen **
P.Malewski                        Tel.: 0531 500965
Maschplatz 8                      Email: P.Malewski at tu-bs.de
************************ 38114 Braunschweig ********************************
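[A rough R sketch of the "summarise first, analyse later" workflow Peter
describes above, done entirely in R by reading a large delimited file in
chunks so it never has to fit in memory at once.  The file name "big.dat"
and its two columns ("group", "value", no header row) are hypothetical and
do not come from the post.]

chunk.size <- 100000
con <- file("big.dat", open = "r")

sums   <- numeric(0)
counts <- numeric(0)

repeat {
  chunk <- try(read.table(con, nrows = chunk.size, sep = "\t",
                          col.names = c("group", "value")),
               silent = TRUE)
  if (inherits(chunk, "try-error")) break      # connection exhausted
  s <- tapply(chunk$value, chunk$group, sum)   # per-group sums, this chunk
  n <- table(chunk$group)                      # per-group counts, this chunk
  for (g in names(s)) {                        # accumulate running totals
    old.s <- if (g %in% names(sums))   sums[[g]]   else 0
    old.n <- if (g %in% names(counts)) counts[[g]] else 0
    sums[[g]]   <- old.s + s[[g]]
    counts[[g]] <- old.n + as.numeric(n[[g]])
  }
  if (nrow(chunk) < chunk.size) break          # last (partial) chunk
}
close(con)

## a small summary object, ready for ordinary analysis in R
group.means <- sums / counts[names(sums)]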
There are several components to this answer.

I'm not too well versed in R, but I've run across the capacity question
before.  R has a hard limit of 2 GB total memory, as I understand, and its
data model requires holding an entire set in memory.  This is very fast
until it isn't.  This limit applies even on 64 bit systems.

SAS can "process" a practically infinite data stream, one observation at a
time (or more accurately, one read buffer at a time).  You can approach
this ideal using multiple-volume tape input on a number of OSs.  However,
this ability is limited to simple and straightforward processing -- DATA
step and some very simple procedures.

Processing limits for various operations in SAS vary by OS, SAS version,
and operation.  For 32 bit OSs under releases up through 6.8 - 6.12, 2 GB
RAM, 2 GB disk, and 32,767 (2^15 - 1) of many things were hard limits.
For various reasons, the hard limits don't apply in all cases, and
workarounds were provided in several areas.  Under 64 bit OSs, these
limits tend to be lifted, though occasionally 32 bit biases sneak through
and bite you (there was one such bug in Proc SQL).  Traditional limits
such as the number of levels (and significant bytes in character
variables) treated by PROC FREQ have been greatly increased in versions 7
and 8 of SAS.

Other limits are imposed more by the sheer size of problems.  Many SAS
statistical procedures are based on IML and are limited by memory and set
size.  Even when large memory sets are supported, complex problems with
many levels may still exceed the capacity of any system.  Moreover,
complex statistics may make little sense on such large datasets.

When dealing with large datasets outside of SAS, my suggestion would be to
look to tools such as Perl and MySQL to handle the procedural and
relational processing of data, using R as an analytic tool.  Most simple
statistics (subsetting, aggregation, drilldown) can be accommodated
through these sorts of tools.  Think of the relationship to R as the
division between the DATA step and SAS/STAT or SAS/GRAPH.

I would be interested to know of any data cube tools which are freely
available or available as free software.

On Wed, Dec 22, 1999 at 10:38:30PM -0700, Tony Fagan wrote:
> List,
>
> Can R handle very large data sets (say, 100 million records) for data
> mining applications?  My understanding is that Splus can not, but SAS
> can easily.
>
> Thanks,
> Tony Fagan

--
Karsten M. Self (kmself at ix.netcom.com)
    What part of "Gestalt" don't you understand?

SAS for Linux: http://www.netcom.com/~kmself/SAS/SAS4Linux.html
Mailing list:  "subscribe sas-linux" to mailto:majordomo at cranfield.ac.uk
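[A minimal R sketch of the division of labour Karsten suggests: let the
database do the heavy subsetting and aggregation, and hand R only a small
summary.  It assumes the DBI and RMySQL add-on packages are installed and
that a MySQL table "transactions" with columns "region" and "amount"
exists; the table, column, and connection names are all made up for
illustration, not taken from the post.]

library(DBI)       # generic database interface
library(RMySQL)    # MySQL driver for DBI

con <- dbConnect(MySQL(), dbname = "warehouse",
                 user = "analyst", password = "xxxx")

## the aggregation runs inside MySQL; only the per-region summary
## (a handful of rows) ever crosses into R
region.summary <- dbGetQuery(con, "
    SELECT region,
           COUNT(*)    AS n,
           AVG(amount) AS mean_amount
    FROM   transactions
    GROUP  BY region")

dbDisconnect(con)

## from here on it is ordinary, small R data
region.summary[order(region.summary$mean_amount, decreasing = TRUE), ]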
On Wed, 22 Dec 1999, Tony Fagan wrote:

> List,
>
> Can R handle very large data sets (say, 100 million records) for data
> mining applications?  My understanding is that Splus can not, but SAS
> can easily.
>
> Thanks,
> Tony Fagan

There have been a couple of posts about approaching this large-dataset
problem with the MySQL/Python/R combination.  I will simply add some
information (a testimonial) about my experiences with this as a possible
solution.  This combination has worked very, very well for me.

As a former SAS and Windows user, I decided to perform my dissertation
data analyses using FreeBSD, which does not run SAS.  After about a year
of tinkering around with different ways to approach the problem of
analyzing my dissertation data (i.e., moderately large ~1.5 million obs of
psychophysiological data), I have settled on this MySQL/Python/R
combination.  In order to get to this stage, I looked into several other
solutions (e.g., Perl Data Language, PostgreSQL, Ox, APL, Perl, etc.), but
this combination met my needs best.  For my purposes, I find this solution
to be better than any other (including SAS).

MySQL is very, very fast, especially when using an index.  Just last
night, I could not believe how quickly it created an R dataset for me
(only 30 seconds on a slow machine---a 486DX 66MHz---for a complex join of
four tables, each table containing about 500K rows).  For most
data-analytic purposes, I go directly from (1) subsetting the data in
MySQL to (2) performing more sophisticated data analyses in R.  For some
more complex queries, the Python link is needed, but not for most (Python,
of course, is useful for many reasons other than linking MySQL to R).

For my dissertation data, there is no reason for me to analyze all 1.5
million rows at once.  Rather, I need to perform the same statistical
procedures, one or two subjects at a time (i.e., 2400 rows), over and over
again.  I let the SQL backend do the large, number-crunching work and let
R shine for statistics, and it really does shine...

Testimonially yours,
Loren

-------------------------------
Loren Michael McCarter
Graduate Student-UC Berkeley
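[A sketch of the per-subject loop Loren describes: pull a few thousand
rows for one subject at a time from MySQL and re-run the same analysis on
each.  It assumes the DBI and RMySQL packages; the table "psychophys", its
columns "subject", "cond", and "hr", and the model fitted are invented for
illustration and are not from the post.]

library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "dissertation")

## the list of subject identifiers stored in the (hypothetical) table
subjects <- dbGetQuery(con,
    "SELECT DISTINCT subject FROM psychophys")$subject

## fetch roughly 2400 rows per subject and re-fit the same model each time;
## the full 1.5 million rows never enter R at once
results <- lapply(subjects, function(s) {
    d <- dbGetQuery(con, sprintf(
        "SELECT cond, hr FROM psychophys WHERE subject = %s", s))
    coef(lm(hr ~ cond, data = d))
})
names(results) <- subjects

dbDisconnect(con)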