I am a SAS user currently evaluating R as a possible addition to, or even replacement for, SAS. The difficulty I have come across straight away is R's apparent difficulty in handling relatively large data files. Whilst I would not expect it to handle datasets with millions of records, I still really need to be able to work with datasets of 100,000+ records and 100+ variables. Yet when reading a .csv file with 180,000 records and about 200 variables, the software virtually ground to a halt (I stopped it after 1 hour).

Are there guidelines, or maybe a limitations document, anywhere that would help me assess the size of file that R in general, or specific routines, will handle? Also, mindful of the fact that I am an R novice, are there guidelines for making efficient use of R in terms of data handling?

Many thanks in advance for your help.

Regards,
Fabiano Vergari
fab.vergari@googlemail.com
On Fri, Aug 31, 2007 at 01:31:12PM +0100, Fabiano Vergari wrote:
> I am a SAS user currently evaluating R as a possible addition or
> even replacement for SAS. The difficulty I have come across straight
> away is R's apparent difficulty in handling relatively large data
> files. Whilst I would not expect it to handle datasets with millions
> of records, I still really need to be able to work with dataset with
> 100,000+ records and 100+ variables. Yet, when reading a .csv file
> with 180,000 records and about 200 variables, the software virtually
> ground to a halt (I stopped it after 1 hour). Are there guidelines
> or maybe a limitations document anywhere that helps me assess the
> size [...]

180k records with 200 variables = 36 million entries. If they're numeric then they're doubles taking up 8 bytes each, so 288 MB of RAM. This should be perfectly fine for R, as long as you have that much free RAM.

However, the routines that read CSV and other delimited tabular files are relatively inefficient for such large files. To handle large data files, it is better to use one of the database interfaces. My preference would be SQLite, unless I already had the data on a MySQL or other database server.

The documentation for the packages RSQLite and SQLiteDF should be helpful, as well as the documentation for SQLite itself, which has a facility for efficiently importing CSV and similar files directly into a SQLite database, e.g.:

http://netadmintools.com/art572.html

-- 
Daniel Lakeland
dlakelan at street-artists.org
http://www.street-artists.org/~dlakelan
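The memory estimate above can be reproduced in R itself. The sketch below also shows one commonly documented way to make read.csv cheaper, namely declaring the column types and an upper bound on the row count up front (the colClasses and nrows arguments). That tip comes from the standard R documentation rather than from this thread, and the small temporary file is a hypothetical stand-in for the real 180,000 x 200 CSV, assumed here to be all numeric.

```r
# Back-of-the-envelope memory estimate for an all-numeric data set
n_rows <- 180000
n_cols <- 200
bytes_per_double <- 8
est_mb <- n_rows * n_cols * bytes_per_double / 1e6   # 288 (MB)

# Sanity-check the 8-bytes-per-value assumption on a small numeric vector;
# the result is a little over 8 because of the fixed vector header
small <- numeric(1e5)
bytes_per_value <- as.numeric(object.size(small)) / length(small)

# Hypothetical stand-in for the real file: 50 rows, 5 numeric columns
tmp <- tempfile(fileext = ".csv")
demo <- as.data.frame(matrix(runif(50 * 5), nrow = 50))
write.csv(demo, tmp, row.names = FALSE)

# Declaring column types and a row bound spares read.csv from guessing
# types and from repeatedly growing its buffers
dat <- read.csv(tmp, colClasses = "numeric", nrows = 50)
```

For the real file, nrows can be a mild over-estimate of the true row count, and colClasses would need to match the actual column types.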
SAS was developed many years ago, when computers were far less powerful, so its heritage is that it is very efficient, and it's unlikely that R or other modern software will match SAS in that respect.

The development version of the sqldf R package provides an interface that simplifies the use of the R package RSQLite, which in turn is an interface to the SQLite database. The development version of sqldf supports RSQLite's ability to read a file directly into SQLite without going through R; you can then read it, or a subset of it, from there into R. See example 6 on the sqldf home page:

http://code.google.com/p/sqldf/
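As a rough illustration of the approach described above, here is a minimal sketch that assumes a current CRAN release of sqldf, where this facility is available as read.csv.sql (the thread itself refers only to the development version and to example 6 on the home page). The file, its column names and the query are hypothetical. The file is loaded into a temporary SQLite database and only the rows selected by the SQL statement are brought into R.

```r
# Assumes the sqldf package (and its dependency RSQLite) is installed;
# the example is skipped otherwise
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # Hypothetical small CSV standing in for a large file
  tmp <- tempfile(fileext = ".csv")
  write.csv(data.frame(id = 1:10, score = 1:10), tmp,
            row.names = FALSE, quote = FALSE)

  # Inside the SQL statement the imported file is referred to as "file";
  # only the matching rows are returned to R as a data frame
  high <- read.csv.sql(tmp, sql = "select * from file where score > 5")
}
```

The same pattern can be used to pull a subset of columns, or an aggregate, from a file that would be slow to read in full with read.csv.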