Hi!

I am a student of economics and currently do most of my statistical work using STATA. For various reasons (not least of which is an aversion to proprietary software), I am thinking of shifting to R. At the current juncture my concern is the following: would I be able to work on relatively large data-sets using R? For instance, I am currently working on a data-set which is about 350MB in size. Would it be possible to work with data-sets of such sizes using R?

I have been trying to read up on the postings in the R-archive on this topic, but I could not really understand all the discussion, nor could I reach the "end". So, I am not aware of the current state of consensus on the issue.

It would help a lot if some current user could throw some light on this issue of large data-sets in R.

Thanks in advance.

Deepankar Basu
You may or may not have problems. R keeps its data in memory, so you will need enough memory to hold the data plus all derived data and code. Since R is free, you can simply try it out. If your problems turn out to be too large, you can always get more memory or use S-Plus, which can handle larger datasets; its code is similar to R's, so you can largely reuse what you have written.

On 7/17/06, Deepankar Basu <basu.15 at osu.edu> wrote:
> [...]
> For instance, I am currently working
> on a data-set which is about 350MB in size. Would be possible to work
> data-sets of such sizes using R?
> [...]
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
On Mon, 17 Jul 2006, Deepankar Basu wrote:
> Hi!
>
> I am a student of economics and currently do most of my statistical work
> using STATA. For various reasons (not least of which is an aversion for
> proprietary software), I am thinking of shifting to R. At the current
> juncture my concern is the following: would I be able to work on
> relatively large data-sets using R? For instance, I am currently working
> on a data-set which is about 350MB in size. Would be possible to work
> data-sets of such sizes using R?

The answer depends on a lot of things, but most importantly

1) What you are going to do with the data
2) Whether you have a 32-bit or 64-bit version of R
3) How much memory your computer has.

In a 32-bit version of R (where R will not be allowed to address more than 2-3 GB of memory) an object of size 350 MB is large enough to cause problems (see e.g. the R Installation and Administration Guide).

If your 350 MB data set has lots of variables and you only use a few at a time, then you may not have any trouble even on a 32-bit system once you have read in the data.

If you have a 64-bit version of R and a few GB of memory, then there should be no real difficulty in working with that size of data set for most analyses. You might come across some analyses (e.g. some cluster analysis functions) that use n^2 memory for n observations and so break down.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
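To put rough numbers on the n^2 point above, here is a back-of-the-envelope sketch. It is written in Python purely for illustration (the arithmetic is the same in R or any other language). It assumes 8 bytes per double-precision value and takes 2 GiB, the lower end of the 32-bit range mentioned above, as the address limit. The observation counts and function names are made up for the example, and real usage will be higher once copies and object overhead are counted.

```python
BYTES_PER_DOUBLE = 8  # assumption: every value is stored as a double


def pairwise_matrix_bytes(n_obs: int) -> int:
    """Bytes for a full n x n matrix of doubles, e.g. a distance matrix."""
    return n_obs * n_obs * BYTES_PER_DOUBLE


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Assumption: lower end of the 2-3 GB limit quoted for 32-bit R.
ADDRESS_LIMIT_BYTES = 2 * 2**30

for n_obs in (1_000, 10_000, 50_000):  # hypothetical sample sizes
    need = pairwise_matrix_bytes(n_obs)
    fits = need < ADDRESS_LIMIT_BYTES
    print(f"n = {n_obs:>6}: {to_gib(need):8.3f} GiB, fits under 2 GiB: {fits}")
```

Under these assumptions, 10,000 observations need roughly 0.75 GiB for the matrix alone, while 50,000 observations need about 18.6 GiB, which is far beyond what a 32-bit process can address regardless of how small the data file itself is.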
On Mon, 17/07/2006 at 15.00 -0400, Deepankar Basu wrote:
> I have been trying to read up the posting on the R-archive on this
> topic; but I could not really understand all the discussion, nor could I
> reach the "end". So, I am not aware of the current state of consensus on
> the issue.

A general hint is to store this dataset in a database and manage your data with RODBC or other database-related packages.

Cheers
--
Daniele Medri
Thanks a lot for all the responses.

The general drift of the messages was the suggestion to use a database management system that has a nice interface with R, and most of the suggestions pointed in the direction of SQL. I will look into SQL and start learning to use it along with R.

Thanks once again for all your suggestions.

Deepankar