Hi,

I have one critical question about using R. I am currently working on some research that involves a huge amount of data (about 15GB). I am trying to use R for this research rather than SAS or STATA. (The company where I am working right now is trying to switch from SAS/STATA to R.)

As far as I know, the memory limit in R is 4GB; however, I believe there are ways to handle large datasets. Most of my work in R would be cleaning the data or running simple regressions (OLS/logit).

The whole company relies on me when it comes to R. Please teach me how to deal with large data in R. If you can, please respond soon. Thank you very much.

Regards,
Hyo
Hi,

On Aug 4, 2009, at 11:20 AM, Hyo Karen Lee wrote:

> As far as I know, the memory limit in R is 4GB;

While that might be true on Windows(?), I'm pretty/quite (positively, even) sure that's not true on 64-bit Linux/OS X.

> However, I believe there are ways to handle large datasets.
> Most of my work in R would be cleaning the data or running simple
> regressions (OLS/logit).

One place to look would be the bigmemory package:

http://cran.r-project.org/web/packages/bigmemory/

as well as the other packages listed in the High-Performance Computing task view on CRAN:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

specifically the "Large memory and out-of-memory data" section.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
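To make the bigmemory suggestion concrete, here is a minimal sketch of reading a large, purely numeric CSV into a file-backed big.matrix, so the data stay on disk rather than in RAM. The file name, column type, and backing-file names are assumptions for illustration; note that a big.matrix holds a single numeric type, so character/factor columns would need to be handled separately:

  library(bigmemory)

  ## Create a file-backed big.matrix: the data live in "bigdata.bin" on disk
  ## (file names are made up), and only the parts you index are read into RAM.
  x <- read.big.matrix("bigdata.csv", header = TRUE, type = "double",
                       backingfile = "bigdata.bin",
                       descriptorfile = "bigdata.desc")

  dim(x)       # dimensions are available without loading the whole matrix
  x[1:5, ]     # subscripting copies just those rows into an ordinary matrix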
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Hyo Karen Lee
> Sent: Tuesday, August 04, 2009 8:21 AM
> To: r-help at r-project.org
> Subject: [R] One critical question in R
>
> As far as I know, the memory limit in R is 4GB;

The memory limit depends on your hardware and OS, which you haven't told us about. With Linux and a 64-bit computer the limit is MUCH higher. With a 32-bit MS Windows OS you won't likely get even 3GB.

> However, I believe there are ways to handle large datasets.

You can use a database program, MySQL for example. If you have files that are on the order of 15GB in size, I don't think you are going to have much success cleaning the data using R (well, I know I wouldn't, but maybe one of the experts here can help you out). You may be able to use the biglm package for analyses, or read in just the data you need for your regressions. If you want more help, you will need to tell us more about what your data are like, with more specifics about what your analyses will look like.

> Most of my work in R would be cleaning the data or running simple
> regressions (OLS/logit).
>
> The whole company relies on me when it comes to R.
> Please teach me how to deal with large data in R.
> If you can, please respond soon.
> Thank you very much.
>
> Regards,
> Hyo

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
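As a concrete (hypothetical) illustration of the biglm suggestion, here is a sketch of fitting an OLS model in chunks so that only one chunk is in memory at a time. The file name ("bigdata.csv"), the column names (y, x1, x2), and the chunk size are invented; for a logit model the analogous route would be bigglm() with family = binomial():

  library(biglm)

  chunk.size <- 100000
  cols <- c("y", "x1", "x2")

  con <- file("bigdata.csv", open = "r")
  readLines(con, n = 1)                       # discard the header row

  ## The first chunk initialises the model; later chunks update it.
  chunk <- read.csv(con, header = FALSE, col.names = cols, nrows = chunk.size)
  fit <- biglm(y ~ x1 + x2, data = chunk)

  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, col.names = cols, nrows = chunk.size),
      error = function(e) NULL)               # NULL once the file is exhausted
    if (is.null(chunk) || nrow(chunk) == 0) break
    fit <- update(fit, chunk)                 # incremental update, bounded memory
  }
  close(con)

  summary(fit)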
On Tue, Aug 4, 2009 at 4:20 PM, Hyo Karen Lee <totemo22 at gmail.com> wrote:

> I am currently working on some research that involves a huge amount
> of data (about 15GB).

One point nobody seems to have made yet is that the above statement is meaningless on its own.

Do you have a CSV file that is 15GB big? The important number is the product of the number of rows and the number of columns, not the file size. It takes 21 bytes to store "1.2345678901234567890" in a CSV file, but only 8 to store it in R, a reduction in size of nearly a factor of three.

Or do you have an XLS file that is 15GB big? In which case, who knows how much bloat Microsoft have stuffed in there. Again, the important number is the product of the number of rows and the number of columns.

The fundamental thing is the number of numbers (and factors), not the file size.

Barry
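A quick back-of-the-envelope calculation along those lines (the row and column counts below are invented for illustration) shows how to estimate the in-memory size from the number of numbers rather than from the file size:

  rows <- 20e6                 # e.g. 20 million observations (assumed)
  cols <- 30                   # e.g. 30 numeric variables (assumed)

  rows * cols * 8 / 2^30       # ~4.5 GB of 8-byte doubles once inside R

Bear in mind that R usually needs temporary working copies on top of that, so having just enough RAM for the raw numbers is rarely enough.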