Hi,

I have two further comments/questions about large datasets in R.

1. Does R's ability to handle large datasets depend on the operating
system's use of virtual memory? In theory, at least, VM should make the
difference between installed RAM and virtual memory on a hard drive
primarily a determinant of how fast R will calculate, rather than whether
or not it can do the calculations at all. However, if R has some low-level
routines that have to be memory resident and use more memory as the amount
of data increases, this may not hold. Can someone shed light on this?

2. What 64-bit versions of R are available at present?

Marsh Feldman
The University of Rhode Island

-----Original Message-----
From: Thomas Lumley [mailto:tlumley at u.washington.edu]
Sent: Monday, July 17, 2006 3:21 PM
To: Deepankar Basu
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Large datasets in R

On Mon, 17 Jul 2006, Deepankar Basu wrote:

> Hi!
>
> I am a student of economics and currently do most of my statistical work
> using STATA. For various reasons (not least of which is an aversion to
> proprietary software), I am thinking of shifting to R. At the current
> juncture my concern is the following: would I be able to work on
> relatively large data-sets using R? For instance, I am currently working
> on a data-set which is about 350MB in size. Would it be possible to work
> with data-sets of such sizes using R?

The answer depends on a lot of things, but most importantly

1) What you are going to do with the data
2) Whether you have a 32-bit or 64-bit version of R
3) How much memory your computer has.

In a 32-bit version of R (where R will not be allowed to address more than
2-3Gb of memory) an object of size 350Mb is large enough to cause problems
(see e.g. the R Installation and Administration Guide).

If your 350Mb data set has lots of variables and you only use a few at a
time, then you may not have any trouble even on a 32-bit system once you
have read in the data.

If you have a 64-bit version of R and a few Gb of memory, then there should
be no real difficulty in working with data sets of that size for most
analyses. You might come across some analyses (e.g. some cluster analysis
functions) that use n^2 memory for n observations and so break down.

  -thomas

Thomas Lumley            Assoc. Professor, Biostatistics
tlumley at u.washington.edu    University of Washington, Seattle
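As a small illustration of Thomas's "only use a few variables at a time"
suggestion, here is a hedged sketch of reading just the needed columns; the
file name, column count, and column types are hypothetical:

## Hypothetical file big.csv with 200 columns, of which only three are needed.
## A colClasses entry of "NULL" makes read.csv()/read.table() skip that column,
## so the skipped columns never occupy memory in R.
cls <- rep("NULL", 200)
cls[c(1, 5, 12)] <- c("integer", "numeric", "factor")

dat <- read.csv("big.csv", colClasses = cls)

## Check how large the resulting object actually is.
print(object.size(dat), units = "Mb")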
In my experience, the OS's use of virtual memory is only relevant in the
rough sense that the OS can store *other* running applications in virtual
memory so that R can use as much of the physical memory as possible. Once R
itself overflows into virtual memory it quickly becomes unusable.

I'm not sure I understand your second question. As R is available in source
code form, it can be compiled for many 64-bit operating systems.

-roger

Marshall Feldman wrote:
> 1. Does R's ability to handle large datasets depend on the operating
> system's use of virtual memory? [...]
>
> 2. What 64-bit versions of R are available at present?

--
Roger D. Peng  |  http://www.biostat.jhsph.edu/~rpeng/
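One way to see how close a session is getting to the point Roger describes,
before the OS starts swapping, is to watch R's own memory accounting with
gc(); the lm() call below is just a hypothetical stand-in for any
memory-hungry step:

## gc() forces a garbage collection and reports (in Mb) how much memory R's
## heap is using now ("used") and the most it has used this session ("max used").
gc()

## Reset the "max used" columns, run the expensive step, then look again to
## see roughly how much memory that one step needed.
gc(reset = TRUE)
fit <- lm(y ~ ., data = dat)   # hypothetical: any large computation
gc()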
On Tue, 18 Jul 2006, Marshall Feldman wrote:

> 1. Does R's ability to handle large datasets depend on the operating
> system's use of virtual memory? In theory, at least, VM should make the
> difference between installed RAM and virtual memory on a hard drive
> primarily a determinant of how fast R will calculate rather than whether
> or not it can do the calculations. However, if R has some low-level
> routines that have to be memory resident and use more memory as the
> amount of data increases, this may not hold. Can someone shed light on
> this?

The issue is address space, not RAM. The limits Thomas mentions are on VM,
not RAM, and it is common to have at least as much RAM installed as the VM
address space for a user process. There is no low-level code in R that has
any idea whether it is memory-resident, nor AFAIK is there any portable way
to find that out in a user process on a modern OS. (R is, as far as
possible, written to C99 and POSIX standards.)

> 2. What 64-bit versions of R are available at present?

Any OS with a 64-bit CPU for which you can find a viable 64-bit compiler
suite. We've had 64-bit versions of R since the last millennium on Solaris,
IRIX, HP-UX, OSF/1 and more recently on AIX, FreeBSD, Linux, MacOS X (on
so-called G5) and probably others. The exception is probably Windows, for
which there is no known free 'viable 64-bit compiler suite', but it is
likely that there are commercial ones.
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
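For readers unsure which kind of build they already have, a quick check from
inside R itself (a small sketch; nothing platform-specific is assumed):

## On a 64-bit build of R a pointer is 8 bytes; on a 32-bit build it is 4.
.Machine$sizeof.pointer

## The reported architecture gives the same hint (e.g. "x86_64" vs "i386").
R.version$arch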