Dear List,

I have some projects where I use enormous datasets. For instance, the 5%
PUMS microdata from the Census Bureau. After deleting cases I may have a
dataset with 7 million+ rows and 50+ columns. Will R handle a datafile of
this size? If so, how?

Thank you in advance,
Tom Volscho

************************************
Thomas W. Volscho
Graduate Student
Dept. of Sociology U-2068
University of Connecticut
Storrs, CT 06269
Phone: (860) 486-3882
http://vm.uconn.edu/~twv00001
Thomas W Volscho <THOMAS.VOLSCHO at huskymail.uconn.edu> writes:

> Dear List, I have some projects where I use enormous datasets. For
> instance, the 5% PUMS microdata from the Census Bureau. After
> deleting cases I may have a dataset with 7 million+ rows and 50+
> columns. Will R handle a datafile of this size? If so, how?

With a big machine... If that is numeric, non-integer data, you are
looking at something like

> 7e6*50*8
[1] 2.8e+09

i.e. roughly 3 GB of data for one copy of the data set. You easily find
yourself with multiple copies, so I suppose a machine with 16 GB of RAM
would cut it. These days that basically suggests the x86_64 architecture
running Linux (e.g. multiprocessor Opterons), but there are also 64-bit
Unix "big iron" solutions (Sun, IBM, HP, ...).

If you can avoid dealing with the whole dataset at once, smaller machines
might get you there. Notice that one column is "only" 56 MB, and you may
be able to work with aggregated data from some step onwards.

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
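A minimal sketch of the column-at-a-time idea above, assuming the raw
data sit in a whitespace-delimited file "pums.dat" with 50 columns and
no header (the file name and the column picked are hypothetical):

    ## Read only column 12; colClasses = "NULL" tells read.table to skip
    ## a column entirely, so memory use stays near the ~56 MB one column
    ## costs rather than the ~3 GB the full table would.
    cols <- rep("NULL", 50)
    cols[12] <- "numeric"
    x <- read.table("pums.dat", header = FALSE, colClasses = cols)[[1]]

    ## Aggregate immediately, then discard the raw vector.
    tab <- summary(x)
    rm(x)
    gc()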
Very unlikely R will be able to handle this. The problems are:

* the data set may simply not fit into memory
* it will take forever to read from the ASCII file
* any meaningful analysis of a dataset in R typically requires 5-10
  times more memory than the size of the dataset (unless you are a real
  insider and know all the knobs)

Your best strategy is probably to partition the file into meaningful
sub-categories and work with those. To save time on conversion from
ASCII, you can read each sub-file into a data frame and then save the
data frame to an .rda file using save(). Subsequently loading the .rda
files is much faster than reading ASCII.

Another strategy often advocated on this list is to put the data into a
database and draw random samples of manageable size from it. I have no
experience with this approach.

HTH,
Vadim
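A minimal sketch of the ASCII-to-.rda caching Vadim describes, assuming
the data have already been partitioned into per-state files CT.dat,
NY.dat, and NJ.dat (the split and the file names are hypothetical):

    ## One-time conversion: parse each ASCII sub-file once, cache it.
    for (st in c("CT", "NY", "NJ")) {
      dat <- read.table(paste(st, ".dat", sep = ""), header = TRUE)
      save(dat, file = paste(st, ".rda", sep = ""))
    }

    ## Later sessions: load() restores the data frame 'dat' far faster
    ## than re-parsing the ASCII file with read.table().
    load("CT.rda")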
It depends on what you want to do with that data in R. If you want to
play with the whole data, just storing it in R will require more than
2.6 GB of memory (assuming all data are numeric and are stored as
doubles):

> 7e6 * 50 * 8 / 1024^2
[1] 2670.288

That's not impossible, but you'll need to be on a computer with quite a
bit more memory than that, and running an OS that supports it. If
that's not feasible, you need to re-think what you want to do with that
data in R (e.g., read in and process a small chunk at a time, or read
in a random sample, etc.).

Andy
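A minimal sketch of the chunk-at-a-time option Andy mentions, assuming a
header-less text file "pums.dat" and that the mean of column 3 is the
quantity of interest (both assumptions are hypothetical):

    con <- file("pums.dat", open = "r")
    total <- 0
    n <- 0
    repeat {
      ## read.table on an open connection continues where the previous
      ## read stopped; tryCatch() yields NULL once input is exhausted.
      chunk <- tryCatch(read.table(con, nrows = 100000, header = FALSE),
                        error = function(e) NULL)
      if (is.null(chunk)) break
      total <- total + sum(chunk[[3]])
      n <- n + nrow(chunk)
    }
    close(con)
    total / n   # mean of column 3 without holding all rows at once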
It depends on what you mean by 'handle', but probably not. You'll
likely have to split the file into multiple files unless you have some
rather high-end hardware. However, in my limited experience, there's
almost always a meaningful way to split the data (geographically, or by
other categories).

A few things I've learned recently working with large datasets:

1. Store files in .rda format using save() -- the load times are much
   faster and loading takes up less memory
2. If your data are integers, store them as integers!
3. Don't store character variables in data frames -- use factors

-roger

-- 
Roger D. Peng
http://www.biostat.jhsph.edu/~rpeng/
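A minimal sketch of Roger's three points, using object.size() to make
the savings visible (the sizes in the comments are approximate, and all
object names here are hypothetical):

    n <- 1000000

    ## Point 2: doubles cost 8 bytes per element, integers only 4.
    x.dbl <- rep(1, n)               # stored as double:  ~8e6 bytes
    x.int <- rep(as.integer(1), n)   # stored as integer: ~4e6 bytes
    object.size(x.dbl)
    object.size(x.int)

    ## Point 3: a factor keeps one integer code per row plus a single
    ## copy of each unique label, not millions of separate strings.
    states <- factor(sample(c("CT", "NY", "NJ"), n, replace = TRUE))

    ## Point 1: cache the converted objects for fast reloading.
    save(x.int, states, file = "pums-subset.rda")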