Dear List,

I have some projects where I use enormous datasets. For instance, the 5%
PUMS microdata from the Census Bureau. After deleting cases I may have a
dataset with 7 million+ rows and 50+ columns. Will R handle a datafile of
this size? If so, how?

Thank you in advance,
Tom Volscho

************************************
Thomas W. Volscho
Graduate Student
Dept. of Sociology U-2068
University of Connecticut
Storrs, CT 06269
Phone: (860) 486-3882
http://vm.uconn.edu/~twv00001
Thomas W Volscho <THOMAS.VOLSCHO at huskymail.uconn.edu> writes:

> Dear List, I have some projects where I use enormous datasets. For
> instance, the 5% PUMS microdata from the Census Bureau. After
> deleting cases I may have a dataset with 7 million+ rows and 50+
> columns. Will R handle a datafile of this size? If so, how?

With a big machine... If that is numeric, non-integer data, you are
looking at something like

> 7e6*50*8
[1] 2.8e+09

i.e. roughly 3 GB of data for one copy of the data set. You easily find
yourself with multiple copies, so I suppose a machine with 16 GB of RAM
would cut it. These days that basically suggests the x86_64 architecture
running Linux (e.g. multiprocessor Opterons), but there are also 64-bit
Unix "big iron" solutions (Sun, IBM, HP, ...).

If you can avoid dealing with the whole dataset at once, smaller machines
might get you there. Notice that one column is "only" 56 MB, and you may
be able to work with aggregated data from some step onwards.

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
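A minimal sketch of the column-at-a-time idea above, assuming the raw
data sit in a whitespace-delimited file "pums.dat" with 50 columns and
no header (the file name and the column picked are hypothetical):

    ## Read only column 12; colClasses = "NULL" tells read.table to skip
    ## a column entirely, so memory use stays near the ~56 MB one column
    ## costs rather than the ~3 GB the full table would.
    cols <- rep("NULL", 50)
    cols[12] <- "numeric"
    x <- read.table("pums.dat", header = FALSE, colClasses = cols)[[1]]

    ## Aggregate immediately, then discard the raw vector.
    tab <- summary(x)
    rm(x)
    gc()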
Very unlikely R will be able to handle this. The problems are:

* the data set may simply not fit into memory
* it will take forever to read from the ASCII file
* any meaningful analysis of a dataset in R typically requires 5-10
  times more memory than the size of the dataset (unless you are a real
  insider and know all the knobs)

Your best strategy is probably to partition the file into meaningful
sub-categories and work with those. To save time on conversion from
ASCII, you can read each sub-file into a data frame and then save the
data frame to an .rda file using save(). Subsequently loading the .rda
files is much faster than reading ASCII.

Another strategy often advocated on this list is to put the data into a
database and draw random samples of manageable size from it. I have no
experience with this approach.

HTH,
Vadim
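A minimal sketch of the ASCII-to-.rda caching Vadim describes, assuming
the data have already been partitioned into per-state files CT.dat,
NY.dat, and NJ.dat (the split and the file names are hypothetical):

    ## One-time conversion: parse each ASCII sub-file once, cache it.
    for (st in c("CT", "NY", "NJ")) {
      dat <- read.table(paste(st, ".dat", sep = ""), header = TRUE)
      save(dat, file = paste(st, ".rda", sep = ""))
    }

    ## Later sessions: load() restores the data frame 'dat' far faster
    ## than re-parsing the ASCII file with read.table().
    load("CT.rda")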
It depends on what you want to do with that data in R. If you want to
play with the whole data, just storing it in R will require more than
2.6 GB of memory (assuming all data are numeric and are stored as
doubles):

> 7e6 * 50 * 8 / 1024^2
[1] 2670.288

That's not impossible, but you'll need to be on a computer with quite a
bit more memory than that, and running an OS that supports it. If
that's not feasible, you need to re-think what you want to do with that
data in R (e.g., read in and process a small chunk at a time, or read
in a random sample, etc.).

Andy
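A minimal sketch of the chunk-at-a-time option Andy mentions, assuming a
header-less text file "pums.dat" and that the mean of column 3 is the
quantity of interest (both assumptions are hypothetical):

    con <- file("pums.dat", open = "r")
    total <- 0
    n <- 0
    repeat {
      ## read.table on an open connection continues where the previous
      ## read stopped; tryCatch() yields NULL once input is exhausted.
      chunk <- tryCatch(read.table(con, nrows = 100000, header = FALSE),
                        error = function(e) NULL)
      if (is.null(chunk)) break
      total <- total + sum(chunk[[3]])
      n <- n + nrow(chunk)
    }
    close(con)
    total / n   # mean of column 3 without holding all rows at once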
It depends on what you mean by 'handle', but probably not. You'll
likely have to split the file into multiple files unless you have some
rather high-end hardware. However, in my limited experience, there's
almost always a meaningful way to split the data (geographically, or by
other categories).

A few things I've learned recently working with large datasets:

1. Store files in .rda format using save() -- the load times are much
   faster and loading takes up less memory
2. If your data are integers, store them as integers!
3. Don't store character variables in data frames -- use factors

-roger

-- 
Roger D. Peng
http://www.biostat.jhsph.edu/~rpeng/
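A minimal sketch of Roger's three points, using object.size() to make
the savings visible (the sizes in the comments are approximate, and all
object names here are hypothetical):

    n <- 1000000

    ## Point 2: doubles cost 8 bytes per element, integers only 4.
    x.dbl <- rep(1, n)               # stored as double:  ~8e6 bytes
    x.int <- rep(as.integer(1), n)   # stored as integer: ~4e6 bytes
    object.size(x.dbl)
    object.size(x.int)

    ## Point 3: a factor keeps one integer code per row plus a single
    ## copy of each unique label, not millions of separate strings.
    states <- factor(sample(c("CT", "NY", "NJ"), n, replace = TRUE))

    ## Point 1: cache the converted objects for fast reloading.
    save(x.int, states, file = "pums-subset.rda")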