Much easier to use colClasses in read.table, and in many cases just as fast
(or even faster).
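Something like this, for example (an untested sketch; the order of the classes
is a placeholder, adjust it to the real file):

  ## 16 "real" columns plus 250 integer dummies = 266 columns in total
  coltypes <- c(rep("numeric", 16), rep("integer", 250))
  ## use header = TRUE if the first line holds column names; otherwise the
  ## header line itself gets parsed with colClasses and triggers an error
  mydf <- read.table("C:/temp/data.csv", header = TRUE, sep = ",",
                     colClasses = coltypes, strip.white = TRUE)
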
Andy
From: Mark Stephens
> From ?scan: "the *type* of what gives the type of data to be
> read". So 'what' should be something like list(integer(), integer(),
> double(), raw(), ...). In your code all columns are being read as
> character regardless of the contents of the character vector.
>
> I have to admit that I have added the *'s in *type*. I have
> been caught out by this too. It's not the most convenient way
> to specify the types of a large number of columns either. As
> you have a lot of columns you might want to do something like
> as.list(rep(integer(1), 250)), assuming your dummies are
> together, to save typing. Also, storage.mode() is useful to
> tell you the precise type (and therefore size) of an object,
> e.g. sapply(coltypes, storage.mode) gives the types scan()
> will actually use. Note that 'numeric' could be either 'double'
> or 'integer', which matters in your case for fitting inside the
> 1GB limit, because 'integer' (4 bytes) is half the size of
> 'double' (8 bytes).
>
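> For instance, something along these lines (untested) builds a 'what'
> list of length 266, assuming, say, the 250 dummies come first and the
> remaining 16 columns are doubles, and checks the storage mode scan()
> will actually use:
>
> coltypes <- c(as.list(rep(integer(1), 250)), as.list(rep(double(1), 16)))
> sapply(coltypes, storage.mode)  # "integer" for the dummies, "double" otherwise
>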
> Perhaps someone on r-devel could enhance the documentation to
> make "type" stand out in bold capitals in help(scan)? Or maybe
> scan could be clever enough to accept a character vector for
> 'what'. Or maybe I'm missing a good reason why this isn't
> possible - anyone? How about allowing a character vector of
> length one, with each character giving the type of one column,
> e.g. what="IIIIDDCD" would mean 4 integers, then 2 doubles,
> then a character column, then finally a double column, 8
> columns in total. Probably someone somewhere has done that
> already, but I'm not aware of anyone having wrapped it up
> conveniently.
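>
> A throwaway helper along those lines is easy enough to write (the
> function name is hypothetical and this is an untested sketch):
>
> what.from.codes <- function(codes) {
>   templates <- list(I = integer(1), D = double(1), C = character(1), R = raw(1))
>   lapply(strsplit(codes, "")[[1]], function(ch) templates[[ch]])
> }
> what.from.codes("IIIIDDCD")  # 4 integers, 2 doubles, 1 character, 1 double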
>
> On 25/04/06, Sachin J <sachinj.2006 at yahoo.com> wrote:
> >
> > Mark:
> >
> > Here is the information I didn't provide in my earlier post. The R
> > version is R 2.2.1 running on Windows XP. My dataset has 16 variables
> > with the following data types.
> > ColNumber: 1 2 3 .......16
> > Datatypes:
> > "numeric","numeric","numeric","numeric","numeric","numeric","character",
> > "numeric","numeric","character","character","numeric","numeric",
> > "numeric","numeric","numeric","numeric","numeric"
> >
> > Variable (2), which is numeric, and the variables denoted as character
> > are to be treated as dummy variables in the regression.
> >
> > A search of the R-help list suggested I can also use read.csv with the
> > colClasses option instead of using scan() and then converting to a
> > data frame as you suggested. I am trying both these methods but am
> > unable to resolve a syntax error.
> >
> > > coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric",
> >     "factor","numeric","numeric","factor","factor","numeric","numeric",
> >     "numeric","numeric","numeric","numeric","numeric")
> > > mydf <- read.csv("C:/temp/data.csv", header=FALSE,
> >     colClasses=coltypes, strip.white=TRUE)
> >
> > ERROR: Error in scan(file = file, what = what, sep = sep, quote = quote,
> >   dec = dec, :
> >   scan() expected 'a real', got 'V1'
> >
> > No idea what the problem is.
> >
> > As per your suggestion, I tried scan() as follows:
> >
> > > coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric",
> >     "factor","numeric","numeric","factor","factor","numeric","numeric",
> >     "numeric","numeric","numeric","numeric","numeric")
> > > x <- scan(file = "C:/temp/data.dbf", what = as.list(coltypes),
> >     sep = ",", quiet = TRUE, skip = 1)
> > > names(x) <- scan(file = "C:/temp/data.dbf", what = "", nlines = 1,
> >     sep = ",")
> > > x <- as.data.frame(x)
> >
> > This runs without error, but x has no data in it:
> > > x
> > [1] X._.   NA.    NA..1  NA..2  NA..3  NA..4  NA..5  NA..6  NA..7  NA..8
> >     NA..9  NA..10 NA..11
> > [14] NA..12 NA..13 NA..14 NA..15 NA..16
> > <0 rows> (or 0-length row.names)
> >
> > Please let me know how to use scan() or the colClasses option properly.
> >
> > Sachin
> >
> >
> >
> >
> >
> > Mark Stephens <markjs1 at googlemail.com> wrote:
> >
> > Sachin,
> > With your dummies stored as integer, the size of your object would
> > appear to be 350000 * (4*250 + 8*16) bytes = 376MB. You said "PC" but
> > did not provide R version information, so assuming Windows then ...
> > With 1GB RAM you should be able to load a 376MB object into memory. If
> > you can store the dummies as 'raw' then the object size is only 126MB.
> > You don't say how you attempted to load the data. Assuming your input
> > data is in a text file (or can be), have you tried scan()? Set up the
> > 'what' argument with length 266 and make sure the dummy columns are set
> > to integer() or raw(). Then x <- scan(...); class(x) <- "data.frame".
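> > A rough sketch of that arithmetic and a scan() call along those lines
> > (untested; the file name, separator and column order are placeholders):
> >
> > 350000 * (4*250 + 8*16) / 1024^2   # ~376 MB with integer dummies
> > 350000 * (1*250 + 8*16) / 1024^2   # ~126 MB with raw dummies
> > what <- c(as.list(rep(double(1), 16)), as.list(rep(integer(1), 250)))
> > x <- scan("C:/temp/data.csv", what = what, sep = ",", skip = 1)
> > names(x) <- scan("C:/temp/data.csv", what = "", nlines = 1, sep = ",")
> > x <- as.data.frame(x)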
> > What is the result of memory.limit()? If it is 256MB or 512MB, then try
> > starting R with --max-mem-size=800M (I forget the exact syntax). Leave a
> > bit of room below 1GB. Once the object is in memory R may need to copy
> > it once, or a few times. You may need to close all other apps in memory,
> > or send them to swap.
> > I don't really see why your data should not fit into the memory you
> > have. Purchasing an extra 1GB may help. Knowing the object size
> > calculation (as above) should help you gauge whether it is worth it.
> > Have you used a process monitor to watch the memory grow as R loads the
> > data? This can be useful.
> > If all the above fails, then consider 64-bit and purchasing as much
> > memory as you can afford. R can use 64GB+ of RAM on 64-bit machines.
> > Maybe you can hire some time on a 64-bit server farm - I heard it's
> > quite cheap but have never tried it myself. You shouldn't need to go
> > that far with this data set though.
> > Hope this helps,
> > Mark
> >
> >
> > Hi Roger,
> >
> > I want to carry out regression analysis on this dataset. So I believe
> > I can't read the dataset in chunks. Any other solution?
> >
> > TIA
> > Sachin
> >
> >
> > Roger Koenker <rkoenker at uiuc.edu> wrote:
> > You can read chunks of it at a time and store it in sparse matrix form
> > using the packages SparseM or Matrix, but then you need to think about
> > what you want to do with it.... least squares sorts of things are ok,
> > but other options are somewhat limited...
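> >
> > A rough sketch of the chunked idea for least squares, accumulating X'X
> > and X'y chunk by chunk rather than holding everything in memory
> > (untested; the file name, chunk size and coltypes are placeholders, and
> > it assumes the response is in column 1 and the dummies are already
> > numeric 0/1):
> >
> > con <- file("C:/temp/data.csv", open = "r")
> > hdr <- strsplit(readLines(con, n = 1), ",")[[1]]   # header line
> > XtX <- 0; Xty <- 0
> > repeat {
> >     chunk <- tryCatch(read.table(con, sep = ",", nrows = 50000,
> >                       colClasses = coltypes, col.names = hdr),
> >                       error = function(e) NULL)    # NULL when input runs out
> >     if (is.null(chunk)) break
> >     X <- as.matrix(cbind(1, chunk[, -1]))          # intercept + predictors
> >     y <- chunk[[1]]
> >     XtX <- XtX + crossprod(X)                      # accumulate X'X
> >     Xty <- Xty + crossprod(X, y)                   # accumulate X'y
> > }
> > close(con)
> > beta <- solve(XtX, Xty)                            # least-squares coefficients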
> >
> >
> > url:   www.econ.uiuc.edu/~roger    Roger Koenker
> > email: rkoenker at uiuc.edu         Department of Economics
> > vox:   217-333-4558                University of Illinois
> > fax:   217-244-6678                Champaign, IL 61820
> >
> >
> > On Apr 24, 2006, at 12:41 PM, Sachin J wrote:
> >
> > > Hi,
> > >
> > > I have a dataset consisting of 350,000 rows and 266 columns. Out of
> > > the 266 columns, 250 are dummy variable columns. I am trying to read
> > > this data set into an R data frame object but am unable to do so due
> > > to memory size limitations (the object created is too large to handle
> > > in R). Is there a way to handle such a large dataset in R?
> > >
> > > My PC has 1GB of RAM and 55GB of hard disk space, running Windows XP.
> > >
> > > Any pointers would be of great help.
> > >
> > > TIA
> > > Sachin
> > >