Hello all.

I have a large .txt file whose variables are in fixed columns, i.e.,
variable V1 occupies columns 1 to 7, V2 columns 8 to 23, etc.
It is a 60GB file with 90 variables and 60 million observations.

I'm working on a Pentium 4, 1GB RAM, Windows XP Pro.
I tried the following code just to see whether I could work with 2 variables,
but it seems it is not possible:

R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.2.1 (2005-12-20 r36812)
ISBN 3-900051-07-0

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 169011  4.6     350000  9.4   350000  9.4
Vcells  62418  0.5     786432  6.0   289957  2.3
> memory.limit(size=4090)
NULL
> memory.limit()
[1] 4288675840
> system.time(a<-matrix(runif(1e6),nrow=1))
[1] 0.28 0.02 2.42   NA   NA
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  171344  4.6     350000  9.4   350000  9.4
Vcells 1063212  8.2    3454398 26.4  4063230 31.0
> rm(a)
> ls()
character(0)
> system.time(a<-matrix(runif(60e6),nrow=1))
Error: cannot allocate vector of size 468750 Kb
Timing stopped at: 7.32 1.95 83.55 NA NA
> memory.limit(size=5000)
Error in memory.size(size) : .....4GB

So my questions are:
1) (newbie) How can I read fixed-column text files like this?
2) Is there a way to analyze (statistics like correlations, clustering, etc.)
such a large data set without increasing RAM and without moving to a 64-bit
machine, while still using R and not resorting to a sample? How?

Thanks in advance.

Rogerio.
Rogerio Porto wrote:
> So my questions are:
> 1) (newbie) How can I read fixed-column text files like this?
> 2) Is there a way to analyze (statistics like correlations, clustering, etc.)
> such a large data set without increasing RAM and without moving to a 64-bit
> machine, while still using R and not resorting to a sample? How?

Use what you are already suggesting in your subject line: a database. Then you
can access the variables separately and you have no problem reading the file.
Even with a real database you are near the limit if you want to compute on all
60 million observations of a single numeric variable (~500 MB) at once, so
this only works if you do not need several variables in memory at the same
time; whether that is enough depends on the methods you are going to apply.

Uwe Ligges
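To make the database suggestion concrete, here is a minimal sketch of loading
the file into SQLite in chunks and summarizing it with SQL, assuming the
RSQLite package is installed; the file name, database name, table name, chunk
size, and the restriction to V1 and V2 (columns 1-7 and 8-23, assumed numeric)
are illustrative placeholders, not part of the original exchange.

library(RSQLite)                        # assumes RSQLite is installed

db  <- dbConnect(SQLite(), dbname = "bigdata.sqlite")
txt <- file("bigfile.txt", open = "r")  # read the 60GB file sequentially

chunk <- 100000                         # lines per chunk; tune to fit in 1GB RAM
repeat {
  lines <- readLines(txt, n = chunk)    # next chunk of raw fixed-width lines
  if (length(lines) == 0) break
  dat <- data.frame(V1 = as.numeric(substring(lines, 1, 7)),   # cols 1-7
                    V2 = as.numeric(substring(lines, 8, 23)))  # cols 8-23
  dbWriteTable(db, "obs", dat, append = TRUE)
}
close(txt)

## summaries are computed in SQL, never holding all 60 million rows in RAM
dbGetQuery(db, "SELECT AVG(V1), AVG(V2) FROM obs")
dbDisconnect(db)

Because only one chunk is ever held in memory, the 1GB machine never has to
hold the full data set; statistics can then be pulled variable by variable, or
via SQL aggregates as above.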
Rogerio Porto wrote:
> 1) (newbie) How can I read fixed-column text files like this?

You can read fixed-width files with read.fwf(). But a rough calculation
(60 million rows x 90 variables x 8 bytes per double) says that your data set
would require roughly 40GB of RAM, so I don't think you'll be able to read the
entire thing into R. Maybe look at a subset?

-roger
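As a concrete illustration of read.fwf() on a subset (a sketch only: the file
name and row count are placeholders, and only the V1/V2 layout stated in the
original post is assumed):

## read just V1 (cols 1-7) and V2 (cols 8-23) for the first 100,000 rows;
## characters beyond the widths given are not read, so the other 88
## variables never enter memory
sub <- read.fwf("bigfile.txt",
                widths    = c(7, 16),
                col.names = c("V1", "V2"),
                n         = 100000)      # stop after 100,000 records
cor(sub$V1, sub$V2)                      # assumes both variables are numeric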