Hello all.

I have a large .txt file whose variables are in fixed columns, i.e.,
variable V1 occupies columns 1 to 7, V2 columns 8 to 23, etc.
It is a 60GB file with 90 variables and 60 million observations.

I'm working on a Pentium 4, 1GB RAM, Windows XP Pro.
I tried the following code just to see whether I could work with 2 variables,
but it seems it is not possible:

R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.2.1 (2005-12-20 r36812)
ISBN 3-900051-07-0

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 169011  4.6     350000  9.4   350000  9.4
Vcells  62418  0.5     786432  6.0   289957  2.3
> memory.limit(size=4090)
NULL
> memory.limit()
[1] 4288675840
> system.time(a<-matrix(runif(1e6),nrow=1))
[1] 0.28 0.02 2.42   NA   NA
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  171344  4.6     350000  9.4   350000  9.4
Vcells 1063212  8.2    3454398 26.4  4063230 31.0
> rm(a)
> ls()
character(0)
> system.time(a<-matrix(runif(60e6),nrow=1))
Error: cannot allocate vector of size 468750 Kb
Timing stopped at: 7.32 1.95 83.55 NA NA
> memory.limit(size=5000)
Error in memory.size(size) : .....4GB

So my questions are:
1) (newbie) How can I read fixed-column text files like this?
2) Is there a way to analyze (statistics like correlations, clustering, etc.)
such a large data set without increasing RAM and without moving to a 64-bit
machine, while still using R and not resorting to a sample? How?

Thanks in advance.

Rogerio.
Rogerio Porto wrote:
> So my questions are:
> 1) (newbie) How can I read fixed-column text files like this?
> 2) Is there a way to analyze (statistics like correlations, clustering, etc.)
> such a large data set without increasing RAM and without moving to a 64-bit
> machine, while still using R and not resorting to a sample? How?

Use what you are already suggesting in your subject line: a database. Then you
can access the variables separately and you have no problem reading the file.
Even with a real database you are near the limit if you want to compute on all
60 million observations of a single numeric variable (~500 MB) at once, so
this only works if you do not need several variables in memory at the same
time; whether that is enough depends on the methods you are going to apply.

Uwe Ligges
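To make the database suggestion concrete, here is a minimal sketch of loading
the file into SQLite in chunks and summarizing it with SQL, assuming the
RSQLite package is installed; the file name, database name, table name, chunk
size, and the restriction to V1 and V2 (columns 1-7 and 8-23, assumed numeric)
are illustrative placeholders, not part of the original exchange.

library(RSQLite)                        # assumes RSQLite is installed

db  <- dbConnect(SQLite(), dbname = "bigdata.sqlite")
txt <- file("bigfile.txt", open = "r")  # read the 60GB file sequentially

chunk <- 100000                         # lines per chunk; tune to fit in 1GB RAM
repeat {
  lines <- readLines(txt, n = chunk)    # next chunk of raw fixed-width lines
  if (length(lines) == 0) break
  dat <- data.frame(V1 = as.numeric(substring(lines, 1, 7)),   # cols 1-7
                    V2 = as.numeric(substring(lines, 8, 23)))  # cols 8-23
  dbWriteTable(db, "obs", dat, append = TRUE)
}
close(txt)

## summaries are computed in SQL, never holding all 60 million rows in RAM
dbGetQuery(db, "SELECT AVG(V1), AVG(V2) FROM obs")
dbDisconnect(db)

Because only one chunk is ever held in memory, the 1GB machine never has to
hold the full data set; statistics can then be pulled variable by variable, or
via SQL aggregates as above.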
Rogerio Porto wrote:
> 1) (newbie) How can I read fixed-column text files like this?

You can read fixed-width files with read.fwf(). But a rough calculation
(60 million rows x 90 variables x 8 bytes per double) says that your data set
would require roughly 40GB of RAM, so I don't think you'll be able to read the
entire thing into R. Maybe look at a subset?

-roger
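As a concrete illustration of read.fwf() on a subset (a sketch only: the file
name and row count are placeholders, and only the V1/V2 layout stated in the
original post is assumed):

## read just V1 (cols 1-7) and V2 (cols 8-23) for the first 100,000 rows;
## characters beyond the widths given are not read, so the other 88
## variables never enter memory
sub <- read.fwf("bigfile.txt",
                widths    = c(7, 16),
                col.names = c("V1", "V2"),
                n         = 100000)      # stop after 100,000 records
cor(sub$V1, sub$V2)                      # assumes both variables are numeric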