Folks:
I am trying to read in a large file. Definition of large is:
Number of lines: 333,250
Size: 850 MB
The machine is a dual-core Intel, with 4 GB RAM and nothing else running on it.
I read the previous threads on read.fwf and did not see any conclusive
statements on how to read such files quickly. An example record and the R code
are given below. I was hoping to purchase a better machine and do analysis with
larger datasets, but these preliminary results do not look good.
Does anyone have any experience with large files (> 1 GB) and using them with
Revolution R?
Thanks.
Satish
Example Code
key_vec   <- c(1, 3, 3, 4, 2, 8, 8, 2, 2, 3, 2, 2, 1, 3, 3, 3, 3, 9)
key_names <- c("allgeo", "area1", "zone", "dist", "ccust1", "whse", "bindc",
               "ccust2", "account", "area2", "ccust3", "customer", "allprod",
               "cat", "bu", "class", "size", "bdc")
key_info  <- data.frame(key_vec, key_names)

col_names   <- c(key_names, sas_time$week)   # sas_time is defined elsewhere
num_buckets <- rep(12, 209)                  # 209 weekly buckets, 12 chars each
width_vec   <- c(key_vec, num_buckets)
col_classes <- c(rep("factor", 18), rep("numeric", 209))

# Test read of the first 100 records:
# threewkoutstat <- read.fwf(file = "3wkoutstatfcst_file02.dat",
#                            widths = width_vec, header = FALSE,
#                            colClasses = col_classes, n = 100)

threewkoutstat <- read.fwf(file = "3wkoutstatfcst_file02.dat",
                           widths = width_vec, header = FALSE,
                           colClasses = col_classes)
names(threewkoutstat) <- col_names
Example record (only one record pasted below)
A004001003799000049250000492599990049999A001002002015002015009        0.00
        0.00        0.00        0.00        0.00        0.00        0.00
[... 209 twelve-character weekly buckets in all, wrapped here over many lines;
nearly all are 0.00, except for 0.60  0.60  0.60  0.70 partway through ...]
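For what it is worth, much of read.fwf's overhead comes from the fact that it
splits every line and writes a temporary delimited file before handing it to
read.table. Below is a sketch of a base-R alternative using readLines() and
substring(), reusing key_vec, num_buckets and col_names from the code above; it
is untested against the real file, so treat it as a starting point only.

## Sketch of a base-R alternative to read.fwf(): read the raw lines and cut
## the fixed-width fields with substring().  Nothing here has been verified
## against the poster's data.
widths <- c(key_vec, num_buckets)
ends   <- cumsum(widths)              # last character position of each field
starts <- ends - widths + 1           # first character position of each field

txt <- readLines("3wkoutstatfcst_file02.dat")   # ~850 MB of text in RAM

## One column per field; the result is a character matrix (one row per record),
## which is itself large for data this wide
fields <- vapply(seq_along(widths),
                 function(i) substring(txt, starts[i], ends[i]),
                 character(length(txt)))

threewkoutstat <- as.data.frame(fields, stringsAsFactors = FALSE)
names(threewkoutstat) <- col_names
## Convert the 209 weekly buckets (columns 19-227) to numeric
threewkoutstat[19:227] <- lapply(threewkoutstat[19:227], as.numeric)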
Folks:
Suppose I divide the USA into 16 regions. My end goal is to run data mining /
analysis on each of these 16 regions. The data for each region (sales,
forecast, etc.) will be in the range of 10-20 GB, and at any one time I will
need to load, say, 15 GB into R and then do the analysis. Is this something
other R users are doing? Or is it better to switch to SAS? Could you help me
with any information on this?
Thanks.
Satish
Folks:
Can anyone throw some light on this?
Thanks.
Satish
Where should we shine it? No information provided on operating system, version,
memory, size of the files, what you want to do with them, etc. There are lots
of options: put the data in a database, read a partial file (lines and/or
columns), preprocess it, etc. Your option.
On Fri, Feb 5, 2010 at 8:03 AM, Satish Vadlamani
<SATISH.VADLAMANI at fritolay.com> wrote:
> Folks:
> Can anyone throw some light on this?
> Thanks.
> Satish
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
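To make the "read a partial file" option above concrete, here is a minimal
sketch that streams the fixed-width file in chunks with readLines() and keeps
only a small running summary, so the whole 850 MB never has to sit in memory at
once. The chunk size and the particular summary (the sum of the first weekly
bucket) are illustrative assumptions; the widths come from the code in the
first post.

## Sketch: chunked pass over the fixed-width file, accumulating a summary only
widths <- c(key_vec, num_buckets)
ends   <- cumsum(widths)
starts <- ends - widths + 1

con <- file("3wkoutstatfcst_file02.dat", open = "r")
n_rows <- 0
week1_total <- 0
repeat {
  chunk <- readLines(con, n = 50000)          # 50k records at a time
  if (length(chunk) == 0) break               # end of file reached
  ## field 19 is the first 12-character weekly bucket
  week1 <- as.numeric(substring(chunk, starts[19], ends[19]))
  week1_total <- week1_total + sum(week1, na.rm = TRUE)
  n_rows <- n_rows + length(chunk)
}
close(con)
c(records = n_rows, week1_total = week1_total)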
On Thu, Feb 4, 2010 at 5:27 PM, Vadlamani, Satish {FLNA}
<SATISH.VADLAMANI at fritolay.com> wrote:
> Folks:
> I am trying to read in a large file. Definition of large is:
> Number of lines: 333,250
> Size: 850 MB
Perhaps this post by JD Long will provide an example that is suitable
to your situation:
http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/
Hope it helps!
-Charlie
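One approach often recommended for delimited files of this size is to let the
sqldf package stage the data in a temporary on-disk SQLite database and pull
only the rows or columns you actually need into R. A minimal sketch, assuming
the fixed-width file has first been exported to CSV; the file name, column name
and filter are hypothetical.

## Sketch only -- requires install.packages("sqldf"); file/column names are
## hypothetical.  sqldf loads the CSV into a temporary SQLite database and
## returns just the query result, so R never holds the full table.
library(sqldf)

east_zone <- read.csv.sql("3wkoutstatfcst_file02.csv",
                          sql = "select * from file where zone = '001'",
                          header = TRUE, sep = ",")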
Hello,
Do you need /all/ the data in memory at one time? Is your goal to
divide the data (e.g. according to some factor, or some function of
the columns of the data set) and then analyze the divisions? And then,
possibly, combine the results?
If so, you might consider using Rhipe. We have analyzed data (e.g.
computed regression parameters, applied algorithms) across subsets,
where the subsets are created according to some condition.
Using this approach (and a cluster of 8 machines, 72 cores) we have
successfully analyzed data sets ranging from 14 GB to ~140 GB.
This all assumes that your divisions are suitably small. I notice
you mention that each region is 10-20 GB and that you want to compute on
/all/ of it, i.e. you need all of it in memory. If so, Rhipe cannot help you.
Regards
Saptarshi
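For readers unfamiliar with the pattern Saptarshi describes, a minimal
in-memory sketch of the same divide / compute-per-subset / recombine idea in
plain R may help. This is not Rhipe (Rhipe's contribution is to distribute
these steps across a Hadoop cluster); the data frame, grouping factor and model
are invented purely for illustration.

## Not Rhipe -- just the divide / compute / recombine pattern it distributes,
## shown on a small made-up data frame so the shape of the workflow is clear.
set.seed(1)
sales_df <- data.frame(region   = rep(c("NE", "SE", "MW", "W"), each = 250),
                       forecast = runif(1000, 0, 100))
sales_df$sales <- 0.9 * sales_df$forecast + rnorm(1000, sd = 5)

pieces   <- split(sales_df, sales_df$region)                    # divide
fits     <- lapply(pieces, function(d)                          # compute
                     coef(lm(sales ~ forecast, data = d)))
combined <- do.call(rbind, fits)                                # recombine
combined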
On Thu, Feb 4, 2010 at 8:27 PM, Vadlamani, Satish {FLNA}
<SATISH.VADLAMANI at fritolay.com> wrote:
> Folks:
> I am trying to read in a large file. Definition of large is:
> Number of lines: 333,250
> Size: 850 MB
> [...]