bogdan romocea
2006-Jan-05 20:26 UTC
[R] Suggestion for big files [was: Re: A comment about R:]
ronggui wrote:
> If I were familiar with database software, using a database (and R)
> would be the best choice, but converting the file into database
> format is not an easy job for me.

Good working knowledge of a DBMS is invaluable when it comes to working
with very large data sets. In addition, learning SQL is a piece of cake
compared to learning R. On top of that, knowledge of yet another
scripting language (beyond SQL) is not needed, except perhaps for
special tasks: you can easily use R to generate the SQL syntax needed
to import and work with arbitrarily wide tables. (I'm not familiar with
SQLite, but MySQL comes with a command line tool that can run syntax
files.) Better start learning SQL today.
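As a rough sketch of that R-generates-the-SQL idea (not code from this
thread: the file name "bigfile.csv", the table name "bigtable" and the
all-TEXT column types are made-up placeholders):

hdr  <- strsplit(readLines("bigfile.csv", n = 1), ",")[[1]]  # column names from the csv header
hdr  <- gsub('"', "", hdr)                                   # drop any quoting around the names
cols <- paste(hdr, "TEXT", collapse = ", ")
sql  <- c(paste("CREATE TABLE bigtable (", cols, ");"),
          paste("LOAD DATA INFILE 'bigfile.csv' INTO TABLE bigtable",
                "FIELDS TERMINATED BY ',' IGNORE 1 LINES;"))
writeLines(sql, "import.sql")                                # a syntax file for the mysql client

The resulting import.sql could then be run with the MySQL command line
tool, e.g. mysql mydb < import.sql.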
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of ronggui
> Sent: Thursday, January 05, 2006 12:48 PM
> To: jim holtman
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Suggestion for big files [was: Re: A comment about R:]
>
> 2006/1/6, jim holtman <jholtman at gmail.com>:
> > If what you are reading in is numeric data, then it would require
> > (807 * 118519 * 8 bytes) roughly 760MB just to store a single copy
> > of the object -- more memory than you have on your computer. If
> > you were reading it in, then the problem is the paging that was
> > occurring.
>
> In fact, if I read it in 3 pieces, each is about 170M.
>
> > You have to look at storing this in a database and working on a
> > subset of the data. Do you really need to have all 807 variables
> > in memory at the same time?
>
> Yes, I don't need all the variables, but I don't know how to get only
> the necessary variables into R.
>
> In the end I read the data in pieces and used the RSQLite package to
> write them to a database, then did the analysis. If I were familiar
> with database software, using a database (and R) would be the best
> choice, but converting the file into database format is not an easy
> job for me. I asked for help on the SQLite list, but the solution was
> not satisfying, as it required knowledge of a third scripting
> language. After searching the internet, I came up with this solution:
>
> #begin
> rm(list = ls())
> library(RSQLite)
>
> f <- file("D:/wvsevs_sb_v4.csv", "r")       # forward slashes (or "\\") avoid invalid escapes
> con <- dbConnect("SQLite", "c:/sqlite/database.db3")
> i <- 0
> done <- FALSE
> tim1 <- Sys.time()
>
> while (!done) {
>   i <- i + 1
>   tt <- readLines(f, 2500)                  # read the csv 2500 lines at a time
>   if (length(tt) < 2500) done <- TRUE       # a short chunk means end of file
>   tt <- textConnection(tt)
>   if (i == 1) {
>     dat <- read.table(tt, header = TRUE, sep = ",", quote = "")
>     nms <- names(dat)                       # remember the header for later chunks
>   } else {
>     dat <- read.table(tt, header = FALSE, sep = ",", quote = "", col.names = nms)
>   }
>   close(tt)
>   if (dbExistsTable(con, "wvs")) {
>     dbWriteTable(con, "wvs", dat, append = TRUE)
>   } else {
>     dbWriteTable(con, "wvs", dat)
>   }
> }
> close(f)
> #end
>
> It's not the best solution, but it works.
>
> > If you use 'scan', you could specify that you do not want some of
> > the variables read in, so it might make a more reasonably sized
> > object.
> >
> > On 1/5/06, François Pinard <pinard at iro.umontreal.ca> wrote:
> > > [ronggui]
> > >
> > > > R is weak when handling large data files. I have a data file
> > > > with 807 vars and 118519 obs in CSV format. Stata can read it
> > > > in in 2 minutes, but on my PC R can hardly handle it. My PC's
> > > > CPU is 1.7G; RAM 512M.
> > >
> > > Just (another) thought. I used to use SPSS, many, many years
> > > ago, on CDC machines, where the CPU had limited memory and no
> > > kind of paging architecture. Files did not need to be very large
> > > for being too large.
> > >
> > > SPSS had a feature that was then useful: the capability of
> > > sampling a big dataset directly at file read time, before
> > > processing starts. Maybe something similar could help in R (that
> > > is, instead of reading the whole data into memory, _then_
> > > sampling it).
> > >
> > > One can read records from a file, up to a preset number of them.
> > > If the file happens to contain more records than that preset
> > > number (the number of records in the whole file is not known
> > > beforehand), already-read records may be dropped at random and
> > > replaced by other records coming from the file being read. If
> > > the random selection algorithm is properly chosen, it can be
> > > made so that all records in the original file have equal
> > > probability of being kept in the final subset.
> > >
> > > If such a sampling facility were built right into the usual R
> > > reading routines (triggered by an extra argument, say), it could
> > > offer a compromise for processing large files, and also
> > > sometimes accelerate computations for big problems, even when
> > > memory is not at stake.
> > >
> > > --
> > > François Pinard   http://pinard.progiciels-bpi.ca
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 247 0281
> >
> > What is the problem you are trying to solve?
>
> --
> ronggui
> Department of Sociology
> Fudan University
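The drop-and-replace scheme Pinard describes is essentially reservoir
sampling, which can already be done in user code while reading a file
line by line. A minimal sketch of the idea (the function name, file
name and sample size are made up for illustration, and a header line
would need separate handling):

sample_lines <- function(filename, n) {
  con <- file(filename, "r")
  on.exit(close(con))
  reservoir <- readLines(con, n)        # the first n records fill the reservoir
  seen <- length(reservoir)
  repeat {
    rec <- readLines(con, 1)
    if (length(rec) == 0) break         # end of file
    seen <- seen + 1
    j <- sample(seen, 1)                # keep this record with probability n/seen
    if (j <= n) reservoir[j] <- rec     # ...replacing a uniformly chosen slot
  }
  reservoir
}
## e.g. sub <- sample_lines("bigfile.csv", 10000)

Every record ends up in the sample with equal probability, and the
whole file never has to be held in memory.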
Neuro LeSuperHéros
2006-Jan-05 23:11 UTC
[R] Suggestion for big files [was: Re: A comment about R:]
Ronggui,

I'm not familiar with SQLite, but using MySQL would solve your problem.
MySQL has a "LOAD DATA INFILE" statement that loads text/csv files
rapidly.

In R, assuming the target table already exists in MySQL (an empty
table is fine), something like this would load the data directly into
MySQL:

library(DBI)
library(RMySQL)
# connect first (user, password and dbname are placeholders)
mycon <- dbConnect(MySQL(), user = "user", password = "password", dbname = "test")
# bulk-load the csv file straight into the existing table
dbSendQuery(mycon, "LOAD DATA INFILE 'C:/textfile.csv' INTO TABLE test3
                    FIELDS TERMINATED BY ','")   # for csv files

Then a normal SQL query would allow you to work with a manageable
subset of the data (see the sketch at the end of this message).

>From: bogdan romocea <br44114 at gmail.com>
>To: ronggui.huang at gmail.com
>CC: r-help <R-help at stat.math.ethz.ch>
>Subject: Re: [R] Suggestion for big files [was: Re: A comment about R:]
>Date: Thu, 5 Jan 2006 15:26:51 -0500
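A minimal sketch of that follow-up query step, pulling only the needed
columns back into R from the test3 table loaded above (the column names
v1, v2, v7, the credentials and the row limit are placeholders):

library(DBI)
library(RMySQL)
mycon <- dbConnect(MySQL(), user = "user", password = "password", dbname = "test")
# fetch only the columns (and rows) actually needed for the analysis
dat <- dbGetQuery(mycon, "SELECT v1, v2, v7 FROM test3 LIMIT 50000")
str(dat)
dbDisconnect(mycon)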