Hello everyone,

I currently run R code that has to read 100 or more large csv files (>= 100 MB each), and usually write csv files too. My colleagues and I like R very much, but we are a little astonished by how slow these functions are. We have looked at every argument of these functions, and while specifying some parameters helps a bit, it is still too slow. I am sure a lot of people have the same problem, so I thought one of you would know a trick or a package that would speed this up a lot.

(We work on Linux Red Hat with R 2.10.0, but I guess that does not matter for this problem.)

Thanks for reading this.
Have a nice weekend.
On Sun, Sep 26, 2010 at 8:38 AM, statquant2 <statquant at gmail.com> wrote:
> Hello everyone,
> I currently run R code that has to read 100 or more large csv files (>= 100 MB),
> and usually write csv too.
> [...]
> Thanks for reading this.
> Have a nice weekend.

You could try read.csv.sql in the sqldf package:

http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql

See ?read.csv.sql in sqldf. It uses RSQLite and SQLite to read the file into an SQLite database (which it sets up for you), completely bypassing R's own csv parsing, and from there grabs the data into R, removing the database it created at the end (a short usage sketch is appended below the signature).

There are also CSVREAD and CSVWRITE SQL functions in the H2 database, which is also supported by sqldf, although I have never checked their speed:

http://code.google.com/p/sqldf/#10.__What_are_some_of_the_differences_between_using_SQLite_and_H

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
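A minimal sketch of the sqldf approach described above, purely illustrative: the file name "myfile.csv" and the column name "value" are assumptions, not taken from the thread.

library(sqldf)

## read.csv.sql loads the csv into a temporary SQLite database that it
## creates for you, pulls the result back into an R data frame, and then
## removes the database again.
DF <- read.csv.sql("myfile.csv")

## A filter can be pushed into SQL so that only matching rows ever reach R
## (the column name "value" is hypothetical):
DF2 <- read.csv.sql("myfile.csv",
                    sql = "select * from file where value > 0")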
On 26.09.2010 14:38, statquant2 wrote:
> Hello everyone,
> I currently run R code that has to read 100 or more large csv files (>= 100 MB),
> and usually write csv too.
> [...]
> Thanks for reading this.
> Have a nice weekend.

Most of us read the csv file and write an .RData file at once (see ?save). Then we can read the data in much more quickly after it has been imported once with read.csv and friends (a short sketch is appended below).

Uwe Ligges
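A minimal sketch of this import-once workflow; the file names are illustrative.

## First run: parse the csv once, then store the data frame in R's
## binary format.
DF <- read.csv("myfile.csv")
save(DF, file = "myfile.RData")

## Later runs: load() restores the object DF directly, which is much
## faster than re-parsing the csv.
load("myfile.RData")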
Hi, after testing:

R) system.time(read.csv("myfile.csv"))
   user  system elapsed
  1.126   0.038   1.177

R) system.time(read.csv.sql("myfile.csv"))
   user  system elapsed
  1.405   0.025   1.439
Warning messages:
1: closing unused connection 4 ()
2: closing unused connection 3 ()

It seems that the function is less efficient than the base one ... so ...
On Tue, Sep 28, 2010 at 1:24 PM, statquant2 <statquant at gmail.com> wrote:
> Hi, after testing:
> R) system.time(read.csv("myfile.csv"))
>    user  system elapsed
>   1.126   0.038   1.177
> [...]
> It seems that the function is less efficient than the base one ... so ...

The benefit comes with larger files. With small files there is not much point in speeding things up, since the absolute time is already small. I suggest you look at the benchmarks on the sqldf home page, where a couple of users benchmarked larger files (a rough sketch for reproducing such a comparison is appended below the signature). Since sqldf was intended for convenience rather than performance, I was as surprised as anyone when several users independently noticed that sqldf ran several times faster than unoptimized R code.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
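A rough sketch of how such a comparison on a larger file could be reproduced; the row count and column layout below are assumptions, not figures from the sqldf benchmarks.

library(sqldf)

## Write a larger csv (about a million rows) to benchmark against.
n <- 1e6
DF <- data.frame(x = rnorm(n), y = rnorm(n),
                 g = sample(letters, n, replace = TRUE))
write.csv(DF, "big.csv", row.names = FALSE, quote = FALSE)

## Compare the base reader with the SQLite-backed one.
system.time(a <- read.csv("big.csv"))
system.time(b <- read.csv.sql("big.csv"))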