Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to insert a very large CSV file into a SQLite database. I'm pretty new to working with databases in R, so I apologize if I'm overlooking something obvious here.

I'm trying to work with the American Community Survey data, which comes as two 1.3GB CSV files. I have enough RAM to read one of them into memory, but not both at the same time. So, in order to analyze them, I'm trying to get them into a SQLite database so I can use the R survey package's database-backed survey objects capabilities (http://faculty.washington.edu/tlumley/survey/svy-dbi.html).

I need to combine both of these CSV files into one table (within a database), so I think I need a SQL manipulation technique that reads everything line by line instead of pulling it all into memory.

I've tried using read.csv.sql, but it finishes without an error and then only shows me the table structure when I run the final select statement. When I run these exact same commands on a smaller CSV file, they work fine. I imagine this is not working because the CSV is so large, but I'm not sure how to confirm that or what to change if it is. I do want to get all columns from the CSV into the database table, so I don't want to filter anything.

library(sqldf)
setwd("R:\\American Community Survey\\Data\\2009")
sqldf("attach 'sqlite' as new")
read.csv.sql("ss09pusa.csv",
    sql = "create table ss09pusa as select * from file",
    dbname = "sqlite")
sqldf("select * from ss09pusa limit 3", dbname = "sqlite")

I've also tried using the SQL IMPORT command, which I couldn't get working properly, even on a tiny two-field, five-row CSV file.

library(RSQLite)
setwd("R:\\American Community Survey\\Data\\2009")
in_csv <- file("test.csv")
out_db <- dbConnect(SQLite(), dbname = "sqlite.db")
dbGetQuery(out_db, "create table test (hello integer, world text)")
dbGetQuery(out_db, "import in_csv test")

Any advice would be sincerely appreciated. Thanks!

Anthony Damico
Kaiser Family Foundation
Anthony-107 wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to
> insert a very large CSV file into a SQLite database.

Better to use an external utility if this is a one-time import for this job:
http://sqlitebrowser.sourceforge.net/

Dieter
Gabor Grothendieck
2010-Nov-15 15:41 UTC
[R] How to Read a Large CSV into a Database with R
On Mon, Nov 15, 2010 at 10:07 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to
> insert a very large CSV file into a SQLite database. I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.
>
> I'm trying to work with the American Community Survey data, which is two
> 1.3GB csv files. I have enough RAM to read one of them into memory, but not
> both at the same time. So, in order to analyze them, I'm trying to get them
> into a SQLite database so I can use the R survey package's database-backed
> survey objects capabilities
> (http://faculty.washington.edu/tlumley/survey/svy-dbi.html).
>
> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that reads
> everything line by line, instead of pulling it all into memory.
>
> I've tried using read.csv.sql, but it finishes without an error and then
> only shows me the table structure when I run the final select statement.
> When I run these exact same commands on a smaller CSV file, they work fine.
> I imagine this is not working because the csv is so large, but I'm not sure
> how to confirm that or what to change if it is. I do want to get all
> columns from the CSV into the data table, so I don't want to filter
> anything.
>
> library(sqldf)
> setwd("R:\\American Community Survey\\Data\\2009")
> sqldf("attach 'sqlite' as new")
> read.csv.sql("ss09pusa.csv",
>     sql = "create table ss09pusa as select * from file",
>     dbname = "sqlite")
> sqldf("select * from ss09pusa limit 3", dbname = "sqlite")

What the above code does, which is unlikely to be what you intended, is the following. It creates an SQLite database called 'sqlite', reads the indicated file into that database, reads the table from SQLite back into R (this step will clearly fail if the data is too big for R, but if it is not, you are OK), and then deletes the table from the database. As a result, your final sqldf statement should give an error, since there is no such table; or, if you happen to have a data frame in your R workspace called ss09pusa, that sqldf statement will load it into a database table, retrieve its first three rows, and then delete the table.

This sort of task is probably more suitable for RSQLite than sqldf, but if you wish to do it with sqldf you need to follow Example 9 or Example 10 on the sqldf home page.

In Example 9, http://code.google.com/p/sqldf/#Example_9.__Working_with_Databases, it is very important to note that sqldf automatically deletes any table that it created once the sqldf or read.csv.sql statement is done. The way to keep the table from being dropped is to make sure the table is created by an SQL statement that you issue yourself, "create table mytab as select ...", rather than by sqldf.

In Example 10, http://code.google.com/p/sqldf/#Example_10._Persistent_Connections, persistent connections are illustrated, which are an alternative way to do this in sqldf.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
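For concreteness, the Example 9 pattern is essentially the attach-then-create-table route the original post already follows: what keeps the table around is the explicit "create table ... as select" statement, not a table that sqldf uploads on its own. A minimal sketch, with 'acs.sqlite' as a placeholder database name (the authoritative version is on the sqldf page):

library(sqldf)

# create (or attach) a file-backed SQLite database
sqldf("attach 'acs.sqlite' as new")

# read the csv straight into a table inside that database; because the table
# is created by the explicit "create table ... as select" in the sql argument,
# sqldf does not drop it when the call finishes
read.csv.sql("ss09pusa.csv",
    sql = "create table ss09pusa as select * from file",
    dbname = "acs.sqlite")

# later calls query the same database file
sqldf("select count(*) from ss09pusa", dbname = "acs.sqlite")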
On Mon, 15 Nov 2010 13:28:40 -0500, Anthony Damico wrote:
> Do you have any other ideas as to how I might diagnose what's going on
> here? Or, alternatively, is there some workaround that would get this giant
> CSV into a database? If you think there's a reasonable way to use the
> IMPORT command with RSQLite, that seems like it would import the fastest,
> but I don't know that it's compatible with DBI on Windows.
>
> Thanks again!
> Anthony

If you are able to successfully read one entire 1.3 GB file at a time, I would skip sqldf and do this (after setting the working directory appropriately):

connSQLite = dbConnect(dbDriver("SQLite"), dbname = "acs")

ss09pusa = read.csv("ss09pusa.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusa)
rm(ss09pusa)

ss09pusb = read.csv("ss09pusb.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusb, append = TRUE)
rm(ss09pusb)

# Test select of specific columns
sql = "SELECT ST, PUMA, ADJINC FROM acs2009 WHERE ST = 33"
dfIncome = dbGetQuery(connSQLite, sql)

I was *not* able to load one entire table at a time, so I made it work by combining read.csv.sql from sqldf to read one chunk at a time with dbWriteTable from RSQLite to write each chunk to the database. The read would then look like, for example:

ss09pusa = read.csv.sql("ss09pusa.csv",
    sql = paste("SELECT * FROM file WHERE ST =", i))

where i iterates over the state FIPS codes. (You could just use 1 to 56, even though there are a few missing numbers in there; the searches for nonexistent records will take time but will otherwise be harmless.) The dbWriteTable calls would be the same as above, with every write after the first one using append = TRUE.

Also, I assume you will want to do the read from CSV / write to SQLite *once*, then maintain the SQLite database for other sessions. We also use large census and IPUMS data sets, and we keep everything in a Postgres backend, which we access with RPostgreSQL. It is much easier to keep everything organized in an RDBMS than to pass around these monstrous CSV files.
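Putting those two pieces together, the chunked loop described above might look roughly like the following. This is only a sketch: the connection, file, and table names mirror the code above, ST is the ACS state FIPS column used there, and read.csv.sql rescans the whole CSV on every pass, so it trades speed for bounded memory.

library(sqldf)
library(RSQLite)

connSQLite = dbConnect(dbDriver("SQLite"), dbname = "acs")

first = TRUE
for (i in 1:56) {
    # the WHERE clause is applied inside SQLite, so only one state's
    # rows are returned to R at a time
    chunk = read.csv.sql("ss09pusa.csv",
        sql = paste("SELECT * FROM file WHERE ST =", i))
    if (nrow(chunk) == 0) next            # unused FIPS codes return no rows
    dbWriteTable(connSQLite, "acs2009", chunk, append = !first)
    first = FALSE
    rm(chunk)
}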
Hi Anthony,

On Mon, Nov 15, 2010 at 7:07 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to
> insert a very large CSV file into a SQLite database. I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.

Working with large data and doing operations in bounded memory tends not to be among the most obvious things to accomplish in R for newcomers.

> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that reads
> everything line by line, instead of pulling it all into memory.

Yes, that sounds like the right approach. Below is some sample code that does this and avoids reading the entire data set into memory. The key points are to use read.table on a file connection and to read in a batch of lines at a time. Each batch is inserted into the db using a prepared query. The example is complicated a bit because read.table on a file connection raises an error if no lines are available, so I chose to use tryCatch to handle that. A cleaner approach might be to check the number of rows read and break out of the while loop if the count is less than the batch size. Anyhow, see if this approach works for you.

library("RSQLite")

file_list <- c("ss09pusa.csv", "ss09pusb.csv")

input <- file(file_list[1], "r")
db <- dbConnect(SQLite(), dbname = "example.sqlite")

header <- readLines(input, n = 1)
fields <- strsplit(header, ",")[[1]]
colTypes <- rep("TEXT", length(fields))
colDecl <- paste(fields, colTypes)
sql <- sprintf("CREATE TABLE ss09 (%s)", paste(colDecl, collapse = ", "))
dbGetQuery(db, sql)

colClasses <- rep("character", length(fields))
sql.in <- sprintf("INSERT INTO ss09 VALUES (%s)",
                  paste(rep("?", length(fields)), collapse = ","))

chunk_size <- 250000
dbBeginTransaction(db)
tryCatch({
    while (TRUE) {
        part <- read.table(input, nrows = chunk_size, sep = ",",
                           colClasses = colClasses, comment.char = "")
        dbGetPreparedQuery(db, sql.in, bind.data = part)
    }
}, error = function(e) {
    if (grepl("no lines available", conditionMessage(e)))
        TRUE
    else
        stop(conditionMessage(e))
})
dbCommit(db)

--
Seth Falcon | @sfalcon | http://userprimary.net/
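For reference, a minimal sketch of the "check the row count and break" variant mentioned above, reusing the input connection, db handle, sql.in statement, chunk_size, and colClasses defined in the code above:

dbBeginTransaction(db)
repeat {
    part <- read.table(input, nrows = chunk_size, sep = ",",
                       colClasses = colClasses, comment.char = "")
    dbGetPreparedQuery(db, sql.in, bind.data = part)
    # a short final chunk means the file is exhausted; note that a file whose
    # row count is an exact multiple of chunk_size would still hit the
    # "no lines available" error on one extra pass, so keep the tryCatch
    # approach if that case matters
    if (nrow(part) < chunk_size) break
}
dbCommit(db)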
Hi Abhijit,

[I've cc'd R-help to keep the discussion on the list]

On Tue, Nov 16, 2010 at 8:06 AM, Abhijit Dasgupta <adasgupta at araastat.com> wrote:
> Seth,
>
> I was looking for something like this too. I have a question: if you're
> reading the data from a connection, does R start reading the next chunk of
> data right after the previous chunk, or do we need to keep track of things
> using "skip"?

The purpose of using a file connection is to allow R to keep its place in the file as it reads, rather than having to re-read or skip over earlier lines on each pass. This is considerably more efficient.

--
Seth Falcon | @sfalcon | http://userprimary.net/
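In other words, successive reads from the same connection pick up where the previous one stopped, with no skip bookkeeping. A toy sketch (the file name is taken from the thread; column types are left for read.table to guess):

input <- file("ss09pusa.csv", "r")
header <- readLines(input, n = 1)                      # consumes the header line
chunk1 <- read.table(input, nrows = 1000, sep = ",")   # data rows 1-1000
chunk2 <- read.table(input, nrows = 1000, sep = ",")   # continues with rows 1001-2000
close(input)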