Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to insert a very large CSV file into a SQLite database. I'm pretty new to working with databases in R, so I apologize if I'm overlooking something obvious here.

I'm trying to work with the American Community Survey data, which comes as two 1.3GB CSV files. I have enough RAM to read one of them into memory, but not both at the same time. So, in order to analyze them, I'm trying to get them into a SQLite database so I can use the R survey package's database-backed survey objects capabilities (http://faculty.washington.edu/tlumley/survey/svy-dbi.html).

I need to combine both of these CSV files into one table (within a database), so I think I need a SQL manipulation technique that reads everything line by line instead of pulling it all into memory.

I've tried using read.csv.sql, but it finishes without an error and then only shows me the table structure when I run the final select statement. When I run these exact same commands on a smaller CSV file, they work fine. I imagine this is not working because the CSV is so large, but I'm not sure how to confirm that or what to change if it is. I do want to get all columns from the CSV into the database table, so I don't want to filter anything.

library(sqldf)
setwd("R:\\American Community Survey\\Data\\2009")
sqldf("attach 'sqlite' as new")
read.csv.sql("ss09pusa.csv",
    sql = "create table ss09pusa as select * from file",
    dbname = "sqlite")
sqldf("select * from ss09pusa limit 3", dbname = "sqlite")

I've also tried using the SQL IMPORT command, which I couldn't get working properly, even on a tiny two-field, five-row CSV file.

library(RSQLite)
setwd("R:\\American Community Survey\\Data\\2009")
in_csv <- file("test.csv")
out_db <- dbConnect(SQLite(), dbname = "sqlite.db")
dbGetQuery(out_db, "create table test (hello integer, world text)")
dbGetQuery(out_db, "import in_csv test")

Any advice would be sincerely appreciated. Thanks!

Anthony Damico
Kaiser Family Foundation
Anthony-107 wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to
> insert a very large CSV file into a SQLite database.

Better to use an external utility if this is a one-time import for this job:
http://sqlitebrowser.sourceforge.net/

Dieter
Gabor Grothendieck
2010-Nov-15 15:41 UTC
[R] How to Read a Large CSV into a Database with R
On Mon, Nov 15, 2010 at 10:07 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to
> insert a very large CSV file into a SQLite database. I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.
>
> I'm trying to work with the American Community Survey data, which is two
> 1.3GB csv files. I have enough RAM to read one of them into memory, but not
> both at the same time. So, in order to analyze them, I'm trying to get them
> into a SQLite database so I can use the R survey package's database-backed
> survey objects capabilities
> (http://faculty.washington.edu/tlumley/survey/svy-dbi.html).
>
> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that reads
> everything line by line, instead of pulling it all into memory.
>
> I've tried using read.csv.sql, but it finishes without an error and then
> only shows me the table structure when I run the final select statement.
> When I run these exact same commands on a smaller CSV file, they work fine.
> I imagine this is not working because the csv is so large, but I'm not sure
> how to confirm that or what to change if it is. I do want to get all
> columns from the CSV into the data table, so I don't want to filter
> anything.
>
> library(sqldf)
> setwd("R:\\American Community Survey\\Data\\2009")
> sqldf("attach 'sqlite' as new")
> read.csv.sql("ss09pusa.csv",
>     sql = "create table ss09pusa as select * from file",
>     dbname = "sqlite")
> sqldf("select * from ss09pusa limit 3", dbname = "sqlite")

What the above code does, which is unlikely to be what you intended, is the following. It creates an SQLite database called 'sqlite', reads the indicated file into that database, reads the table from SQLite back into R (this step will clearly fail if the data is too big for R, but if it is not, you are OK), and then deletes the table from the database. As a result, your final sqldf statement should give an error, since there is no such table; or, if you happen to have a data frame in your R workspace called ss09pusa, that sqldf statement will load it into a database table, retrieve its first three rows, and then delete the table.

This sort of task is probably more suitable for RSQLite than sqldf, but if you wish to do it with sqldf you need to follow Example 9 or Example 10 on the sqldf home page.

In Example 9, http://code.google.com/p/sqldf/#Example_9.__Working_with_Databases, it is very important to note that sqldf automatically deletes any table that it created once the sqldf or read.csv.sql statement is done. The way to keep the table from being dropped is to make sure the table is created by an SQL statement that you issue yourself, "create table mytab as select ...", rather than by sqldf.

In Example 10, http://code.google.com/p/sqldf/#Example_10._Persistent_Connections, persistent connections are illustrated, which are an alternative way to do this in sqldf.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
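For concreteness, the Example 9 pattern is essentially the attach-then-create-table route the original post already follows: what keeps the table around is the explicit "create table ... as select" statement, not a table that sqldf uploads on its own. A minimal sketch, with 'acs.sqlite' as a placeholder database name (the authoritative version is on the sqldf page):

library(sqldf)

# create (or attach) a file-backed SQLite database
sqldf("attach 'acs.sqlite' as new")

# read the csv straight into a table inside that database; because the table
# is created by the explicit "create table ... as select" in the sql argument,
# sqldf does not drop it when the call finishes
read.csv.sql("ss09pusa.csv",
    sql = "create table ss09pusa as select * from file",
    dbname = "acs.sqlite")

# later calls query the same database file
sqldf("select count(*) from ss09pusa", dbname = "acs.sqlite")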
On Mon, 15 Nov 2010 13:28:40 -0500, Anthony Damico wrote:
> Do you have any other ideas as to how I might diagnose what's going on
> here? Or, alternatively, is there some workaround that would get this giant
> CSV into a database? If you think there's a reasonable way to use the
> IMPORT command with RSQLite, that seems like it would import the fastest,
> but I don't know that it's compatible with DBI on Windows.
>
> Thanks again!
> Anthony

If you are able to successfully read one entire 1.3 GB file at a time, I would skip sqldf and do this (after setting the working directory appropriately):

connSQLite = dbConnect(dbDriver("SQLite"), dbname = "acs")

ss09pusa = read.csv("ss09pusa.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusa)
rm(ss09pusa)

ss09pusb = read.csv("ss09pusb.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusb, append = TRUE)
rm(ss09pusb)

# Test select of specific columns
sql = "SELECT ST, PUMA, ADJINC FROM acs2009 WHERE ST = 33"
dfIncome = dbGetQuery(connSQLite, sql)

I was *not* able to load one entire table at a time, so I made it work by combining read.csv.sql from sqldf to read one chunk at a time with dbWriteTable from RSQLite to write each chunk to the database. The read would then look like, for example:

ss09pusa = read.csv.sql("ss09pusa.csv",
    sql = paste("SELECT * FROM file WHERE ST =", i))

where i iterates over the state FIPS codes. (You could just use 1 to 56, even though there are a few missing numbers in there; the searches for nonexistent records will take time but will otherwise be harmless.) The dbWriteTable calls would be the same as above, with every write after the first one using append = TRUE.

Also, I assume you will want to do the read from CSV / write to SQLite *once*, then maintain the SQLite database for other sessions. We also use large census and IPUMS data sets, and we keep everything in a Postgres backend, which we access with RPostgreSQL. It is much easier to keep everything organized in an RDBMS than to pass around these monstrous CSV files.
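Putting those two pieces together, the chunked loop described above might look roughly like the following. This is only a sketch: the connection, file, and table names mirror the code above, ST is the ACS state FIPS column used there, and read.csv.sql rescans the whole CSV on every pass, so it trades speed for bounded memory.

library(sqldf)
library(RSQLite)

connSQLite = dbConnect(dbDriver("SQLite"), dbname = "acs")

first = TRUE
for (i in 1:56) {
    # the WHERE clause is applied inside SQLite, so only one state's
    # rows are returned to R at a time
    chunk = read.csv.sql("ss09pusa.csv",
        sql = paste("SELECT * FROM file WHERE ST =", i))
    if (nrow(chunk) == 0) next            # unused FIPS codes return no rows
    dbWriteTable(connSQLite, "acs2009", chunk, append = !first)
    first = FALSE
    rm(chunk)
}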
Hi Anthony,

On Mon, Nov 15, 2010 at 7:07 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to
> insert a very large CSV file into a SQLite database. I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.

Working with large data and doing operations in bounded memory tends not to be among the most obvious things to accomplish in R for newcomers.

> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that reads
> everything line by line, instead of pulling it all into memory.

Yes, that sounds like the right approach. Below is some sample code that does this and avoids reading the entire data set into memory. The key points are to use read.table on a file connection and to read in a batch of lines at a time. Each batch is inserted into the db using a prepared query. The example is complicated a bit because read.table on a file connection raises an error if no lines are available, so I chose to use tryCatch to handle that. A cleaner approach might be to check the number of rows read and break out of the while loop if the count is less than the batch size. Anyhow, see if this approach works for you.

library("RSQLite")

file_list <- c("ss09pusa.csv", "ss09pusb.csv")

input <- file(file_list[1], "r")
db <- dbConnect(SQLite(), dbname = "example.sqlite")

header <- readLines(input, n = 1)
fields <- strsplit(header, ",")[[1]]
colTypes <- rep("TEXT", length(fields))
colDecl <- paste(fields, colTypes)
sql <- sprintf("CREATE TABLE ss09 (%s)", paste(colDecl, collapse = ", "))
dbGetQuery(db, sql)

colClasses <- rep("character", length(fields))
sql.in <- sprintf("INSERT INTO ss09 VALUES (%s)",
                  paste(rep("?", length(fields)), collapse = ","))

chunk_size <- 250000
dbBeginTransaction(db)
tryCatch({
    while (TRUE) {
        part <- read.table(input, nrows = chunk_size, sep = ",",
                           colClasses = colClasses, comment.char = "")
        dbGetPreparedQuery(db, sql.in, bind.data = part)
    }
}, error = function(e) {
    if (grepl("no lines available", conditionMessage(e)))
        TRUE
    else
        stop(conditionMessage(e))
})
dbCommit(db)

--
Seth Falcon | @sfalcon | http://userprimary.net/
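For reference, a minimal sketch of the "check the row count and break" variant mentioned above, reusing the input connection, db handle, sql.in statement, chunk_size, and colClasses defined in the code above:

dbBeginTransaction(db)
repeat {
    part <- read.table(input, nrows = chunk_size, sep = ",",
                       colClasses = colClasses, comment.char = "")
    dbGetPreparedQuery(db, sql.in, bind.data = part)
    # a short final chunk means the file is exhausted; note that a file whose
    # row count is an exact multiple of chunk_size would still hit the
    # "no lines available" error on one extra pass, so keep the tryCatch
    # approach if that case matters
    if (nrow(part) < chunk_size) break
}
dbCommit(db)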
Hi Abhijit,

[I've cc'd R-help to keep the discussion on the list]

On Tue, Nov 16, 2010 at 8:06 AM, Abhijit Dasgupta <adasgupta at araastat.com> wrote:
> Seth,
>
> I was looking for something like this too. I have a question: if you're
> reading the data from a connection, does R start reading the next chunk of
> data right after the previous chunk, or do we need to keep track of things
> using "skip"?

The purpose of using a file connection is to allow R to keep its place in the file as it reads, rather than having to re-read or skip over earlier lines on each pass. This is considerably more efficient.

--
Seth Falcon | @sfalcon | http://userprimary.net/
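In other words, successive reads from the same connection pick up where the previous one stopped, with no skip bookkeeping. A toy sketch (the file name is taken from the thread; column types are left for read.table to guess):

input <- file("ss09pusa.csv", "r")
header <- readLines(input, n = 1)                      # consumes the header line
chunk1 <- read.table(input, nrows = 1000, sep = ",")   # data rows 1-1000
chunk2 <- read.table(input, nrows = 1000, sep = ",")   # continues with rows 1001-2000
close(input)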