Hi all, I am attempting to learn SQL through sqldf... One task I am particularly interested in is merging separate (presumably large) files into a single table without loading these files into R as an intermediate step (by loading them into SQLite and merging them there). Taking a step back, I've considered these alternatives: 1) I know if I use straight SQLite commands I might use the 'IMPORT' or 'INSERT INTO' command, which is not terribly flexible... (I can read large files line-by-line in Python and use the 'INSERT INTO' command, which is reasonably fast; I could do this in R as well but my experience with R's input/output is that it's much slower...? and sometimes setting up the table column definitions can be tedious if there are many variables). 2) dbWriteTable() with append=TRUE is very convenient except that it requires I load the data into R first... 3) sqldf's capability to put data directly into a database is something I'd like to work out. So in this case I have a series of tab-delimited text file with the first line being a header. For some reason I cannot seem to get it working. Combining examples 6 and 9 from the Google Code page (and R-help archives), I tried source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") (do I need it for SQLite?) ## sqldf("attach 'mydb.db' as new") f <- file("myexample.txt") attr(f,"file.format") <- list(header=TRUE,sep="\t") sqldf("create table myexample as select * from f", stringsAsFactors=FALSE, dbname="mydb.db") ## or f <- file(fi) sqldf("create table myexample as select * from f", stringsAsFactors=FALSE,file.format=list(header=TRUE,sep="\t"), dbname="mydb.db") ## sqldf("select * from myexample",dbname="mydb.db") gives me tables with 0 rows and 0 columns... So in any case I have a few questions: === 1 ===Would this be scalable to files with few GBs of data in them (I guess I am uncertain of the underlying mechanism for transporting data from the .txt file to the .db file... I see there is a call the dbWriteTable() internally in sqldf but through the connection)? And is there anything obviously doing wrong above? === 2 === Since I cannot 'append' rows to existing tables in SQLite (or any SQL database), I think what I would have to do is to load each of the files into the database and then 'rbind' them into a new table using 'union all'? The following code is what I have (the part where I read in each file before dumping into the database file would hopefully be replaced by the method above, if it can be made to work). ## (1) create a database file sqldf("attach 'alltables.db' as new") ## (2) convenience function sql <- function(...) sqldf(...,dbname="alltables.db") ## (3) load data as separate tables for( fi in list.files(".","txt$") ) { mydata <- read.delim(fi) sql(sprintf("create table %s as select * from mydata",sub("\\.txt","",fi))) } rm(fi,mydata) ## (4) merge tables tablenames <- as.character(sql("select * from sqlite_master")$name) sql(paste("create table myfulltable as", paste(sprintf("select * from %s",tablenames), collapse=" union all "))) ## (5) delete separate tables since we have a merged one for( nm in tablenames ) sql(sprintf("drop table %s",nm)) === 3 ==The following command sqldf("attach 'mydb.db' as new") DF <- read.delim(fi) sqldf("create table myexample as select * from DF",dbname="mydb.db") will usually create a .db file twice the size of the .txt file (for now I am playing with a files ~500KB so the .db files are around ~1MB). When I create a .db file using SQLite's 'import' command, RSQLite's dbWriteTable(), or inserting values from the same .txt file from Python's SQLite interface, I get .db files that are approximately the same size as the .txt file (~500KB). Is the larger file size for sqldf's method expected? Many thanks in advance! Stephen Tucker
Have just added an example 12 on the home page: http://sqldf.googlecode.com that shows an example. Note use of notation main.mytable to refer to an existing table in the main database (as opposed to a data frame in R). On Thu, Feb 19, 2009 at 11:55 PM, Stephen Tucker <brown_emu at yahoo.com> wrote:> Hi all, > > I am attempting to learn SQL through sqldf... > > One task I am particularly interested in is merging separate > (presumably large) files into a single table without loading these > files into R as an intermediate step (by loading them into SQLite and > merging them there). > > Taking a step back, I've considered these alternatives: > > 1) I know if I use straight SQLite commands I might use the 'IMPORT' > or 'INSERT INTO' command, which is not terribly flexible... (I can > read large files line-by-line in Python and use the 'INSERT INTO' > command, which is reasonably fast; I could do this in R as well but my > experience with R's input/output is that it's much slower...? and > sometimes setting up the table column definitions can be tedious if > there are many variables). > > 2) dbWriteTable() with append=TRUE is very convenient except that it > requires I load the data into R first... > > 3) sqldf's capability to put data directly into a database is > something I'd like to work out. > > So in this case I have a series of tab-delimited text file with the > first line being a header. > > For some reason I cannot seem to get it working. Combining examples 6 > and 9 from the Google Code page (and R-help archives), I tried > > source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") > (do I need it for SQLite?) > ## > sqldf("attach 'mydb.db' as new") > f <- file("myexample.txt") > attr(f,"file.format") <- list(header=TRUE,sep="\t") > sqldf("create table myexample as select * from f", > stringsAsFactors=FALSE, > dbname="mydb.db") > ## or > f <- file(fi) > sqldf("create table myexample as select * from f", > stringsAsFactors=FALSE,file.format=list(header=TRUE,sep="\t"), > dbname="mydb.db") > ## > sqldf("select * from myexample",dbname="mydb.db") > gives me tables with 0 rows and 0 columns... > > So in any case I have a few questions: > > === 1 ===> Would this be scalable to files with few GBs of data in them (I guess > I am uncertain of the underlying mechanism for transporting data from > the .txt file to the .db file... I see there is a call the > dbWriteTable() internally in sqldf but through the connection)? And > is there anything obviously doing wrong above? > > === 2 ==> Since I cannot 'append' rows to existing tables in SQLite (or any SQL > database), I think what I would have to do is to load each of the > files into the database and then 'rbind' them into a new table using > 'union all'? The following code is what I have (the part where I read > in each file before dumping into the database file would hopefully be > replaced by the method above, if it can be made to work). > > ## (1) create a database file > sqldf("attach 'alltables.db' as new") > ## (2) convenience function > sql <- function(...) sqldf(...,dbname="alltables.db") > ## (3) load data as separate tables > for( fi in list.files(".","txt$") ) { > mydata <- read.delim(fi) > sql(sprintf("create table %s as select * from mydata",sub("\\.txt","",fi))) > } > rm(fi,mydata) > ## (4) merge tables > tablenames <- as.character(sql("select * from sqlite_master")$name) > sql(paste("create table myfulltable as", > paste(sprintf("select * from %s",tablenames), > collapse=" union all "))) > ## (5) delete separate tables since we have a merged one > for( nm in tablenames ) sql(sprintf("drop table %s",nm)) > > === 3 ==> The following command > sqldf("attach 'mydb.db' as new") > DF <- read.delim(fi) > sqldf("create table myexample as select * from DF",dbname="mydb.db") > > will usually create a .db file twice the size of the .txt file (for > now I am playing with a files ~500KB so the .db files are around > ~1MB). When I create a .db file using SQLite's 'import' command, > RSQLite's dbWriteTable(), or inserting values from the same .txt file > from Python's SQLite interface, I get .db files that are approximately > the same size as the .txt file (~500KB). Is the larger file size for > sqldf's method expected? > > Many thanks in advance! > > Stephen Tucker > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Thanks yet another time, Gabor - I think I am slowly understanding - particularly I was confused by persistence of connections. So starting with some parts of your example 12, ## gc() rm(list=ls()) unlink("mydb") sqldf("attach 'mydb' as new") irishead <- file("irishead.dat") iristail <- file("iristail.dat") If I just wanted to merge the two files within SQL and return some part of the result, I would do sqldf('select count(*) from (select * from irishead union select * from iristail)',dbname="mydb") and the tables exist in mydb only for the duration of the computation> sqldf("select * from sqlite_master",dbname="mydb")$nameNULL (but why is the size of mydb > 0 afterwards, if it contains no tables...?) ...is the above the same as sqldf('select count(*) from (select * from irishead union select * from iristail)',dbname=tempfile()) except that I don't create 'mydb'? If I wanted to save the merged table (for use in a later session): sqldf('create table fulliris as select * from irishead union select * from iristail',dbname="mydb")> sqldf("select * from sqlite_master",dbname="mydb")$name[1] fulltable Levels: fulltable If I want copies of all three tables, sqldf(dbname="mydb") sqldf('create table fulltable as select * from irishead union select * from iristail') sqldf()> sqldf("select * from sqlite_master",dbname="mydb")$name[1] irishead iristail fulltable Levels: fulltable irishead iristail ? ...I'll try to go figure a few more things out in the in the meantime (like using sep="\t" ?) and using connections with sqldf(). But thanks for the help! Stephen ----- Original Message ---- From: Gabor Grothendieck <ggrothendieck at gmail.com> To: Stephen Tucker <brown_emu at yahoo.com> Cc: R-help <r-help at stat.math.ethz.ch> Sent: Friday, February 20, 2009 5:22:09 AM Subject: Re: [R] importing data to SQLite database with sqldf Have just added an example 12 on the home page: http://sqldf.googlecode.com that shows an example. Note use of notation main.mytable to refer to an existing table in the main database (as opposed to a data frame in R). On Thu, Feb 19, 2009 at 11:55 PM, Stephen Tucker <brown_emu at yahoo.com> wrote:> Hi all, > > I am attempting to learn SQL through sqldf... > > One task I am particularly interested in is merging separate > (presumably large) files into a single table without loading these > files into R as an intermediate step (by loading them into SQLite and > merging them there). > > Taking a step back, I've considered these alternatives: > > 1) I know if I use straight SQLite commands I might use the 'IMPORT' > or 'INSERT INTO' command, which is not terribly flexible... (I can > read large files line-by-line in Python and use the 'INSERT INTO' > command, which is reasonably fast; I could do this in R as well but my > experience with R's input/output is that it's much slower...? and > sometimes setting up the table column definitions can be tedious if > there are many variables). > > 2) dbWriteTable() with append=TRUE is very convenient except that it > requires I load the data into R first... > > 3) sqldf's capability to put data directly into a database is > something I'd like to work out. > > So in this case I have a series of tab-delimited text file with the > first line being a header. > > For some reason I cannot seem to get it working. Combining examples 6 > and 9 from the Google Code page (and R-help archives), I tried > > source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") > (do I need it for SQLite?) > ## > sqldf("attach 'mydb.db' as new") > f <- file("myexample.txt") > attr(f,"file.format") <- list(header=TRUE,sep="\t") > sqldf("create table myexample as select * from f", > stringsAsFactors=FALSE, > dbname="mydb.db") > ## or > f <- file(fi) > sqldf("create table myexample as select * from f", > stringsAsFactors=FALSE,file.format=list(header=TRUE,sep="\t"), > dbname="mydb.db") > ## > sqldf("select * from myexample",dbname="mydb.db") > gives me tables with 0 rows and 0 columns... > > So in any case I have a few questions: > > === 1 ===> Would this be scalable to files with few GBs of data in them (I guess > I am uncertain of the underlying mechanism for transporting data from > the .txt file to the .db file... I see there is a call the > dbWriteTable() internally in sqldf but through the connection)? And > is there anything obviously doing wrong above? > > === 2 ==> Since I cannot 'append' rows to existing tables in SQLite (or any SQL > database), I think what I would have to do is to load each of the > files into the database and then 'rbind' them into a new table using > 'union all'? The following code is what I have (the part where I read > in each file before dumping into the database file would hopefully be > replaced by the method above, if it can be made to work). > > ## (1) create a database file > sqldf("attach 'alltables.db' as new") > ## (2) convenience function > sql <- function(...) sqldf(...,dbname="alltables.db") > ## (3) load data as separate tables > for( fi in list.files(".","txt$") ) { > mydata <- read.delim(fi) > sql(sprintf("create table %s as select * from mydata",sub("\\.txt","",fi))) > } > rm(fi,mydata) > ## (4) merge tables > tablenames <- as.character(sql("select * from sqlite_master")$name) > sql(paste("create table myfulltable as", > paste(sprintf("select * from %s",tablenames), > collapse=" union all "))) > ## (5) delete separate tables since we have a merged one > for( nm in tablenames ) sql(sprintf("drop table %s",nm)) > > === 3 ==> The following command > sqldf("attach 'mydb.db' as new") > DF <- read.delim(fi) > sqldf("create table myexample as select * from DF",dbname="mydb.db") > > will usually create a .db file twice the size of the .txt file (for > now I am playing with a files ~500KB so the .db files are around > ~1MB). When I create a .db file using SQLite's 'import' command, > RSQLite's dbWriteTable(), or inserting values from the same .txt file > from Python's SQLite interface, I get .db files that are approximately > the same size as the .txt file (~500KB). Is the larger file size for > sqldf's method expected? > > Many thanks in advance! > > Stephen Tucker > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >