On Thu, Jul 14, 2011 at 10:33 AM, Mandans <mandans_p at yahoo.com> wrote:
> sqldf with sqlite and H2
>
> I have a large csv file (about 2GB) and wanted to import the file into R
> and do some filtering and analysis. I came across sqldf (a great idea and
> product) and was trying to play around to see what would be the best method
> of doing this. The csv file is comma delimited, with some columns having a
> comma inside the quotation, like this: "John, Doe".
>
> I tried this first
>
> #######
> library(sqldf)
> sqldf("attach testdb as new")
> In.File <- "C:/JP/Temp/2008.csv"
> read.csv.sql(In.File, sql = "create table table1 as select * from file",
>     dbname = "testdb")
>
> It errored out with message
>
> NULL
> Warning message:
> closing unused connection 3 (C:/JP/Temp/2008.csv)
>
> When this failed, I converted the file from comma delimited to tab
> delimited and used this command
>
> #########
> read.csv.sql(In.File, sql = "create table table1 as select * from file",
>     dbname = "testdb", sep = "\t")
>
> and this worked; it created the testdb sqlite file with a size of 3GB.
>
> Now my question is in 3 parts.
>
> 1. Is it possible to create a data frame with appropriate column classes
> and use those column classes when I use the read.csv.sql command to create
> the table? Something like, maybe, create the table from that data frame and
> then update it with read.csv.sql?
>
> Any example code will be really helpful.
Here is an example of using method = "name__class". Note that there are
two underscores in a row. It appears I neglected to document that Date2
means convert from a character representation, whereas Date means
convert from a numeric representation. It would also be possible to use
method = "raw" and then coerce the columns yourself afterwards, as shown
in the second example below.
# create test file
Lines <- 'A__Date2|B
2000-01-01|x,y
2000-01-02|c,d
'
tf <- tempfile()
cat(Lines, file = tf)
library(sqldf)
DF <- read.csv.sql(tf, sep = "|", method = "name__class")
str(DF)
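
Here is a similar sketch of the method = "raw" alternative (untested; the
second test file and its column names are just for illustration): read the
columns with no class conversion and coerce them yourself afterwards.

# create a second test file, this time without the __class suffix
Lines2 <- 'A|B
2000-01-01|x,y
2000-01-02|c,d
'
tf2 <- tempfile()
cat(Lines2, file = tf2)

# method = "raw" does no conversion, so column A comes back as character
DF2 <- read.csv.sql(tf2, sep = "|", method = "raw")
DF2$A <- as.Date(DF2$A)   # coerce the date column by hand
str(DF2)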
>
> 2. If we use the H2 database instead of the default sqlite and use the
> CSVREAD option, will that be faster? And is there a way to apply the above
> idea of giving the table columns the classes of a data frame and then
> updating the table with CSVREAD?
>
> library(RH2)
> something like SELECT * FROM CSVREAD('C:/JP/Temp/2008.csv')
>
> Any example code will be really helpful.
Sorry, I haven't tested the speed of this. PostgreSQL and MySQL, both
supported by sqldf, also have built-in facilities to read files. If I had
to guess I would guess that MySQL would be fastest, but this would have
to be tested.
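
As an untested sketch of the H2 route (the file path is just the one from
your example): if RH2 is loaded, sqldf should use H2 by default, and H2's
CSVREAD can then read the file directly.

# loading RH2 makes sqldf use H2 instead of the default sqlite
library(RH2)
library(sqldf)

# CSVREAD is H2's built-in csv reader; adjust the path to your file
DF <- sqldf("SELECT * FROM CSVREAD('C:/JP/Temp/2008.csv')")
str(DF)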
>
> 3. How do we specify where the H2 file is saved? I saw something like this;
> when I ran this example from the RH2 package, I couldn't find the file in
> the working directory.
>
> con <- dbConnect(H2(), "jdbc:h2:~/test", "sa", "")
~ means your home directory, so ~/test means the test database is in your
home directory.  Try

normalizePath("~")
normalizePath("~/test")

and so on, to see what they refer to.
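
If you want the database files somewhere else, I believe you can put an
absolute path in the JDBC URL instead, e.g. (untested; the directory is only
an example):

library(RH2)
# store the H2 database files under C:/JP/Temp rather than the home directory
con <- dbConnect(H2(), "jdbc:h2:C:/JP/Temp/test", "sa", "")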
Regards.
>
> Sorry for the long mail. I appreciate everyone for building a great
> community and for the wonderful software in R.
> Thanks to Gabor Grothendieck for bringing sqldf to this great community.
>
> Any help or direction you can provide in this is highly appreciated.
>
> Thanks all.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com