Hi All,

I have a very (very) large tab-delimited text file without headers. There are only 8 columns and millions of rows. I want to make numerous pieces of this file by subsetting it for individual stations; the station is given in the first column. I am trying to learn and use the sqldf package for this but am stuck in a couple of places.

To simulate my requirement, I have taken the iris dataset as an example and have done the following:
(1) create a tab-delimited file without headers;
(2) read it using the read.csv.sql command;
(3) write the result of a query, getting the first 10 records.

Here is the reproducible code that I am trying:

# Text data file
write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)
# create an empty database (can skip this step if the database already exists)
sqldf("attach mytestdb as new")
f1 <- file("irisNoH.txt")
attr(f1, "file.format") <- list(header = FALSE, sep = "\t")
# read into a table called irisTab1 in the mytestdb SQLite database
read.csv.sql("irisNoH.txt",
             sql = "create table main.irisTab1 as select * from file",
             dbname = "mytestdb")
res1 <- sqldf("select * from main.irisTab1 limit 10", dbname = "mytestdb")
write.table(res1, "iris10.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)

# For querying records of a particular species - unresolved problems
#a1 <- "virginica"
#attr(f1, "names") <- c("A1","A2","A3","A4","A5")
#res2 <- fn$sqldf("select * from main.irisTab1 where A5 = '$a1'")

In the above, I am not able to:
(1) assign names to the various columns;
(2) query for a particular value of a column, in this case a particular species, say virginica;
(3) I guess fn$sqldf can do the job, but it requires assigning column names.

Any help would be most appreciated.

Thanks
HC
On Wed, Feb 1, 2012 at 11:57 PM, HC <hcatbr at yahoo.co.in> wrote:
> I have a very (very) large tab-delimited text file without headers. There
> are only 8 columns and millions of rows. I want to make numerous pieces of
> this file by subsetting it for individual stations.
> [...]

Ignoring your iris file for a moment, to query the 5th column (getting its name via sql rather than via R) we can do this:

library(sqldf)

species <- "virginica"
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")

Now, sqldf is best used when you are getting the data from R, but if you want to store it in a database and just leave it there then you might be better off using RSQLite directly, like this (the eol = "\r\n" in the dbWriteTable statement was needed on my Windows system, but you may not need it depending on your platform):

write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)

library(sqldf)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mytestdb")
dbWriteTable(con, "iris", "irisNoH.txt", sep = "\t", eol = "\r\n")

species <- "virginica"
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")

dbDisconnect(con)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
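Adapting that second approach to the 8-column station file from the original question would presumably look something like the sketch below. Only the pattern comes from the thread: the file name stations.txt, the table name, the station id ST001, and the header = FALSE argument to dbWriteTable are all illustrative assumptions.

library(sqldf)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mytestdb")

# import the headerless 8-column station file into a table called "stations";
# "stations.txt" is a placeholder, and header = FALSE is an assumption about
# the file-import method of dbWriteTable
dbWriteTable(con, "stations", "stations.txt",
             header = FALSE, sep = "\t", eol = "\r\n")

# get whatever column names SQLite assigned, then pull one station's rows
# (station id assumed to be in the first column; "ST001" is a made-up value)
nms <- names(dbGetQuery(con, "select * from stations limit 0"))
station <- "ST001"
res <- fn$dbGetQuery(con, "select * from stations where `nms[1]` = '$station'")

write.table(res, paste("station_", station, ".txt", sep = ""), sep = "\t",
            quote = FALSE, col.names = FALSE, row.names = FALSE)
dbDisconnect(con)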
On Fri, Feb 3, 2012 at 6:03 AM, HC <hcatbr at yahoo.co.in> wrote:
> Thank you for indicating that SQLite may not handle a file as big as 160 GB.
>
> Would you know of any utility for *physically splitting* the 160 GB text
> file into pieces? And if one can control the splitting at the end of a
> record?

If they are csv files or similar data files then you could use R or any scripting language to do that.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
This is a 160 GB tab-separated .txt file. It has 9 columns and 3.25x10^9 rows.

Can R handle it?

Thank you.
HC
On Fri, Feb 3, 2012 at 7:37 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> On Fri, Feb 3, 2012 at 6:03 AM, HC <hcatbr at yahoo.co.in> wrote:
>> Would you know of any utility for *physically splitting* the 160 GB text
>> file into pieces? And if one can control the splitting at the end of a
>> record?
>
> If they are csv files or similar data files then you could use R or
> any scripting language to do that.

Or even the *nix `split` command ...

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
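Spelled out, calling split from R might look like the sketch below. The file name Test.txt, the 10-million-line chunk size, and the chunk_ prefix are arbitrary choices, and a Unix-alike system with the coreutils split utility on the PATH is assumed.

# because -l splits on whole lines, no record is ever cut in the middle
system("split -l 10000000 Test.txt chunk_")

# this produces chunk_aa, chunk_ab, ... of 10 million lines each, which can
# then be read one at a time, e.g. with read.table(..., sep = "\t", header = FALSE)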
On Fri, Feb 3, 2012 at 8:08 AM, HC <hcatbr at yahoo.co.in> wrote:
> This is a 160 GB tab-separated .txt file. It has 9 columns and 3.25x10^9
> rows.
>
> Can R handle it?

You can process a file N lines at a time like this:

con <- file("myfile.dat", "r")
while(length(Lines <- readLines(con, n = N)) > 0) {
  ... whatever ...
}

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
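For the original goal of splitting the file out by station (given in the first column), the "... whatever ..." part of that loop might look roughly like the sketch below. The file name, the chunk size, and the assumption that station ids are safe to use in file names are all illustrative; only the chunked readLines pattern and the first-column station id come from the thread.

N <- 1000000                     # chunk size; adjust to available memory
con <- file("myfile.dat", "r")   # placeholder file name

while (length(Lines <- readLines(con, n = N)) > 0) {
  # everything before the first tab is the station id
  station <- sub("\t.*", "", Lines)

  # append each station's lines to its own file
  # (append = TRUE, so delete old output files before re-running)
  for (s in unique(station)) {
    cat(Lines[station == s], "", sep = "\n",
        file = paste("station_", s, ".txt", sep = ""), append = TRUE)
  }
}
close(con)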
Thank you. The readLines command is working fine and I am able to read 10^6 lines in one go and write them out using the write.table command.

Does readLines use a block concept to optimize, or does it go line by line?

Steve has mentioned the *nix split command. Would there be any speed benefit compared to readLines?

Thank you.
HC
Bad news!

The readLines command works fine up to a certain limit. Once a few files have been written, the R program crashes.

I used the following code:
*************************
iFile <- "Test.txt"
con <- file(iFile, "r")

N <- 1250000
iLoop <- 1

while (length(Lines <- readLines(con, n = N)) > 0 && iLoop < 41) {
  oFile <- paste("Split_", iLoop, ".txt", sep = "")
  write.table(Lines, oFile, sep = "\t", quote = FALSE, col.names = FALSE,
              row.names = FALSE)
  iLoop <- iLoop + 1
}
close(con)
********************

With the above N = 1.25 million, it wrote 28 files of about 57 MB each, i.e., a total of about 1.6 GB, and then crashed. I tried other values of N and it crashes at about the same place in terms of total output size, i.e., about 1.6 GB.

Is this due to some limitation of Windows 7, in terms of not keeping the file pointer valid after this size?

Your insight would be very helpful.

Thank you.
HC
Exactly what does "crashed" mean? What was the error message?

Have you tried putting:

rm(Lines)
gc()

at the end of the loop to free up and compact memory?

If you watch the performance, does the R process seem to be growing in terms of the amount of memory being used? You can add:

memory.size()

before the above statements to see how much memory is being used. This is just some more elementary debugging that you will have to learn when using any system.

On Fri, Feb 3, 2012 at 3:22 PM, HC <hcatbr at yahoo.co.in> wrote:
> The readLines command works fine up to a certain limit. Once a few files
> have been written, the R program crashes.
> [...]
> With N = 1.25 million, it wrote 28 files of about 57 MB each. That is a
> total of about 1.6 GB, and then it crashed.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
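Folded into the loop posted earlier, those suggestions would sit roughly as in the sketch below; memory.size() is Windows-only, and everything apart from the three added diagnostic lines is the code from the previous message.

iFile <- "Test.txt"
con <- file(iFile, "r")

N <- 1250000
iLoop <- 1

while (length(Lines <- readLines(con, n = N)) > 0 && iLoop < 41) {
  oFile <- paste("Split_", iLoop, ".txt", sep = "")
  write.table(Lines, oFile, sep = "\t", quote = FALSE, col.names = FALSE,
              row.names = FALSE)
  iLoop <- iLoop + 1

  cat("memory in use (MB):", memory.size(), "\n")  # Windows-only diagnostic
  rm(Lines)   # drop the chunk just written ...
  gc()        # ... and compact memory before reading the next one
}
close(con)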
Thank you, Jim, for your reply.

I could figure out that readLines works fine up to 35,841,335 lines (records). When the next line is to be read, a window pops up with the message "R for Windows GUI front-end has stopped working", with options to close the program or check online for a solution.

The tab-separated .txt file I am working on has 9 columns and about 3.25 billion rows. I suspect that the 35,841,336th line is so big that it outruns the RAM capacity. Perhaps the separators expected after each value are missing, making that one line very big; I am not sure.

Is there any way to skip this line with readLines or any other function? I am hoping that the data after this bad line are good and that I can read them if I can skip the bad one.

Thank you.
HC
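If it really is a single enormous, malformed record, one way to get past it within R might be to scan forward in raw blocks with readBin, count newline bytes, and copy everything after the offending record into a new file that readLines can then process as before. This is only a sketch under those assumptions: it presumes records end with "\n" (byte 10) and that the bad record is indeed line 35,841,336; the output file name and block size are arbitrary.

inFile    <- "Test.txt"
outFile   <- "Test_after_bad_line.txt"
badLine   <- 35841336     # the suspect record
blockSize <- 1e6          # read 1 MB of raw bytes at a time

con <- file(inFile, "rb")
out <- file(outFile, "wb")

linesSeen <- 0            # newlines seen so far
copying   <- FALSE        # TRUE once we are past the bad record

repeat {
  bytes <- readBin(con, what = "raw", n = blockSize)
  if (length(bytes) == 0) break
  if (copying) {
    writeBin(bytes, out)                      # past the bad record: copy as-is
  } else {
    nl <- which(bytes == as.raw(10L))         # positions of "\n" in this block
    if (linesSeen + length(nl) >= badLine) {
      # the newline that ends the bad record falls inside this block
      cut <- nl[badLine - linesSeen]
      if (cut < length(bytes))
        writeBin(bytes[(cut + 1):length(bytes)], out)
      copying <- TRUE
    } else {
      linesSeen <- linesSeen + length(nl)
    }
  }
}

close(out)
close(con)

The new file then starts at line 35,841,337 of the original and can be processed with the readLines loop as before. Copying the remaining data through R this way will be slow for a 160 GB file, but it avoids ever holding the oversized line in memory.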