Hi All,

I have a very (very) large tab-delimited text file without headers. There are only 8 columns and millions of rows. I want to make numerous pieces of this file by subsetting it for individual stations; the station is given in the first column. I am trying to learn and use the sqldf package for this but am stuck in a couple of places.

To simulate my requirement, I have taken the iris dataset as an example and have done the following:
(1) create a tab-delimited file without headers;
(2) read it using the read.csv.sql command;
(3) write the result of a query, getting the first 10 records.

Here is the reproducible code that I am trying:

# Text data file
write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)
# create an empty database (can skip this step if the database already exists)
sqldf("attach mytestdb as new")
f1 <- file("irisNoH.txt")
attr(f1, "file.format") <- list(header = FALSE, sep = "\t")
# read into a table called irisTab1 in the mytestdb SQLite database
read.csv.sql("irisNoH.txt",
             sql = "create table main.irisTab1 as select * from file",
             dbname = "mytestdb")
res1 <- sqldf("select * from main.irisTab1 limit 10", dbname = "mytestdb")
write.table(res1, "iris10.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)

# For querying records of a particular species - unresolved problems
#a1 <- "virginica"
#attr(f1, "names") <- c("A1","A2","A3","A4","A5")
#res2 <- fn$sqldf("select * from main.irisTab1 where A5 = '$a1'")

In the above, I am not able to:
(1) assign names to the various columns;
(2) query for a particular value of a column, in this case a particular species, say virginica;
(3) I guess fn$sqldf can do the job, but it requires assigning column names.

Any help would be most appreciated.

Thanks
HC
On Wed, Feb 1, 2012 at 11:57 PM, HC <hcatbr at yahoo.co.in> wrote:
> I have a very (very) large tab-delimited text file without headers. There
> are only 8 columns and millions of rows. I want to make numerous pieces of
> this file by subsetting it for individual stations.
> [...]

Ignoring your iris file for a moment, to query the 5th column (getting its name via sql rather than via R) we can do this:

library(sqldf)

species <- "virginica"
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")

Now, sqldf is best used when you are getting the data from R, but if you want to store it in a database and just leave it there then you might be better off using RSQLite directly, like this (the eol = "\r\n" in the dbWriteTable statement was needed on my Windows system, but you may not need it depending on your platform):

write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)

library(sqldf)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mytestdb")
dbWriteTable(con, "iris", "irisNoH.txt", sep = "\t", eol = "\r\n")

species <- "virginica"
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")

dbDisconnect(con)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
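Adapting that second approach to the 8-column station file from the original question would presumably look something like the sketch below. Only the pattern comes from the thread: the file name stations.txt, the table name, the station id ST001, and the header = FALSE argument to dbWriteTable are all illustrative assumptions.

library(sqldf)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mytestdb")

# import the headerless 8-column station file into a table called "stations";
# "stations.txt" is a placeholder, and header = FALSE is an assumption about
# the file-import method of dbWriteTable
dbWriteTable(con, "stations", "stations.txt",
             header = FALSE, sep = "\t", eol = "\r\n")

# get whatever column names SQLite assigned, then pull one station's rows
# (station id assumed to be in the first column; "ST001" is a made-up value)
nms <- names(dbGetQuery(con, "select * from stations limit 0"))
station <- "ST001"
res <- fn$dbGetQuery(con, "select * from stations where `nms[1]` = '$station'")

write.table(res, paste("station_", station, ".txt", sep = ""), sep = "\t",
            quote = FALSE, col.names = FALSE, row.names = FALSE)
dbDisconnect(con)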
On Fri, Feb 3, 2012 at 6:03 AM, HC <hcatbr at yahoo.co.in> wrote:
> Thank you for indicating that SQLite may not handle a file as big as 160 GB.
>
> Would you know of any utility for *physically splitting* the 160 GB text
> file into pieces? And if one can control the splitting at the end of a
> record?

If they are csv files or similar data files then you could use R or any scripting language to do that.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
This is a 160 GB tab-separated .txt file. It has 9 columns and 3.25x10^9 rows.

Can R handle it?

Thank you.
HC
On Fri, Feb 3, 2012 at 7:37 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> On Fri, Feb 3, 2012 at 6:03 AM, HC <hcatbr at yahoo.co.in> wrote:
>> Would you know of any utility for *physically splitting* the 160 GB text
>> file into pieces? And if one can control the splitting at the end of a
>> record?
>
> If they are csv files or similar data files then you could use R or
> any scripting language to do that.

Or even the *nix `split` command ...

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
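Spelled out, calling split from R might look like the sketch below. The file name Test.txt, the 10-million-line chunk size, and the chunk_ prefix are arbitrary choices, and a Unix-alike system with the coreutils split utility on the PATH is assumed.

# because -l splits on whole lines, no record is ever cut in the middle
system("split -l 10000000 Test.txt chunk_")

# this produces chunk_aa, chunk_ab, ... of 10 million lines each, which can
# then be read one at a time, e.g. with read.table(..., sep = "\t", header = FALSE)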
On Fri, Feb 3, 2012 at 8:08 AM, HC <hcatbr at yahoo.co.in> wrote:
> This is a 160 GB tab-separated .txt file. It has 9 columns and 3.25x10^9
> rows.
>
> Can R handle it?

You can process a file N lines at a time like this:

con <- file("myfile.dat", "r")
while(length(Lines <- readLines(con, n = N)) > 0) {
  ... whatever ...
}

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
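For the original goal of splitting the file out by station (given in the first column), the "... whatever ..." part of that loop might look roughly like the sketch below. The file name, the chunk size, and the assumption that station ids are safe to use in file names are all illustrative; only the chunked readLines pattern and the first-column station id come from the thread.

N <- 1000000                     # chunk size; adjust to available memory
con <- file("myfile.dat", "r")   # placeholder file name

while (length(Lines <- readLines(con, n = N)) > 0) {
  # everything before the first tab is the station id
  station <- sub("\t.*", "", Lines)

  # append each station's lines to its own file
  # (append = TRUE, so delete old output files before re-running)
  for (s in unique(station)) {
    cat(Lines[station == s], "", sep = "\n",
        file = paste("station_", s, ".txt", sep = ""), append = TRUE)
  }
}
close(con)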
Thank you. The readLines command is working fine and I am able to read 10^6 lines in one go and write them out using the write.table command.

Does readLines use a block concept to optimize, or does it go line by line?

Steve has mentioned the *nix split command. Would there be any speed benefit compared to readLines?

Thank you.
HC
Bad news!

The readLines command works fine up to a certain limit. Once a few files have been written, the R program crashes.

I used the following code:
*************************
iFile <- "Test.txt"
con <- file(iFile, "r")

N <- 1250000
iLoop <- 1

while (length(Lines <- readLines(con, n = N)) > 0 && iLoop < 41) {
  oFile <- paste("Split_", iLoop, ".txt", sep = "")
  write.table(Lines, oFile, sep = "\t", quote = FALSE, col.names = FALSE,
              row.names = FALSE)
  iLoop <- iLoop + 1
}
close(con)
********************

With the above N = 1.25 million, it wrote 28 files of about 57 MB each, i.e., a total of about 1.6 GB, and then crashed. I tried other values of N and it crashes at about the same place in terms of total output size, i.e., about 1.6 GB.

Is this due to some limitation of Windows 7, in terms of not keeping the file pointer valid after this size?

Your insight would be very helpful.

Thank you.
HC
Exactly what does "crashed" mean? What was the error message?

Have you tried putting:

rm(Lines)
gc()

at the end of the loop to free up and compact memory?

If you watch the performance, does the R process seem to be growing in terms of the amount of memory being used? You can add:

memory.size()

before the above statements to see how much memory is being used. This is just some more elementary debugging that you will have to learn when using any system.

On Fri, Feb 3, 2012 at 3:22 PM, HC <hcatbr at yahoo.co.in> wrote:
> The readLines command works fine up to a certain limit. Once a few files
> have been written, the R program crashes.
> [...]
> With N = 1.25 million, it wrote 28 files of about 57 MB each. That is a
> total of about 1.6 GB, and then it crashed.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
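Folded into the loop posted earlier, those suggestions would sit roughly as in the sketch below; memory.size() is Windows-only, and everything apart from the three added diagnostic lines is the code from the previous message.

iFile <- "Test.txt"
con <- file(iFile, "r")

N <- 1250000
iLoop <- 1

while (length(Lines <- readLines(con, n = N)) > 0 && iLoop < 41) {
  oFile <- paste("Split_", iLoop, ".txt", sep = "")
  write.table(Lines, oFile, sep = "\t", quote = FALSE, col.names = FALSE,
              row.names = FALSE)
  iLoop <- iLoop + 1

  cat("memory in use (MB):", memory.size(), "\n")  # Windows-only diagnostic
  rm(Lines)   # drop the chunk just written ...
  gc()        # ... and compact memory before reading the next one
}
close(con)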
Thank you, Jim, for your reply.

I could figure out that readLines works fine up to 35,841,335 lines (records). When the next line is to be read, a window pops up with the message "R for Windows GUI front-end has stopped working", with options to close the program or check online for a solution.

The tab-separated .txt file I am working on has 9 columns and about 3.25 billion rows. I suspect that the 35,841,336th line is so big that it outruns the RAM capacity. Perhaps the separators expected after each value are missing, making that one line very big; I am not sure.

Is there any way to skip this line with readLines or any other function? I am hoping that the data after this bad line are good and that I can read them if I can skip the bad one.

Thank you.
HC
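If it really is a single enormous, malformed record, one way to get past it within R might be to scan forward in raw blocks with readBin, count newline bytes, and copy everything after the offending record into a new file that readLines can then process as before. This is only a sketch under those assumptions: it presumes records end with "\n" (byte 10) and that the bad record is indeed line 35,841,336; the output file name and block size are arbitrary.

inFile    <- "Test.txt"
outFile   <- "Test_after_bad_line.txt"
badLine   <- 35841336     # the suspect record
blockSize <- 1e6          # read 1 MB of raw bytes at a time

con <- file(inFile, "rb")
out <- file(outFile, "wb")

linesSeen <- 0            # newlines seen so far
copying   <- FALSE        # TRUE once we are past the bad record

repeat {
  bytes <- readBin(con, what = "raw", n = blockSize)
  if (length(bytes) == 0) break
  if (copying) {
    writeBin(bytes, out)                      # past the bad record: copy as-is
  } else {
    nl <- which(bytes == as.raw(10L))         # positions of "\n" in this block
    if (linesSeen + length(nl) >= badLine) {
      # the newline that ends the bad record falls inside this block
      cut <- nl[badLine - linesSeen]
      if (cut < length(bytes))
        writeBin(bytes[(cut + 1):length(bytes)], out)
      copying <- TRUE
    } else {
      linesSeen <- linesSeen + length(nl)
    }
  }
}

close(out)
close(con)

The new file then starts at line 35,841,337 of the original and can be processed with the readLines loop as before. Copying the remaining data through R this way will be slow for a 160 GB file, but it avoids ever holding the oversized line in memory.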