Dear list,

I need to read in a big table with 487 columns and 238,305 rows (row names and column names are supplied). Is there a fast way to read the table in? I tried read.table() but it seems to take forever :(

Thanks a lot!

Best,
Allen
If they are all numeric, you can use scan() to read them in. With that amount of data, you will need almost 1 GB just to hold the single object. If you want to do any processing, you will probably need a machine with at least 3-4 GB of physical memory, preferably running a 64-bit version of R.

What type of computer are you using? Do you really need all the data in at once, or can you process it in smaller batches (e.g., 20,000 rows at a time)? A little more detail on what you actually want to do with the data would be useful, since it does create a very large object. BTW, how large is the file you are reading, and what is its format? Have you considered a database for this amount of data?

On Nov 9, 2007 11:39 PM, affy snp <affysnp at gmail.com> wrote:
> [original question quoted]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
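A minimal sketch of the scan() approach, using a small temporary file in place of the real 238,305 x 487 table (the file name and dimensions here are placeholders, not the poster's actual file):

```r
## Write a small all-numeric demo table with no headers.
tf <- tempfile()
write.table(matrix(1:20, nrow = 4), tf,
            row.names = FALSE, col.names = FALSE)

## scan() returns a plain numeric vector, which is much faster than
## read.table() for homogeneous data; reshape it into a matrix after.
x <- matrix(scan(tf, what = numeric(0), quiet = TRUE),
            nrow = 4, byrow = TRUE)   # byrow: file is written row by row
dim(x)
```

For the real file one would pass the known row count via nrow= and, if only batches are needed, use scan()'s skip= and nlines= arguments to read 20,000 rows at a time.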
1. You might be able to speed it up somewhat by specifying colClasses=.

2. Another possibility is that the devel version of the sqldf package provides an interface which simplifies reading a data file into sqlite and from there into R. This is particularly useful if you don't want to read it all in. See example 6 on the home page: http://sqldf.googlecode.com

3. If the file doesn't change and it's OK to read it in slowly once, then just read it in slowly and save() the result. Then you can load() it on subsequent runs, which should be fast.

On Nov 9, 2007 11:39 PM, affy snp <affysnp at gmail.com> wrote:
> [original question quoted]
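A small sketch of suggestions 1 and 3 together, using a tiny demo data frame in place of the real file (column names and types here are made up for illustration):

```r
## Demo file: one character column and two numeric-ish columns.
tf <- tempfile()
write.table(data.frame(id = c("a", "b"), v1 = 1:2, v2 = c(0.5, 1.5)),
            tf, row.names = FALSE)

## Suggestion 1: telling read.table() the column types up front
## avoids its type-guessing pass over the whole file.
df <- read.table(tf, header = TRUE,
                 colClasses = c("character", "integer", "numeric"))

## Suggestion 3: read slowly once, save() the object, and load() it
## on later runs, which restores it far faster than re-parsing text.
rda <- tempfile(fileext = ".rda")
save(df, file = rda)
rm(df)
load(rda)   # 'df' is back
```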
Thanks Gabor. I made the column names look like:

probeset
WM806SignalA
WM806call
WM1716SignalA
WM1716call
....

I then tried what you mentioned and got:

> library(sqldf)
Loading required package: gsubfn
Loading required package: proto
> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> myfile <- file("243_47mel_withnormal_expression_log2.txt")
> stmt <- read.table(myfile, nr = 1, as.is = TRUE)
> stmt <- stmt[regexpr("call", stmt) < 0]
> stmt <- paste("select", paste(stmt, collapse = ","), "from myfile")
> DF <- sqldf(stmt, file.format = list(sep = " "))
Error in summary.connection(get(fo, envir)) : invalid connection

How should I correct this?

Thanks!

Allen

On Nov 10, 2007 1:11 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> On Nov 10, 2007 12:25 AM, affy snp <affysnp at gmail.com> wrote:
> > Hi Gabor,
> >
> > Thanks a lot!
> >
> > The header of the big file looks as follows:
> >
> > probe_set
> > WM_806_Signal_A
> > WM_806_call
> > WM_1716_Signal_A
> > WM_1716_call
> > ....
> >
> > I only need those columns whose headers end in _Signal_A.
> >
> > Can you suggest how to use sqldf?
>
> sqlite requires that a single character separate the fields. Use sed or
> another method to reduce multiple spaces to one space in the input
> file and then try something like this:
>
> library(sqldf)
> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
>
> # get headings
> myfile <- file("myfile.dat")
> stmt <- read.table(myfile, nr = 1, as.is = TRUE)
>
> # assume any column with call in its name is to be eliminated
> # and form select statement
> stmt <- stmt[regexpr("call", stmt) < 0]
> stmt <- paste("select", paste(stmt, collapse = ","), "from myfile")
>
> # run it
> myfile <- file("myfile.dat")
> DF <- sqldf(stmt, file.format = list(sep = " "))
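The space-collapsing step Gabor suggests with sed can also be done in R itself; a small sketch on demo lines standing in for the real file:

```r
## sqlite needs a single-character field separator, so collapse any
## run of spaces down to one space before importing.
lines <- c("a  b   c", "1 2    3")      # demo lines with ragged spacing
squeezed <- gsub(" +", " ", lines)      # " +" matches one or more spaces
```

For the real file one would readLines() the input, apply gsub(), and writeLines() the result to a new file.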
You left out the second-to-last line (re-creating the file connection with myfile <- file("myfile.dat") before calling sqldf). Also, did you replace multiple spaces in the input file with a single space?

On Nov 10, 2007 1:26 AM, affy snp <affysnp at gmail.com> wrote:
> [previous message quoted]
Did you read the Note on the help page for read.table, or the 'R Data Import/Export' manual? There are several hints there, some of which will be crucial to doing this reasonably fast.

How big is your computer? That is 116 million items (you haven't told us what type they are), so you will need GBs of RAM, and preferably a 64-bit OS. Otherwise you would be better off using a DBMS to store the data (see the manual mentioned in my first paragraph).

On Fri, 9 Nov 2007, affy snp wrote:
> [original question quoted]

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
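Among the hints in that Note: pre-specify nrows= and colClasses=, and turn off comment scanning with comment.char = "". A sketch with a tiny demo file in place of the real data (the column types here are made up):

```r
## Demo file with a known shape: 3 rows, one integer and one numeric column.
tf <- tempfile()
write.table(data.frame(a = 1:3, b = c(1.1, 2.2, 3.3)),
            tf, row.names = FALSE)

df <- read.table(tf, header = TRUE,
                 nrows = 3,           # known row count helps memory allocation
                 comment.char = "",   # skip scanning for '#' comments
                 colClasses = c("integer", "numeric"))  # no type guessing
```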
Hi Gabor,

I replaced multiple spaces with a single one and tried the code you suggested. I got:

> library(sqldf)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: gsubfn
Loading required package: proto
> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> myfile <- file("243_47mel_withnormal_expression_log2.txt")
> stmt <- read.table(myfile, nr = 1, as.is = TRUE)
> stmt <- stmt[regexpr("call", stmt) < 0]
> stmt <- paste("select", paste(stmt, collapse = ","), "from myfile")
> myfile <- file("243_47mel_withnormal_expression_log2.txt")
> DF <- sqldf(stmt, file.format = list(sep = " "))
Error in try({ :
  RS-DBI driver: (RS_sqlite_import:
  ./243_47mel_withnormal_expression_log2.txt line 6651 expected 488
  columns of data but found 641)
In addition: Warning message:
closing unused connection 3 (243_47mel_withnormal_expression_log2.txt)
Error in sqliteExecStatement(con, statement, bind.data) :
  RS-DBI driver: (error in statement: unrecognized token: "2SignalA")

What can you suggest? Something wrong with the input file that you can think of?

Thanks!

Allen

On Nov 10, 2007 10:37 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Thanks.
>
> On Nov 10, 2007 10:29 AM, affy snp <affysnp at gmail.com> wrote:
> > Gabor,
> >
> > I will do it either later today or tomorrow. Promised.
> >
> > Allen
> >
> > On Nov 10, 2007 10:23 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > > Please try out the sqldf solution as well and let me know
> > > how it compares, since I have never tried anything
> > > this large and would be interested to know.
> > >
> > > On Nov 10, 2007 9:27 AM, affy snp <affysnp at gmail.com> wrote:
> > > > Thanks all for the help and suggestions. By specifying colClasses in
> > > > read.table() and running it on a server with 8 GB of memory, I could
> > > > have the data read in 2 minutes. I will just skip the sqldf method
> > > > for now and get back in a moment.
> > > >
> > > > Best,
> > > > Allen
On Nov 11, 2007 2:28 PM, affy snp <affysnp at gmail.com> wrote:
> [...]
> > DF <- sqldf(stmt, file.format = list(sep = " "))
> Error in try({ :
>   RS-DBI driver: (RS_sqlite_import:
>   ./243_47mel_withnormal_expression_log2.txt line 6651 expected 488
>   columns of data but found 641)

The error message says that it is expecting 488 columns but found 641 columns on line 6651. Examine line 6651 to see what is wrong. Also try it with just the first few rows of the file; if that works, that is even greater evidence that there is something unexpected in your file.
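One way to locate such malformed lines is count.fields(), sketched here with a tiny demo file standing in for the real one (the third demo line deliberately has an extra field):

```r
## Demo file: header plus two data lines, the last with 4 fields not 3.
tf <- tempfile()
writeLines(c("a b c", "1 2 3", "4 5 6 7"), tf)

## count.fields() returns the number of fields on every line, so lines
## whose count differs from the header's are the suspect ones.
n <- count.fields(tf, sep = " ")
bad <- which(n != n[1])   # line numbers to examine
```

On the real file, `which(count.fields("243_47mel_withnormal_expression_log2.txt", sep = " ") != 488)` should point straight at line 6651 and any others like it.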