Dear list,

I need to read in a big table with 487 columns and 238,305 rows (row names and column names are supplied). Is there a fast way to read the table in? I tried read.table() but it seems to take forever :(

Thanks a lot!

Best,
Allen
If they are all numeric, you can use scan() to read them in. With that amount of data, you will need almost 1 GB just to hold the single object. If you want to do any processing, you will probably need a machine with at least 3-4 GB of physical memory, preferably running a 64-bit version of R.

What type of computer are you using? Do you really need all the data in at once, or can you process it in smaller batches (e.g., 20,000 rows at a time)? A little more detail on what you actually want to do with the data would be useful, since it does create a very large object. BTW, how large is the file you are reading, and what is its format? Have you considered a database for this amount of data?

On Nov 9, 2007 11:39 PM, affy snp <affysnp at gmail.com> wrote:
> [original question quoted]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
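A minimal sketch of the scan() approach, using a small temporary file in place of the real 238,305 x 487 table (the file name and dimensions here are placeholders, not the poster's actual file):

```r
## Write a small all-numeric demo table with no headers.
tf <- tempfile()
write.table(matrix(1:20, nrow = 4), tf,
            row.names = FALSE, col.names = FALSE)

## scan() returns a plain numeric vector, which is much faster than
## read.table() for homogeneous data; reshape it into a matrix after.
x <- matrix(scan(tf, what = numeric(0), quiet = TRUE),
            nrow = 4, byrow = TRUE)   # byrow: file is written row by row
dim(x)
```

For the real file one would pass the known row count via nrow= and, if only batches are needed, use scan()'s skip= and nlines= arguments to read 20,000 rows at a time.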
1. You might be able to speed it up somewhat by specifying colClasses=.

2. Another possibility is that the devel version of the sqldf package provides an interface which simplifies reading a data file into sqlite and from there into R. This is particularly useful if you don't want to read it all in. See example 6 on the home page: http://sqldf.googlecode.com

3. If the file doesn't change and it's OK to read it in slowly once, then just read it in slowly and save() the result. Then you can load() it on subsequent runs, which should be fast.

On Nov 9, 2007 11:39 PM, affy snp <affysnp at gmail.com> wrote:
> [original question quoted]
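A small sketch of suggestions 1 and 3 together, using a tiny demo data frame in place of the real file (column names and types here are made up for illustration):

```r
## Demo file: one character column and two numeric-ish columns.
tf <- tempfile()
write.table(data.frame(id = c("a", "b"), v1 = 1:2, v2 = c(0.5, 1.5)),
            tf, row.names = FALSE)

## Suggestion 1: telling read.table() the column types up front
## avoids its type-guessing pass over the whole file.
df <- read.table(tf, header = TRUE,
                 colClasses = c("character", "integer", "numeric"))

## Suggestion 3: read slowly once, save() the object, and load() it
## on later runs, which restores it far faster than re-parsing text.
rda <- tempfile(fileext = ".rda")
save(df, file = rda)
rm(df)
load(rda)   # 'df' is back
```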
Thanks Gabor. I made the column names look like:

probeset
WM806SignalA
WM806call
WM1716SignalA
WM1716call
....

I then tried what you mentioned and got:

> library(sqldf)
Loading required package: gsubfn
Loading required package: proto
> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> myfile <- file("243_47mel_withnormal_expression_log2.txt")
> stmt <- read.table(myfile, nr = 1, as.is = TRUE)
> stmt <- stmt[regexpr("call", stmt) < 0]
> stmt <- paste("select", paste(stmt, collapse = ","), "from myfile")
> DF <- sqldf(stmt, file.format = list(sep = " "))
Error in summary.connection(get(fo, envir)) : invalid connection

How should I correct this?

Thanks!

Allen

On Nov 10, 2007 1:11 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> On Nov 10, 2007 12:25 AM, affy snp <affysnp at gmail.com> wrote:
> > Hi Gabor,
> >
> > Thanks a lot!
> >
> > The header of the big file looks as follows:
> >
> > probe_set
> > WM_806_Signal_A
> > WM_806_call
> > WM_1716_Signal_A
> > WM_1716_call
> > ....
> >
> > I only need those columns whose headers end in _Signal_A.
> >
> > Can you suggest how to use sqldf?
>
> sqlite requires that a single character separate the fields. Use sed or
> another method to reduce multiple spaces to one space in the input
> file and then try something like this:
>
> library(sqldf)
> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
>
> # get headings
> myfile <- file("myfile.dat")
> stmt <- read.table(myfile, nr = 1, as.is = TRUE)
>
> # assume any column with call in its name is to be eliminated
> # and form select statement
> stmt <- stmt[regexpr("call", stmt) < 0]
> stmt <- paste("select", paste(stmt, collapse = ","), "from myfile")
>
> # run it
> myfile <- file("myfile.dat")
> DF <- sqldf(stmt, file.format = list(sep = " "))
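The space-collapsing step Gabor suggests with sed can also be done in R itself; a small sketch on demo lines standing in for the real file:

```r
## sqlite needs a single-character field separator, so collapse any
## run of spaces down to one space before importing.
lines <- c("a  b   c", "1 2    3")      # demo lines with ragged spacing
squeezed <- gsub(" +", " ", lines)      # " +" matches one or more spaces
```

For the real file one would readLines() the input, apply gsub(), and writeLines() the result to a new file.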
You left out the second-to-last line (re-creating the file connection with myfile <- file("myfile.dat") before calling sqldf). Also, did you replace multiple spaces in the input file with a single space?

On Nov 10, 2007 1:26 AM, affy snp <affysnp at gmail.com> wrote:
> [previous message quoted]
Did you read the Note on the help page for read.table, or the 'R Data Import/Export' manual? There are several hints there, some of which will be crucial to doing this reasonably fast.

How big is your computer? That is 116 million items (you haven't told us what type they are), so you will need GBs of RAM, and preferably a 64-bit OS. Otherwise you would be better off using a DBMS to store the data (see the manual mentioned in my first paragraph).

On Fri, 9 Nov 2007, affy snp wrote:
> [original question quoted]

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
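Among the hints in that Note: pre-specify nrows= and colClasses=, and turn off comment scanning with comment.char = "". A sketch with a tiny demo file in place of the real data (the column types here are made up):

```r
## Demo file with a known shape: 3 rows, one integer and one numeric column.
tf <- tempfile()
write.table(data.frame(a = 1:3, b = c(1.1, 2.2, 3.3)),
            tf, row.names = FALSE)

df <- read.table(tf, header = TRUE,
                 nrows = 3,           # known row count helps memory allocation
                 comment.char = "",   # skip scanning for '#' comments
                 colClasses = c("integer", "numeric"))  # no type guessing
```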
Hi Gabor,

I replaced multiple spaces with a single one and tried the code you suggested. I got:

> library(sqldf)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: gsubfn
Loading required package: proto
> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> myfile <- file("243_47mel_withnormal_expression_log2.txt")
> stmt <- read.table(myfile, nr = 1, as.is = TRUE)
> stmt <- stmt[regexpr("call", stmt) < 0]
> stmt <- paste("select", paste(stmt, collapse = ","), "from myfile")
> myfile <- file("243_47mel_withnormal_expression_log2.txt")
> DF <- sqldf(stmt, file.format = list(sep = " "))
Error in try({ :
  RS-DBI driver: (RS_sqlite_import:
  ./243_47mel_withnormal_expression_log2.txt line 6651 expected 488
  columns of data but found 641)
In addition: Warning message:
closing unused connection 3 (243_47mel_withnormal_expression_log2.txt)
Error in sqliteExecStatement(con, statement, bind.data) :
  RS-DBI driver: (error in statement: unrecognized token: "2SignalA")

What can you suggest? Something wrong with the input file that you can think of?

Thanks!

Allen

On Nov 10, 2007 10:37 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Thanks.
>
> On Nov 10, 2007 10:29 AM, affy snp <affysnp at gmail.com> wrote:
> > Gabor,
> >
> > I will do it either later today or tomorrow. Promised.
> >
> > Allen
> >
> > On Nov 10, 2007 10:23 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > > Please try out the sqldf solution as well and let me know
> > > how it compares, since I have never tried anything
> > > this large and would be interested to know.
> > >
> > > On Nov 10, 2007 9:27 AM, affy snp <affysnp at gmail.com> wrote:
> > > > Thanks all for the help and suggestions. By specifying colClasses in
> > > > read.table() and running it on a server with 8 GB of memory, I could
> > > > have the data read in 2 minutes. I will just skip the sqldf method
> > > > for now and get back in a moment.
> > > >
> > > > Best,
> > > > Allen
On Nov 11, 2007 2:28 PM, affy snp <affysnp at gmail.com> wrote:
> [...]
> > DF <- sqldf(stmt, file.format = list(sep = " "))
> Error in try({ :
>   RS-DBI driver: (RS_sqlite_import:
>   ./243_47mel_withnormal_expression_log2.txt line 6651 expected 488
>   columns of data but found 641)

The error message says that it is expecting 488 columns but found 641 columns on line 6651. Examine line 6651 to see what is wrong. Also try it with just the first few rows of the file; if that works, that is even greater evidence that there is something unexpected in your file.
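One way to locate such malformed lines is count.fields(), sketched here with a tiny demo file standing in for the real one (the third demo line deliberately has an extra field):

```r
## Demo file: header plus two data lines, the last with 4 fields not 3.
tf <- tempfile()
writeLines(c("a b c", "1 2 3", "4 5 6 7"), tf)

## count.fields() returns the number of fields on every line, so lines
## whose count differs from the header's are the suspect ones.
n <- count.fields(tf, sep = " ")
bad <- which(n != n[1])   # line numbers to examine
```

On the real file, `which(count.fields("243_47mel_withnormal_expression_log2.txt", sep = " ") != 488)` should point straight at line 6651 and any others like it.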