Dear All,

I have a tab-delimited table of 4 columns and many millions of rows. I don't have enough memory to read.table() that 1 GB file, and I actually have 12 text files like that. Is there a way I can read.table() in just a random 10% of the rows? I was able to do that with the colbycol package, but it is no longer available. Many thanks!!

Stephen HK Wong
Stanford, California 94305-5324
As a start, make sure you specify the 'colClasses' argument. BTW, using that you can even go to the extreme and read one column at a time, if it comes down to that.

To read a 10% subset of the rows, you can use R.filesets as:

  library(R.filesets)
  db <- TabularTextFile(pathname)
  n <- nbrOfRows(db)
  data <- readDataFrame(db, rows=seq(from=1, to=n, length.out=0.10*n))

It is also useful to specify 'colClasses' here. In addition to specifying the classes ordered by column, as for read.table(), you can also specify them by column name (or by regular expressions matching the column names), e.g.

  data <- readDataFrame(db, colClasses=c("*"="NULL", "(x|y)"="integer",
                                         outcome="numeric", "id"="character"),
                        rows=seq(from=1, to=n, length.out=0.10*n))

That 'colClasses' says that the default is to drop all columns, that columns 'x' and 'y' should be read as integers, and so on.

BTW, if you know 'n' upfront you can skip the setup of TabularTextFile and just do:

  data <- readDataFrame(pathname, rows=seq(from=1, to=n, length.out=0.10*n))

Hope this helps,

Henrik
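P.S. If you want a genuinely random 10% rather than an evenly spaced subset, the same 'rows' argument should take a randomly drawn set of row indices as well. A quick, untested sketch (I'm assuming here that 'rows' accepts any vector of row numbers, just as it accepts the seq() vectors above):

  library(R.filesets)
  db <- TabularTextFile(pathname)
  n <- nbrOfRows(db)
  set.seed(42)                                      # make the sample reproducible
  keep <- sort(sample.int(n, size=round(0.10*n)))   # random 10% of the row indices
  data <- readDataFrame(db, rows=keep)

The sort() just keeps the selected rows in file order, which tends to be friendlier when the file is read sequentially.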
When working with datasets too large to fit in memory, it is usually best to use an actual database: read the data into the database, then pull only the records you want into R. There are several packages for working with databases, but two of the simplest are RSQLite and sqldf (installing them will install the database backend for you). The read.csv.sql function in the sqldf package reads a csv file by first loading it into the database and then pulling the desired subset into R (you need to know some basic SQL); all the database work is handled in the background for you. A quick sketch follows below.

-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com
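P.S. For one of your tab-delimited files, the read.csv.sql approach might look roughly like this. This is an untested sketch: the file name is made up, I'm relying on sqldf's convention of referring to the table as 'file' in the SQL statement, and the modulo trick uses SQLite's random() to keep roughly 1 row in 10:

  library(sqldf)
  # Load the tab-delimited file into a temporary SQLite database and pull
  # back only a ~10% random subset of the rows.
  dat <- read.csv.sql("mydata.txt",
                      sql = "select * from file where random() % 10 = 0",
                      header = TRUE, sep = "\t")

Adjust 'header' and 'sep' to match your files; you could also replace the WHERE clause with any other condition to subset on the actual columns.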