At present the example data sets in R libraries are to be given as
expressions that can be read directly into R.  For example, the acid.R
file in the main library looks like

    acid <- data.frame(
        carb = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
        optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
        row.names = paste(1:6))

This is great when you have only a few observations.  I have one
example data set with over 9000 rows and 17 variables.  Even when I
set -v 40, I exhaust the available memory trying to read it in as a
data.frame.  I believe this is because of the recursive nature of the
parsing of data objects.

Are there alternatives that would cause less memory usage?  In
S/S-PLUS the data.dump/data.restore functions use a portable
representation that can be parsed without exponential memory growth.

--
Douglas Bates                        bates@stat.wisc.edu
Statistics Department                608/262-2598
University of Wisconsin - Madison    http://www.stat.wisc.edu/~bates/
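P.S.  A hypothetical reproduction of the problem, in case it helps
(all object and file names here are made up):

    ## Write a 9000 x 17 numeric data set out as one big
    ## data.frame() expression, then try to read it back in.
    m <- matrix(round(runif(9000 * 17), 3), ncol = 17)
    cat("big <- data.frame(\n",
        paste("x", 1:17, " = c(",
              apply(m, 2, paste, collapse = ", "), ")",
              sep = "", collapse = ",\n"),
        ")\n", file = "big.R")
    source("big.R")  # parsing this single expression exhausts the heap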
On 24 Feb 1998, Douglas Bates wrote:

> At present the example data sets in R libraries are to be given as
> expressions that can be read directly into R.  For example, the acid.R
> file in the main library looks like
>
>     acid <- data.frame(
>         carb = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
>         optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
>         row.names = paste(1:6))
>
> This is great when you have only a few observations.  I have one
> example data set with over 9000 rows and 17 variables.  Even when I
> set -v 40, I exhaust the available memory trying to read it in as a
> data.frame.

You need to specify -n some_large_number to read in large data sets;
specifying -v is not enough.  You can see this by using gcinfo(T) to
report heap and cons cell usage at each garbage collection.

> Are there alternatives that would cause less memory usage?  In
> S/S-PLUS the data.dump/data.restore functions use a portable
> representation that can be parsed without exponential memory growth.

The R save() format is portable, at least among Unices.  You could
have the data.R file contain the command

    eval(load("data.Rdata"), .GlobalEnv)

where "data.Rdata" is the saved file.  There is an ascii = T option,
which might make the file more portable to other operating systems.
I haven't checked, but I assume that this format can be read more
efficiently than sourcing R code.

    -thomas
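P.S.  In concrete terms the recipe might look as follows; the flag
values are only a guess for your 9000 x 17 set, and the save() call is
written with the argument names I would expect, so do check them:

    ## Start R with more cons cells (-n) as well as a bigger vector
    ## heap (-v, in Mb), e.g.
    ##     R -v 40 -n 1000000

    ## In that session, watch the collector while building the data
    ## frame, then save the object once:
    gcinfo(T)
    save(acid, file = "data.Rdata", ascii = T)

    ## data.R then needs to contain only the single command
    eval(load("data.Rdata"), .GlobalEnv)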
>>>>> "DougB" == Douglas Bates <bates@stat.wisc.edu> writes:

  DougB> At present the example data sets in R libraries are to be
  DougB> given as expressions that can be read directly into R.  For
  DougB> example, the acid.R file in the main library looks like

  DougB>     acid <- data.frame(
  DougB>         carb = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
  DougB>         optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
  DougB>         row.names = paste(1:6))

  DougB> This is great when you have only a few observations.  I have
  DougB> one example data set with over 9000 rows and 17 variables.
  DougB> Even when I set -v 40, I exhaust the available memory trying
  DougB> to read it in as a data.frame.  I believe this is because of
  DougB> the recursive nature of the parsing of data objects.

Yes.

  DougB> Are there alternatives that would cause less memory usage?

Yes, but only in the 0.62 development version.  The current 0.62
``standard'' is:

 - if a 'data' file ends in .R, source(.) is used to read it;
 - if it ends in .tab, read.table(..., header = TRUE) is used to
   read it.

(You find the new data(.) function in src/library/base/data in the
R snapshot.)

Note that this is still not really satisfactory for large data files,
since read.table(.) is not really efficient: it first reads everything
as a character matrix and then converts variable by variable, some to
numeric, some to factor.

On the other hand: does it really make sense to distribute huge
example data sets such as yours?  If yes, AND if you have only numeric
data, I'd propose the following (see the filled-in sketch below):

 1) Create a <pkg>/data/dougBex.R file which contains only something
    like

        dougBex <- as.data.frame(
            matrix(scan(system.file("<pkg>/data/dougBex.dat")),
                   ncol = ..., dimnames = ...))

 2) Create <pkg>/data/dougBex.dat to contain all your data,
    white-space delimited, numeric.

  DougB> In S/S-PLUS the data.dump/data.restore functions use a
  DougB> portable representation that can be parsed without
  DougB> exponential memory growth.

Hmm, yes; we have been longing for someone to write
data.dump/data.restore for R.  Any volunteers?

--
Martin Maechler <maechler@stat.math.ethz.ch>          <><
Seminar fuer Statistik, ETH-Zentrum SOL G1; Sonneggstr. 33
ETH (Federal Inst. Technology)  8092 Zurich  SWITZERLAND
phone: x-41-1-632-3408          fax: ...-1086
http://www.stat.math.ethz.ch/~maechler/
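P.S.  Filled in for your 9000 x 17 numeric set, step 1) might read as
follows; dougBex, the v1..v17 column names, and byrow = TRUE (which
assumes one observation per line in the .dat file) are all just
placeholders for whatever suits your data:

    ## <pkg>/data/dougBex.R
    dougBex <- as.data.frame(
        matrix(scan(system.file("<pkg>/data/dougBex.dat")),
               ncol = 17, byrow = TRUE,
               dimnames = list(NULL, paste("v", 1:17, sep = ""))))

    ## scan() fills one flat numeric vector, so none of the recursive
    ## parsing of a 9000-row data.frame() expression is involved.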