Gopi Goswami
2008-Apr-14 11:47 UTC
[Rd] Doing the right amount of copy for large data frames.
Hi there,

Problem ::
When one tries to change one or some of the columns of a data.frame, R makes a
copy of the whole data.frame using the '*tmp*' mechanism (this does not happen
for components of a list; tracemem() on R-2.6.2 says so).

Suggested solution ::
Store the columns of the data.frame as a list inside an environment slot of an
S4 class, and define the '[', '[<-' etc. operators using setMethod() and
setReplaceMethod().

Question ::
This implementation will violate the copy-on-modify principle of R (since
environments are not copied), but it will save a lot of memory. Do you see any
other obvious problem(s) with the idea? Have you seen a related setup
implemented or considered before (apart from packages like filehash, ff, and
the database-related ones for saving memory)?

Implementation code snippet ::

### The S4 class.
setClass('DataFrame',
         representation(data = 'data.frame', nrow = 'numeric',
                        ncol = 'numeric', store = 'environment'),
         prototype(data = data.frame(), nrow = 0, ncol = 0))

setMethod('initialize', 'DataFrame', function(.Object, ...) {
    .Object <- callNextMethod(.Object, ...)
    .Object@store <- new.env(hash = TRUE)
    assign('data', as.list(.Object@data), .Object@store)
    .Object@nrow <- nrow(.Object@data)
    .Object@ncol <- ncol(.Object@data)
    .Object@data <- data.frame()
    .Object
})

### Usage:
nn <- 10
## dd1 below could possibly be created by read.table or scan and data.frame
dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
dd2 <- new('DataFrame', data = dd1)
rm(dd1)
## Now work with dd2

Thanks a lot,
Gopi Goswami.
PhD, Statistics, 2005
http://gopi-goswami.net/index.html
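The snippet above defines the class and its initializer but not the '[' /
'[<-' style methods it mentions. A minimal sketch of what '[[' and '[[<-'
could look like for this design follows; it assumes the 'DataFrame' definition
above has been run, and the method bodies are illustrative rather than code
from the original post (whether they actually avoid the copies the post
worries about is exactly what the replies below discuss).

setMethod('[[', 'DataFrame', function(x, i, j, ...) {
    ## Fetch one column from the list kept in the environment slot.
    get('data', envir = x@store)[[i]]
})

setReplaceMethod('[[', 'DataFrame', function(x, i, j, ..., value) {
    ## Update one column in the stored list and write the list back; because
    ## the environment is shared (not copied), every reference to this
    ## 'DataFrame' sees the change.
    dat <- get('data', envir = x@store)
    dat[[i]] <- value
    assign('data', dat, envir = x@store)
    x
})

## Usage sketch (continuing from 'dd2' above):
##   dd2[['xx']] <- dd2[['xx']] * 2
##   head(dd2[['xx']])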
Peter Dalgaard
2008-Apr-14 14:11 UTC
[Rd] Doing the right amount of copy for large data frames.
Gopi Goswami wrote:

> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list; tracemem() on R-2.6.2 says so).
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside an environment slot of
> an S4 class, and define the '[', '[<-' etc. operators using setMethod() and
> setReplaceMethod().
>
> Question ::
> This implementation will violate the copy-on-modify principle of R (since
> environments are not copied), but it will save a lot of memory. Do you see
> any other obvious problem(s) with the idea? Have you seen a related setup
> implemented or considered before (apart from packages like filehash, ff, and
> the database-related ones for saving memory)?

A short --- although crass --- reply is that you should not meddle with this
until you know _exactly_ what you are doing. Two main points are that (a)
copying of data frames in principle only copies pointers to each variable,
until the actual contents are modified, and (b) breaking copy-on-modify (and
consequently, in effect, also pass-by-value) semantics is a source of
unhappiness.

R does duplicate rather more than it needs to, but the main reason probably
lies in its rudimentary reference tracking (the NAMED entry in the object
header structure). Some of us do wish we could try and fix this at some point,
but it would be a major undertaking. (There are a zillion places where we'd
need to do extra housekeeping rather than let the garbage collector tidy up
after us. Also, reference-counting solutions from other computer languages do
not apply, because R can have circular references.)

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
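A small illustration of points (a) and (b), separate from Peter's message:
with an R built with --enable-memory-profiling, tracemem() shows that giving a
data frame a second name does not copy it, while modifying it through either
name does.

df1 <- data.frame(x = rnorm(5), y = rnorm(5))
tracemem(df1)        # start tracing duplications of this data frame

df2 <- df1           # no tracemem output: only another reference is created
df2$x <- df2$x + 1   # tracemem reports a duplication before the column changes

## df1 is left untouched -- this is the copy-on-modify / pass-by-value
## behaviour that an environment-backed data frame would silently give up.
identical(df1$x + 1, df2$x)   # TRUE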
Tony Plate
2008-Apr-14 14:18 UTC
[Rd] Doing the right amount of copy for large data frames.
Gopi Goswami wrote:

> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list; tracemem() on R-2.6.2 says so).
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside an environment slot of
> an S4 class, and define the '[', '[<-' etc. operators using setMethod() and
> setReplaceMethod().
>
> Question ::
> This implementation will violate the copy-on-modify principle of R (since
> environments are not copied), but it will save a lot of memory. Do you see
> any other obvious problem(s) with the idea?

Well, because it violates the copy-on-modify principle it can potentially
break code that depends on this principle. I don't know how much there is --
did you try to see whether R and the recommended packages still pass their
checks with this change in place?

> Have you seen a related setup implemented / considered before (apart from
> packages like filehash, ff, and the database-related ones for saving
> memory)?

I've frequently used a personal package that stores array data in a file
(like ff). It works fine, and I partially get around the problem of violating
the copy-on-modify principle by having a read-only flag in the object: when
the flag is set to allow modification I have to be careful, but after I set it
to read-only I can use the object more freely, knowing that if some function
does attempt to modify it, it will stop with an error.

In this particular case, why not just track down why data.frame modification
copies the entire object and suggest a change so that only the column being
changed is copied? (This should be possible if list modification doesn't copy
all components.)

-- Tony Plate
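A rough sketch of the read-only-flag idea (this is not Tony's package, just an
illustration; the class name 'SharedData' and its layout are invented, and
columns are stored one per binding in the environment rather than as a single
list):

setClass('SharedData',
         representation(store = 'environment', readonly = 'logical'),
         prototype(readonly = FALSE))

setMethod('[[', 'SharedData', function(x, i, j, ...) {
    ## Read a column from the shared environment.
    get(i, envir = x@store)
})

setReplaceMethod('[[', 'SharedData', function(x, i, j, ..., value) {
    ## Refuse any modification once the object has been marked read-only.
    if (x@readonly)
        stop("object is read-only; modification refused")
    assign(i, value, envir = x@store)
    x
})

## Usage sketch:
##   sdat <- new('SharedData', store = new.env())
##   sdat[['xx']] <- rnorm(10)    # allowed while readonly is FALSE
##   sdat@readonly <- TRUE
##   sdat[['xx']] <- rnorm(10)    # now stops with an error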
Martin Morgan
2008-Apr-14 15:59 UTC
[Rd] Doing the right amount of copy for large data frames.
Hi Gopi,

"Gopi Goswami" <grgoswami at gmail.com> writes:

> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list; tracemem() on R-2.6.2 says so).
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside an environment slot of
> an S4 class, and define the '[', '[<-' etc. operators using setMethod() and
> setReplaceMethod().

The Bioconductor package Biobase has a class 'ExpressionSet' with a slot
assayData. By default assayData is an environment that is 'locked', so it
can't be modified casually. The interface to ExpressionSet unlocks the
environment, and copies and modifies it when necessary. This is not quite the
same as you propose, but it has some similar characteristics.

I've spent a lot of time with this data structure, and think it borders on one
of those ideas that 'seemed like a good idea at the time'. You end up using
R-level tools to manage memory. Copy-on-change is better than you might
naively think at not making unnecessary copies. S4 carries significant
overhead, including copies during method dispatch, that works against you
(subsetting an expression set in an OOP way, with no behind-the-scenes tricks,
makes *5* copies of the S4 instance, though perhaps these are light-weight
because the big data is in an environment). And in the meantime computers have
gotten faster and bigger, and the 'big' data of ExpressionSets are now only
modestly sized or even small.

A somewhat different approach is taken in the Biostrings package, for instance
DNAStringSet, where the original object is 'read-only'. The user is presented
with a 'view' into the object; changing the view (subsetting) changes the
indices in the view but not the original data. This is both fast and memory
efficient. It is a read-only solution, though.

Hope that helps,

Martin

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
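A minimal sketch of the two mechanisms Martin describes, using only base-R
tools (the actual Biobase and Biostrings classes are considerably richer than
this, and the helper names below are made up for illustration):

## 1. A 'locked' environment in the spirit of assayData: once locked, casual
##    modification fails with an error instead of silently breaking sharing.
store <- new.env()
store$exprs <- matrix(rnorm(20), nrow = 4)
lockEnvironment(store, bindings = TRUE)
try(store$exprs <- store$exprs * 2)   # error: the binding is locked

## 2. A minimal 'view' in the spirit of DNAStringSet: subsetting changes only
##    the indices held by the view, never the shared, read-only data.
make_view   <- function(data, idx = seq_along(data)) list(data = data, idx = idx)
subset_view <- function(view, i) { view$idx <- view$idx[i]; view }
as_vector   <- function(view) view$data[view$idx]

big <- rnorm(1e6)
v1  <- make_view(big)
v2  <- subset_view(v1, 1:10)   # only the index vector changes; 'big' is shared
as_vector(v2)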