Gopi Goswami
2008-Apr-14 11:47 UTC
[Rd] Doing the right amount of copy for large data frames.
Hi there,

Problem ::
When one tries to change one or some of the columns of a data.frame, R makes a
copy of the whole data.frame using the '*tmp*' mechanism (this does not happen
for components of a list; tracemem() on R-2.6.2 says so).

Suggested solution ::
Store the columns of the data.frame as a list inside an environment slot of an
S4 class, and define the '[', '[<-' etc. operators using setMethod() and
setReplaceMethod().

Question ::
This implementation will violate the copy-on-modify principle of R (since
environments are not copied), but it will save a lot of memory. Do you see any
other obvious problem(s) with the idea? Have you seen a related setup
implemented or considered before (apart from packages like filehash, ff, and
the database-related ones for saving memory)?

Implementation code snippet ::

### The S4 class.
setClass('DataFrame',
         representation(data = 'data.frame', nrow = 'numeric',
                        ncol = 'numeric', store = 'environment'),
         prototype(data = data.frame(), nrow = 0, ncol = 0))

setMethod('initialize', 'DataFrame', function(.Object, ...) {
    .Object <- callNextMethod(.Object, ...)
    .Object@store <- new.env(hash = TRUE)
    assign('data', as.list(.Object@data), .Object@store)
    .Object@nrow <- nrow(.Object@data)
    .Object@ncol <- ncol(.Object@data)
    .Object@data <- data.frame()
    .Object
})

### Usage:
nn <- 10
## dd1 below could possibly be created by read.table or scan and data.frame
dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
dd2 <- new('DataFrame', data = dd1)
rm(dd1)
## Now work with dd2

Thanks a lot,
Gopi Goswami.
PhD, Statistics, 2005
http://gopi-goswami.net/index.html
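The snippet above defines the class and its initializer but not the '[' /
'[<-' style methods it mentions. A minimal sketch of what '[[' and '[[<-'
could look like for this design follows; it assumes the 'DataFrame' definition
above has been run, and the method bodies are illustrative rather than code
from the original post (whether they actually avoid the copies the post
worries about is exactly what the replies below discuss).

setMethod('[[', 'DataFrame', function(x, i, j, ...) {
    ## Fetch one column from the list kept in the environment slot.
    get('data', envir = x@store)[[i]]
})

setReplaceMethod('[[', 'DataFrame', function(x, i, j, ..., value) {
    ## Update one column in the stored list and write the list back; because
    ## the environment is shared (not copied), every reference to this
    ## 'DataFrame' sees the change.
    dat <- get('data', envir = x@store)
    dat[[i]] <- value
    assign('data', dat, envir = x@store)
    x
})

## Usage sketch (continuing from 'dd2' above):
##   dd2[['xx']] <- dd2[['xx']] * 2
##   head(dd2[['xx']])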
Peter Dalgaard
2008-Apr-14 14:11 UTC
[Rd] Doing the right amount of copy for large data frames.
Gopi Goswami wrote:

> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list; tracemem() on R-2.6.2 says so).
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside an environment slot of
> an S4 class, and define the '[', '[<-' etc. operators using setMethod() and
> setReplaceMethod().
>
> Question ::
> This implementation will violate the copy-on-modify principle of R (since
> environments are not copied), but it will save a lot of memory. Do you see
> any other obvious problem(s) with the idea? Have you seen a related setup
> implemented or considered before (apart from packages like filehash, ff, and
> the database-related ones for saving memory)?

A short --- although crass --- reply is that you should not meddle with this
until you know _exactly_ what you are doing. Two main points are that (a)
copying of data frames in principle only copies pointers to each variable,
until the actual contents are modified, and (b) breaking copy-on-modify (and
consequently, in effect, also pass-by-value) semantics is a source of
unhappiness.

R does duplicate rather more than it needs to, but the main reason probably
lies in its rudimentary reference tracking (the NAMED entry in the object
header structure). Some of us do wish we could try and fix this at some point,
but it would be a major undertaking. (There are a zillion places where we'd
need to do extra housekeeping rather than let the garbage collector tidy up
after us. Also, reference-counting solutions from other computer languages do
not apply, because R can have circular references.)

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
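A small illustration of points (a) and (b), separate from Peter's message:
with an R built with --enable-memory-profiling, tracemem() shows that giving a
data frame a second name does not copy it, while modifying it through either
name does.

df1 <- data.frame(x = rnorm(5), y = rnorm(5))
tracemem(df1)        # start tracing duplications of this data frame

df2 <- df1           # no tracemem output: only another reference is created
df2$x <- df2$x + 1   # tracemem reports a duplication before the column changes

## df1 is left untouched -- this is the copy-on-modify / pass-by-value
## behaviour that an environment-backed data frame would silently give up.
identical(df1$x + 1, df2$x)   # TRUE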
Tony Plate
2008-Apr-14 14:18 UTC
[Rd] Doing the right amount of copy for large data frames.
Gopi Goswami wrote:

> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list; tracemem() on R-2.6.2 says so).
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside an environment slot of
> an S4 class, and define the '[', '[<-' etc. operators using setMethod() and
> setReplaceMethod().
>
> Question ::
> This implementation will violate the copy-on-modify principle of R (since
> environments are not copied), but it will save a lot of memory. Do you see
> any other obvious problem(s) with the idea?

Well, because it violates the copy-on-modify principle it can potentially
break code that depends on this principle. I don't know how much there is --
did you try to see whether R and the recommended packages still pass their
checks with this change in place?

> Have you seen a related setup implemented / considered before (apart from
> packages like filehash, ff, and the database-related ones for saving
> memory)?

I've frequently used a personal package that stores array data in a file
(like ff). It works fine, and I partially get around the problem of violating
the copy-on-modify principle by having a read-only flag in the object: when
the flag is set to allow modification I have to be careful, but after I set it
to read-only I can use the object more freely, knowing that if some function
does attempt to modify it, it will stop with an error.

In this particular case, why not just track down why data.frame modification
copies the entire object and suggest a change so that only the column being
changed is copied? (This should be possible if list modification doesn't copy
all components.)

-- Tony Plate
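A rough sketch of the read-only-flag idea (this is not Tony's package, just an
illustration; the class name 'SharedData' and its layout are invented, and
columns are stored one per binding in the environment rather than as a single
list):

setClass('SharedData',
         representation(store = 'environment', readonly = 'logical'),
         prototype(readonly = FALSE))

setMethod('[[', 'SharedData', function(x, i, j, ...) {
    ## Read a column from the shared environment.
    get(i, envir = x@store)
})

setReplaceMethod('[[', 'SharedData', function(x, i, j, ..., value) {
    ## Refuse any modification once the object has been marked read-only.
    if (x@readonly)
        stop("object is read-only; modification refused")
    assign(i, value, envir = x@store)
    x
})

## Usage sketch:
##   sdat <- new('SharedData', store = new.env())
##   sdat[['xx']] <- rnorm(10)    # allowed while readonly is FALSE
##   sdat@readonly <- TRUE
##   sdat[['xx']] <- rnorm(10)    # now stops with an error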
Martin Morgan
2008-Apr-14 15:59 UTC
[Rd] Doing the right amount of copy for large data frames.
Hi Gopi,

"Gopi Goswami" <grgoswami at gmail.com> writes:

> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list; tracemem() on R-2.6.2 says so).
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside an environment slot of
> an S4 class, and define the '[', '[<-' etc. operators using setMethod() and
> setReplaceMethod().

The Bioconductor package Biobase has a class 'ExpressionSet' with a slot
assayData. By default assayData is an environment that is 'locked', so it
can't be modified casually. The interface to ExpressionSet unlocks the
environment, and copies and modifies it when necessary. This is not quite the
same as you propose, but it has some similar characteristics.

I've spent a lot of time with this data structure, and think it borders on one
of those ideas that 'seemed like a good idea at the time'. You end up using
R-level tools to manage memory. Copy-on-change is better than you might
naively think at not making unnecessary copies. S4 carries significant
overhead, including copies during method dispatch, that works against you
(subsetting an expression set in an OOP way, with no behind-the-scenes tricks,
makes *5* copies of the S4 instance, though perhaps these are light-weight
because the big data is in an environment). And in the meantime computers have
gotten faster and bigger, and the 'big' data of ExpressionSets are now only
modestly sized or even small.

A somewhat different approach is taken in the Biostrings package, for instance
DNAStringSet, where the original object is 'read-only'. The user is presented
with a 'view' into the object; changing the view (subsetting) changes the
indices in the view but not the original data. This is both fast and memory
efficient. It is a read-only solution, though.

Hope that helps,

Martin

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
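A minimal sketch of the two mechanisms Martin describes, using only base-R
tools (the actual Biobase and Biostrings classes are considerably richer than
this, and the helper names below are made up for illustration):

## 1. A 'locked' environment in the spirit of assayData: once locked, casual
##    modification fails with an error instead of silently breaking sharing.
store <- new.env()
store$exprs <- matrix(rnorm(20), nrow = 4)
lockEnvironment(store, bindings = TRUE)
try(store$exprs <- store$exprs * 2)   # error: the binding is locked

## 2. A minimal 'view' in the spirit of DNAStringSet: subsetting changes only
##    the indices held by the view, never the shared, read-only data.
make_view   <- function(data, idx = seq_along(data)) list(data = data, idx = idx)
subset_view <- function(view, i) { view$idx <- view$idx[i]; view }
as_vector   <- function(view) view$data[view$idx]

big <- rnorm(1e6)
v1  <- make_view(big)
v2  <- subset_view(v1, 1:10)   # only the index vector changes; 'big' is shared
as_vector(v2)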