Tony Plate
2007-Sep-10 21:16 UTC
[R] [R-pkgs] new package 'trackObjs' - mirror objects to files, provide summaries & modification times
From ?trackObjs:

Overview of trackObjs package

Description:

     The trackObjs package sets up a link between R objects in memory
     and files on disk so that objects are automatically resaved to
     files when they are changed. R objects in files are read in on
     demand and do not consume memory prior to being referenced. The
     trackObjs package also tracks times when objects are created and
     modified, and caches some basic characteristics of objects to
     allow for fast summaries of objects.

     Each object is stored in a separate RData file using the standard
     format as used by 'save()', so that objects can be manually picked
     out of or added to the trackObjs database if needed.

     Tracking works by replacing a tracked variable by an
     'activeBinding', which when accessed looks up information in an
     associated 'tracking environment' and reads or writes the
     corresponding RData file and/or gets or assigns the variable in
     the tracking environment.

Details:

     There are three main reasons to use the 'trackObjs' package:

        * conveniently handle many moderately-large objects that would
          collectively exhaust memory or be inconvenient to manage in
          files by manually using 'save()' and 'load()'

        * keep track of creation and modification times on objects

        * get fast summaries of basic characteristics of objects -
          class, size, dimension, etc.

     There is an option to control whether tracked objects are cached
     in memory as well as being stored on disk. By default, objects
     are not cached. To save time when working with collections of
     objects that will all fit in memory, turn on caching with
     'track.options(cache=TRUE)', or start tracking with
     'track.start(..., cache=TRUE)'.
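The active-binding mechanism described above can be illustrated with base R alone. The following is only a minimal sketch under stated assumptions, not the trackObjs implementation: 'bind_to_file', 'track_dir' and 'env' are hypothetical names invented for this illustration, and the sketch does no caching and records no summary information. It uses only 'makeActiveBinding()', 'save()' and 'load()' from base R.

```r
# Hypothetical helper (not part of trackObjs): bind `name` in `envir` to an
# RData file in `dir`, so every read loads the file and every write resaves it.
bind_to_file <- function(name, dir, envir) {
  file <- file.path(dir, paste0(name, ".rda"))
  makeActiveBinding(name, function(value) {
    scratch <- new.env()
    if (missing(value)) {
      # Read access: load the object from its file on demand
      load(file, envir = scratch)
      get(name, envir = scratch)
    } else {
      # Write access: resave the new value under the same name
      assign(name, value, envir = scratch)
      save(list = name, file = file, envir = scratch)
      invisible(value)
    }
  }, envir)
}

# A temporary directory stands in for a tracking directory (assumption)
track_dir <- tempfile("track-sketch")
dir.create(track_dir)
env <- new.env()

bind_to_file("x", track_dir, env)
env$x <- 123          # first write creates x.rda
first_read <- env$x   # read comes back from disk
env$x <- 456          # assignment triggers a resave
```

Because the value lives only in the file, nothing is held in memory between accesses; a cached variant would additionally keep a copy in an environment, at the cost of memory.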
     Here is a brief example of tracking some variables in the global
     environment:

     > library(trackObjs)
     > track.start("tmp1")
     > x <- 123                        # Not yet tracked
     > track(x)                        # Variable 'x' is now tracked
     > track(y <- matrix(1:6, ncol=2)) # 'y' is assigned & tracked
     > z1 <- list("a", "b", "c")
     > z2 <- Sys.time()
     > track(list=c("z1", "z2"))       # Track a bunch of variables
     > track.summary(size=F)           # See a summary of tracked vars
                 class    mode extent length            modified TA TW
     x         numeric numeric    [1]      1 2007-09-07 08:50:58  0  1
     y          matrix numeric  [3x2]      6 2007-09-07 08:50:58  0  1
     z1           list    list  [[3]]      3 2007-09-07 08:50:58  0  1
     z2 POSIXt,POSIXct numeric    [1]      1 2007-09-07 08:50:58  0  1
     > # (TA="total accesses", TW="total writes")
     > ls(all=TRUE)
     [1] "x"  "y"  "z1" "z2"
     > track.stop()                    # Stop tracking
     > ls(all=TRUE)
     character(0)
     >
     > # Restart using the tracking dir -- the variables reappear
     > track.start("tmp1")             # Start using the tracking dir again
     > ls(all=TRUE)
     [1] "x"  "y"  "z1" "z2"
     > track.summary(size=F)
                 class    mode extent length            modified TA TW
     x         numeric numeric    [1]      1 2007-09-07 08:50:58  0  1
     y          matrix numeric  [3x2]      6 2007-09-07 08:50:58  0  1
     z1           list    list  [[3]]      3 2007-09-07 08:50:58  0  1
     z2 POSIXt,POSIXct numeric    [1]      1 2007-09-07 08:50:58  0  1
     > track.stop()
     >
     > # the files in the tracking directory:
     > list.files("tmp1", all=TRUE)
     [1] "."                    ".."
     [3] "filemap.txt"          ".trackingSummary.rda"
     [5] "x.rda"                "y.rda"
     [7] "z1.rda"               "z2.rda"
     >

     There are several points to note:

        * The global environment is the default environment for
          tracking - it is possible to track variables in other
          environments, but that environment must be supplied as an
          argument to the track functions.

        * Vars must be explicitly 'track()'ed - newly created objects
          are not tracked. (This is not a "feature", but there is
          currently no way of automatically tracking newly created
          objects - this is on the wishlist.) Thus, it is possible for
          variables in a tracked environment to be either tracked or
          untracked.
        * When tracking is stopped, all tracked variables are saved on
          disk and will no longer be accessible until tracking is
          started again.

        * The objects are stored each in their own file in the tracking
          dir, in the format used by 'save()'/'load()' (RData files).

List of basic functions and common calling patterns:

     Six functions cover the majority of common usage of the trackObjs
     package:

        * 'track.start(dir=...)': start tracking the global
          environment, with files saved in 'dir'

        * 'track.stop()': stop tracking (any unsaved tracked variables
          are saved to disk and all tracked variables become
          unavailable until tracking starts again)

        * 'track(x)': start tracking 'x' - 'x' in the global
          environment is replaced by an active binding and 'x' is
          saved in its corresponding file in the tracking directory
          and, if caching is on, in the tracking environment

        * 'track(x <- value)': start tracking 'x'

        * 'track(list=c('x', 'y'))': start tracking specified variables

        * 'track(all=TRUE)': start tracking all untracked variables in
          the global environment

        * 'untrack(x)': stop tracking variable 'x' - the R object 'x'
          is put back as an ordinary object in the global environment

        * 'untrack(all=TRUE)': stop tracking all variables in the
          global environment (but tracking is still set up)

        * 'untrack(list=...)': stop tracking specified variables

        * 'track.summary()': print a summary of the basic
          characteristics of tracked variables: name, class, extent,
          and creation, modification and access times.

        * 'track.remove(x)': completely remove all traces of 'x' from
          the global environment, tracking environment and tracking
          directory. Note that if variable 'x' in the global
          environment is tracked, 'remove(x)' will make 'x' an
          "orphaned" variable: 'remove(x)' will just remove the active
          binding from the global environment, and leave 'x' in the
          tracked environment and on file, and 'x' will reappear after
          restarting tracking.
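The "orphaned" variable behaviour noted for 'remove(x)' follows from how active bindings work in base R: removing a binding does not touch whatever storage stands behind it. The sketch below shows this with base R only. It is an illustration under assumptions, not trackObjs code: 'track_dir', 'tracked', 'x_file' and 'read_x' are hypothetical names, and re-creating the binding by hand merely stands in for restarting tracking.

```r
# Hypothetical stand-ins for a tracking directory and a tracked environment
track_dir <- tempfile("orphan-sketch")
dir.create(track_dir)
tracked <- new.env()
x_file <- file.path(track_dir, "x.rda")

# Store a value in save() format, one object per file
store <- new.env()
assign("x", 1:3, envir = store)
save(list = "x", file = x_file, envir = store)

# Hypothetical reader: load the file into a scratch environment and return x
read_x <- function() {
  scratch <- new.env()
  load(x_file, envir = scratch)
  get("x", envir = scratch)
}

# Expose the stored value through a read-only active binding
makeActiveBinding("x", function() read_x(), tracked)

# rm() removes only the binding; the file on disk is left in place
rm("x", envir = tracked)
binding_gone <- !exists("x", envir = tracked, inherits = FALSE)
file_kept <- file.exists(x_file)

# Re-creating the binding (a stand-in for restarting tracking) makes the
# value visible again
makeActiveBinding("x", function() read_x(), tracked)
restored <- get("x", envir = tracked)
```

This is why a separate function such as 'track.remove(x)' is needed to delete the stored copy as well as the binding.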
Complete list of functions and common calling patterns:

     The 'trackObjs' package provides many additional functions for
     controlling how tracking is performed (e.g., whether or not
     tracked variables are cached in memory), examining the state of
     tracking (show which variables are tracked, untracked, orphaned,
     masked, etc.) and repairing tracking environments and databases
     that have become inconsistent or incomplete (this may result from
     resource limitations, e.g., being unable to write a save file due
     to lack of disk space, or from manual tinkering, e.g., dropping a
     new save file into a tracking directory).

[truncated here -- see ?trackObjs]

-- Tony Plate

PS: to give credit where due, the end of ?trackObjs says:

References:

     Roger D. Peng. Interacting with data using the filehash package.
     R News, 6(4):19-24, October 2006.
     'http://cran.r-project.org/doc/Rnews' and
     'http://sandybox.typepad.com/software'

     David E. Brahm. Delayed data packages. R News, 2(3):11-12,
     December 2002. 'http://cran.r-project.org/doc/Rnews'

See Also:

     [...]

     Inspiration from the packages 'g.data' and 'filehash'.

_______________________________________________
R-packages mailing list
R-packages at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-packages