Henrik Bengtsson
2019-May-20 23:48 UTC
[Rd] WISH: Built-in R session-specific universally unique identifier (UUID)
# Proposal

Provide a built-in mechanism for obtaining an identifier for the current R session, e.g.

  > Sys.info()[["session_uuid"]]
  [1] "4258db4d-d4fb-46b3-a214-8c762b99a443"

The identifier should be "unique" in the sense that the probability for two R sessions(*) having the same identifier should be extremely small. There's no need for reproducibility, i.e. the algorithm for producing the identifier may be changed at any time.

(*) Two R sessions running at different times (seconds, minutes, days, years, ...) or on different machines (locally or anywhere in the world).

# Use cases

In parallel-processing workflows, R objects may be "exported" (serialized) to background R processes ("workers") for further processing. In other workflows, objects may be saved to file to be reloaded in a future R session. However, certain types of objects in R may only be relevant, or valid, in the R session that created them. Attempts to use them in other R processes may give an obscure error or, in the worst case, produce garbage results.

Having an identifier that is unique to each R process will make it possible to detect when an object is used in the wrong context. This can be done by attaching the session identifier to the object. For example,

  obj <- 42L
  attr(obj, "owner") <- Sys.info()[["session_uuid"]]

With this, it is easy to validate the "ownership" later:

  stopifnot(identical(attr(obj, "owner"), Sys.info()[["session_uuid"]]))

I argue that such an identifier should be part of base R for easy access and to avoid each developer having to roll their own.

# Possible implementation

One proposal would be to bring Simon Urbanek's 'uuid' package (https://cran.r-project.org/package=uuid) into base R. This package provides:

  > uuid::UUIDgenerate()
  [1] "b7de6182-c9c1-47a8-b5cd-e5c8307a8efb"

based on Theodore Ts'o's libuuid (https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/). From 'man uuid_generate':

"The uuid_generate function creates a new universally unique identifier (UUID). The uuid will be generated based on high-quality randomness from /dev/urandom, if available. If it is not available, then uuid_generate will use an alternative algorithm which uses the current time, the local ethernet MAC address (if available), and random data generated using a pseudo-random generator.
[...]
The UUID is 16 bytes (128 bits) long, which gives approximately 3.4x10^38 unique values (there are approximately 10^80 elementary particles in the universe according to Carl Sagan's Cosmos). The new UUID can reasonably be considered unique among all UUIDs created on the local system, and among UUIDs created on other systems in the past and in the future."

An alternative that does not require adding a dependency on the libuuid library would be to roll a poor man's version based on a set of semi-unique attributes, e.g.

  make_id <- function(...) {
    args <- list(...)
    saveRDS(args, file = f <- tempfile())
    on.exit(file.remove(f))
    unname(tools::md5sum(f))
  }

  session_id <- local({
    id <- NULL
    function() {
      if (is.null(id)) {
        id <<- make_id(
          info    = Sys.info(),
          pid     = Sys.getpid(),
          tempdir = tempdir(),
          time    = Sys.time(),
          random  = sample.int(.Machine$integer.max, size = 1L)
        )
      }
      id
    }
  })

Example:

  > session_id()
  [1] "8d00b17384e69e7c9ecee47e0426b2a5"

  > session_id()
  [1] "8d00b17384e69e7c9ecee47e0426b2a5"

/Henrik

PS. Having a built-in make_id() function would be handy too, e.g. when creating object-specific identifiers for other purposes.

PPS. It would be neat if there was an object, or connection, interface for tools::md5sum(), which currently only operates on files sitting on the file system. The digest package provides this functionality.
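The ownership check from the use-case section could be wrapped in two small helpers. This is only a sketch under assumptions: tag_owner() and is_owned() are hypothetical names that do not exist in base R, and session_id() here is a compact stand-in for the poor man's version above, not a proposed implementation.

```r
# Hypothetical stand-in for a built-in session identifier:
# computed once per R session, then cached in the closure.
session_id <- local({
  id <- NULL
  function() {
    if (is.null(id)) {
      f <- tempfile()
      on.exit(unlink(f))
      # Hash a handful of semi-unique attributes via a temporary file.
      saveRDS(list(Sys.info(), Sys.getpid(), tempdir(), Sys.time(),
                   sample.int(.Machine$integer.max, size = 1L)), file = f)
      id <<- unname(tools::md5sum(f))
    }
    id
  }
})

# Hypothetical helpers: stamp an object with the current session's id,
# and check later whether the stamp matches the current session.
tag_owner <- function(x) {
  attr(x, "owner") <- session_id()
  x
}
is_owned <- function(x) identical(attr(x, "owner"), session_id())

obj <- tag_owner(42L)
is_owned(obj)      # TRUE: created in this session

# Simulate an object that was stamped by some other session.
foreign <- obj
attr(foreign, "owner") <- "not-this-session"
is_owned(foreign)  # FALSE
```

An object that is serialized to a worker keeps its "owner" attribute, so the same is_owned() call on the worker would compare against the worker's own identifier and fail.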
William Dunlap
2019-May-21 00:42 UTC
[Rd] WISH: Built-in R session-specific universally unique identifier (UUID)
I think a machine-specific input, like the MAC address, to the UUID is essential. S+ used to make a seed for the random number generator based on the current time and process ID. A customer complained that all machines in his cluster generated the same random number stream. The machines were rebooted each night, simultaneously, and S+ was started during the boot process, so times and process IDs were identical, hence the seeds were identical.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
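The failure mode described in this reply can be reproduced in a few lines. The sketch below is illustrative only: id_from() is a hypothetical helper, the boot time, PID, and node names are made-up values, and a host name stands in for a machine-specific input such as the MAC address, which base R does not expose directly.

```r
# Hypothetical helper: hash a set of inputs via a temporary file,
# since tools::md5sum() operates on files.
id_from <- function(...) {
  f <- tempfile()
  on.exit(unlink(f))
  writeLines(paste(..., sep = "|"), f)
  unname(tools::md5sum(f))
}

# Two machines booted at the same moment, same PID at startup (made-up values).
boot_time <- "2019-05-21 03:00:00"
pid <- 1234L

a <- id_from(boot_time, pid)
b <- id_from(boot_time, pid)
identical(a, b)    # TRUE: time and PID alone collide

# Adding a machine-specific input separates the two.
a2 <- id_from(boot_time, pid, "node-a")
b2 <- id_from(boot_time, pid, "node-b")
identical(a2, b2)  # FALSE
```

The session_id() sketch in the original proposal already passes Sys.info(), which includes the node name, so it would differ across hosts as long as the host names differ.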