Henrik Bengtsson
2007-Mar-07 22:24 UTC
[Rd] Small inconsistency in serialize() between R versions and implications on digest()
Hi,

I noticed that serialize() gives different results depending on the R version, which has implications for the digest() function in the digest package. Note that it does give the same output across platforms. I know that serialize() is under development, but is this expected? For example, is there some kind of header in the result that specifies "who" generated the stream, and if so, exactly which bytes are they?

SETUP:

R versions:
A) R v2.4.0 (2006-10-03)
B) R v2.4.1pat (2007-01-13 r40470)
C) R v2.5.0dev (2006-12-12 r40167)

This is on WinXP, and I start R with Rterm --vanilla.

Example: Identical serialize() calls using the different R versions.

> raw <- serialize(1, connection=NULL, ascii=TRUE)
> print(raw)

gives:

(A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
(B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
(C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a

Note the difference in raw bytes 8 to 10, i.e.

> raw[7:11]
(A): [1] 32 30 39 36 0a
(B): [1] 32 30 39 37 0a
(C): [1] 32 33 35 32 0a

Do bytes 8, 9 and 10 in the raw vector somehow contain information about the R version or similar? The following poor man's test suggests that this is the only difference:

On all R versions, the following gives identical results:

> raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
> raw <- as.integer(raw[-c(8:10)])
> sum(raw)
[1] 2147884
> sum(log(raw))
[1] 177201.2

If it is true that there is an R-version-specific header in serialized objects, then the digest() function should exclude that header in order to produce consistent results across R versions, because currently digest(1) gives different results.

Thank you

Henrik
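One way to check whether bytes 8 to 10 encode the R version is to split the ASCII stream on newlines and decode the third and fourth fields as packed version numbers. The sketch below assumes the packing major*65536 + minor*256 + patch, which matches the values reported above. The helpers hexToRaw() and decodeVersion() are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: turn a space-separated hex dump into a raw vector.
hexToRaw <- function(s) {
  as.raw(strtoi(strsplit(s, " ", fixed = TRUE)[[1]], base = 16L))
}

# Hypothetical helper: decode a version packed as
# major*65536 + minor*256 + patch (assumed encoding).
decodeVersion <- function(v) {
  v <- as.integer(v)
  paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep = ".")
}

# Bytes reported in the post for (A), (B) and (C).
streams <- list(
  A = hexToRaw("41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a"),
  B = hexToRaw("41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a"),
  C = hexToRaw("41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a")
)

# Split each stream into its newline-terminated fields.
fields <- lapply(streams, function(r) strsplit(rawToChar(r), "\n", fixed = TRUE)[[1]])

# Field 3 differs between the streams, field 4 is constant.
writer <- sapply(fields, function(f) decodeVersion(f[3]))
reader <- sapply(fields, function(f) decodeVersion(f[4]))
print(writer)
print(reader)
```

Under that assumption, the third field decodes to 2.4.0, 2.4.1 and 2.5.0 for (A), (B) and (C), which agrees with the R versions listed in the setup, while the fourth field decodes to 2.3.0 in all three streams.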
Henrik Bengtsson
2007-Mar-08 04:11 UTC
[Rd] Small inconsistency in serialize() between R versions and implications on digest()
To follow up, I went ahead and generated "random" objects to scan for a common header for a given R version. It seems that at most the first 18 bytes are not data specific, which could be the length of the serialization header. Here is my code for this:

scanSerialize <- function(object, hdr=NULL, ...) {
  # Serialize object
  raw <- serialize(object, connection=NULL, ascii=TRUE);

  # First run?
  if (is.null(hdr))
    return(raw);

  # Compare only the bytes present in both vectors, so that a short
  # serialization cannot introduce NAs into the comparison
  n <- min(length(hdr), length(raw));
  hdr <- hdr[seq_len(n)];
  diffs <- (as.integer(hdr) != as.integer(raw[seq_len(n)]));

  # No differences?
  if (!any(diffs))
    return(hdr);

  # Position of first difference
  idx <- which(diffs)[1];

  # Keep common header
  hdr <- hdr[seq_len(idx-1)];

  hdr;
};

# Serialize a first "random" object
hdr <- scanSerialize(NA);

# Integers
for (kk in 1:100)
  hdr <- scanSerialize(kk, hdr=hdr);

# Character vectors of random length (1 to 100)
for (kk in 1:100) {
  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
  hdr <- scanSerialize(x, hdr=hdr);
}

# Integers and raw vectors (the current header itself)
for (kk in 1:100) {
  hdr <- scanSerialize(kk, hdr=hdr);
  hdr <- scanSerialize(hdr, hdr=hdr);
}

cat("Length:", length(hdr), "\n");
print(hdr);
print(rawToChar(hdr));

On R v2.5.0 devel, this gives:

Length: 18
 [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
[1] "A\n2\n132352\n131840\n"

However, it would still be good to get an "official" statement from someone on the R-core team about the serialization header and where the data section starts. Again, I want to cut out as much as possible for consistency between R versions without losing data-dependent bytes.

Thanks

/Henrik
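If the 18 bytes found above are indeed the header, it can be cut at a newline boundary rather than at a fixed byte offset. This still works if the version field changes width. The sketch below assumes an ASCII stream with four newline-terminated header fields, as observed in this thread. stripAsciiHeader() and hexToRaw() are hypothetical helpers, not part of R or the digest package.

```r
# Hypothetical helper: turn a space-separated hex dump into a raw vector.
hexToRaw <- function(s) {
  as.raw(strtoi(strsplit(s, " ", fixed = TRUE)[[1]], base = 16L))
}

# Hypothetical helper: drop the first 'nfields' newline-terminated fields
# of an ASCII serialization stream (assumed layout: "A", format version,
# writer R version, minimal reader R version).
stripAsciiHeader <- function(raw, nfields = 4L) {
  nl <- which(raw == as.raw(0x0a))   # positions of the newline bytes
  stopifnot(length(nl) >= nfields)
  raw[-seq_len(nl[nfields])]         # keep everything after the header
}

# Streams reported in the thread for R v2.4.0 (A) and R v2.5.0dev (C).
rawA <- hexToRaw("41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a")
rawC <- hexToRaw("41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a")

payloadA <- stripAsciiHeader(rawA)
payloadC <- stripAsciiHeader(rawC)

print(identical(rawA, rawC))          # full streams differ
print(identical(payloadA, payloadC))  # payloads agree
print(rawToChar(payloadC))
```

The stripped payload could then be hashed directly, for example by passing it to digest() with serialize=FALSE. Note that the four-field count is an assumption tied to the format seen here. A stream with a different header layout would need a different nfields value.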