thr3ads.net - R devel - [Rd] Small inconsistency in serialize() between R versions and implications on digest() [Mar 2007]

Henrik Bengtsson

2007-Mar-07 22:24 UTC

[Rd] Small inconsistency in serialize() between R versions and implications on digest()

Hi,

I noticed that serialize() gives different results depending on the R
version, which has implications for the digest() function in the digest
package.  Note that it does give the same output across platforms.  I know
that serialize() is under development, but is this expected?  For example,
is there some kind of header in the result that specifies "who" generated
the stream, and if so, exactly which bytes are they?

SETUP:

R versions:
A) R v2.4.0 (2006-10-03)
B) R v2.4.1pat (2007-01-13 r40470)
C) R v2.5.0dev (2006-12-12 r40167)

This is on WinXP and I start R with Rterm --vanilla.

Example: Identical serialize() calls using the different R versions.
> raw <- serialize(1, connection=NULL, ascii=TRUE)
> print(raw)
gives:

(A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
0a 31 0a 31 0a
(B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
0a 31 0a 31 0a
(C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
0a 31 0a 31 0a

Note the difference in raw bytes 8 to 10, i.e.
> raw[7:11]
(A): [1] 32 30 39 36 0a
(B): [1] 32 30 39 37 0a
(C): [1] 32 33 35 32 0a
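
The differing positions can also be located programmatically rather than by eye. A minimal sketch, using only the three byte dumps printed above; hexToRaw() is a hypothetical helper introduced here for illustration, not a base R function:

```r
# hexToRaw() is a hypothetical helper: parse a space-separated hex dump
# into a raw vector.
hexToRaw <- function(s) {
  as.raw(strtoi(strsplit(s, " ", fixed = TRUE)[[1]], base = 16L))
}

# Byte dumps copied from the serialize(1, ...) output above
rawA <- hexToRaw("41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a")
rawB <- hexToRaw("41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a")
rawC <- hexToRaw("41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a")

# Positions where the streams disagree
diffAB <- which(rawA != rawB)
diffAC <- which(rawA != rawC)

# All positions that differ in at least one comparison
diffAll <- sort(union(diffAB, diffAC))
print(diffAll)
```

For these three dumps, only positions 8 to 10 show up, and (A) versus (B) differ in byte 10 alone.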

Do bytes 8, 9 and 10 in the raw vector somehow contain information
about the R version or similar?  The following poor man's test suggests
that they are the only difference:

On all R versions, the following gives identical results:
> raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
> raw <- as.integer(raw[-c(8:10)])
> sum(raw)
[1] 2147884
> sum(log(raw))
[1] 177201.2

If it is true that there is an R-version-specific header in serialized
objects, then the digest() function should exclude that header in
order to produce consistent results across R versions, because
digest(1) currently gives different results.

Thank you

Henrik

Henrik Bengtsson

2007-Mar-08 04:11 UTC

To follow up, I went ahead and generated "random" objects to scan for a
common header for a given R version, and it seems that at most
the first 18 bytes are independent of the data, which could be the length
of the serialization header.

Here is my code for this:

scanSerialize <- function(object, hdr=NULL, ...) {
  # Serialize object
  raw <- serialize(object, connection=NULL, ascii=TRUE);

  # First run?
  if (is.null(hdr))
    return(raw);

  # Find differences between current longest header and new raw vector
  n <- length(hdr);
  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));

  # No differences?
  if (!any(diffs))
    return(hdr);

  # Position of first difference
  idx <- which(diffs)[1];

  # Keep common header
  hdr <- hdr[seq_len(idx-1)];

  hdr;
};

# Serialize a first "random" object
hdr <- scanSerialize(NA);
for (kk in 1:100)
  hdr <- scanSerialize(kk, hdr=hdr);
for (kk in 1:100) {
  # Draw a single random length in 1..100 (size must be a scalar)
  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
  hdr <- scanSerialize(x, hdr=hdr);
}
for (kk in 1:100) {
  hdr <- scanSerialize(kk, hdr=hdr);
  hdr <- scanSerialize(hdr, hdr=hdr);
}

cat("Length:", length(hdr), "\n");
print(hdr);
print(rawToChar(hdr));

On R v2.5.0 devel, this gives:
Length: 18
 [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
[1] "A\n2\n132352\n131840\n"
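
The header string above splits into four newline-terminated fields. A minimal sketch of pulling them apart follows. It assumes, without confirmation in this thread, that the two six-digit fields are version numbers packed as major*65536 + minor*256 + patch; that packing reproduces 2.4.0, 2.4.1 and 2.5.0 from the values 132096, 132097 and 132352 reported earlier. unpackVersion() is a hypothetical helper name.

```r
# Header string as printed above for R v2.5.0 devel
hdr <- "A\n2\n132352\n131840\n"

# Split into its newline-terminated fields: "A" "2" "132352" "131840"
fields <- strsplit(hdr, "\n", fixed = TRUE)[[1]]

# ASSUMPTION: a version is packed as major*65536 + minor*256 + patch
unpackVersion <- function(v) {
  v <- as.integer(v)
  paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep = ".")
}

writer <- unpackVersion(fields[3])
other  <- unpackVersion(fields[4])
print(c(writer = writer, other = other))
```

Under that assumption, the third field decodes to the version of R that wrote the stream, and the fourth field (identical in all three R versions tested) decodes to 2.3.0.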

However, it would still be good to get an "official" statement from
someone on the R core team about the serialization header and where the
data section starts.  Again, I want to cut out as much as possible for
consistency across R versions without losing data-dependent bytes.
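
As a rough illustration of what cutting the header could look like, here is a minimal sketch. It assumes the ASCII header is exactly the first four newline-terminated fields (the 18 bytes found above), which is an observation from this scan and not an official statement. dropHeader() and hexToRaw() are hypothetical helper names.

```r
# hexToRaw() is a hypothetical helper: parse a space-separated hex dump
hexToRaw <- function(s) {
  as.raw(strtoi(strsplit(s, " ", fixed = TRUE)[[1]], base = 16L))
}

# ASSUMPTION: the header consists of the first `nfields` newline-terminated
# fields of an ascii=TRUE serialization; everything after it is data.
dropHeader <- function(raw, nfields = 4L) {
  nl <- which(raw == as.raw(0x0a))   # positions of newline bytes
  raw[-seq_len(nl[nfields])]         # drop everything up to the nfields-th
}

# serialize(1, ...) output from R v2.4.0 (A) and R v2.5.0dev (C)
rawA <- hexToRaw("41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a")
rawC <- hexToRaw("41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a")

sameFull <- identical(rawA, rawC)
sameData <- identical(dropHeader(rawA), dropHeader(rawC))
print(c(full = sameFull, data = sameData))
```

With the header removed, the two streams are byte-for-byte identical, so a hash computed on the remaining bytes would agree across these versions.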

Thanks

/Henrik
