Zafer Barutcuoglu
2021-Jul-03 03:35 UTC
[Rd] Clearing attributes returns ALTREP, serialize still saves them
Hi all,

Clearing names/dimnames on vectors/matrices of length >= 64 returns an ALTREP wrapper which internally still contains the names/dimnames, and calling base::serialize on the result writes them out. They are unserialized in the same way, with the names/dimnames hidden in the ALTREP wrapper, so the problem is not obvious except in wasted time, bandwidth, or disk space.

Example:

v1 <- setNames(rnorm(64), paste("element name", 1:64))
v2 <- unname(v1)
names(v2)
# NULL
length(serialize(v1, NULL))
# [1] 2039
length(serialize(v2, NULL))
# [1] 2132
length(serialize(v2[TRUE], NULL))
# [1] 543

con <- rawConnection(raw(), "w")
serialize(v2, con)
v3 <- unserialize(rawConnectionValue(con))
names(v3)
# NULL
length(serialize(v3, NULL))
# 2132

# Similarly for matrices:
m1 <- matrix(rnorm(64), 8, 8, dimnames=list(paste("row name", 1:8), paste("col name", 1:8)))
m2 <- unname(m1)
dimnames(m2)
# NULL
length(serialize(m1, NULL))
# [1] 918
length(serialize(m2, NULL))
# [1] 1035
length(serialize(m2[TRUE, TRUE], NULL))
# 582

Previously discussed here, too:
https://r.789695.n4.nabble.com/Invisible-names-problem-td4764688.html

This happens with other attributes as well, but less predictably:

x1 <- structure(rnorm(100), data=rnorm(1000000))
x2 <- structure(x1, data=NULL)
length(serialize(x1, NULL))
# [1] 8000952
length(serialize(x2, NULL))
# [1] 924

x1b <- rnorm(100)
attr(x1b, "data") <- rnorm(1000000)
x2b <- x1b
attr(x2b, "data") <- NULL
length(serialize(x1b, NULL))
# [1] 8000863
length(serialize(x2b, NULL))
# [1] 8000956

This is pretty severe: it is hard to track down why serializing a small object kills the network when the cause is a large attribute that the object once had at some point during its lifetime around the codebase and that is still secretly tagging along.

Is there a plan to resolve this? Any suggestions for maybe a C++ workaround until then? Or an alternative performant serialization solution?

Best,
--
Zafer
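The v2[TRUE] measurement in the message above suggests a possible stopgap at the R level: subsetting appears to produce a plain copy without the wrapper. The following is only a minimal sketch under that assumption. The helper name drop_wrapper is hypothetical (it is not part of base R), and because `[` drops attributes other than names, dim and dimnames, the sketch is restricted to plain atomic vectors.

```r
# Hypothetical helper (not part of base R): materialize a plain copy of an
# atomic vector by subsetting, which is assumed here to discard any ALTREP
# wrapper. Note that `[` keeps names but drops most other attributes.
drop_wrapper <- function(x) {
  stopifnot(is.atomic(x), is.null(dim(x)))
  x[TRUE]
}

v1 <- setNames(rnorm(64), paste("element name", 1:64))
v2 <- unname(v1)
v2_plain <- drop_wrapper(v2)

# Compare serialized sizes; the exact byte counts depend on the R version.
size_wrapped <- length(serialize(v2, NULL))
size_plain   <- length(serialize(v2_plain, NULL))
cat("wrapped:", size_wrapped, "plain:", size_plain, "\n")
```

On an R version that shows the behaviour reported above, the plain copy should serialize to far fewer bytes; on other versions the two sizes may simply be equal.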
Gabriel Becker
2021-Jul-03 05:18 UTC
[Rd] Clearing attributes returns ALTREP, serialize still saves them
Hi all,

I don't have a solution yet, but a bit more here:

> .Internal(inspect(x2b))
@7f913826d590 14 REALSXP g0c0 [REF(1)] wrapper [srt=-2147483648,no_na=0]
  @7f9137500320 14 REALSXP g0c7 [REF(2),ATT] (len=100, tl=0) 0.45384,0.926371,0.838637,-1.71485,-0.719073,...
  ATTRIB:
    @7f913826dc20 02 LISTSXP g0c0 [REF(1)]
      TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(460)] "data"
      @7f9118310000 14 REALSXP g0c7 [REF(2)] (len=1000000, tl=0) 0.66682,0.480576,-1.13229,0.453313,-0.819498,...

> attr(x2b, "data") <- "small"
> .Internal(inspect(x2b))
@7f913826d590 14 REALSXP g0c0 [REF(1),ATT] wrapper [srt=-2147483648,no_na=0]
  @7f9137500320 14 REALSXP g0c7 [REF(2),ATT] (len=100, tl=0) 0.45384,0.926371,0.838637,-1.71485,-0.719073,...
  ATTRIB:
    @7f913826dc20 02 LISTSXP g0c0 [REF(1)]
      TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(461)] "data"
      @7f9118310000 14 REALSXP g0c7 [REF(2)] (len=1000000, tl=0) 0.66682,0.480576,-1.13229,0.453313,-0.819498,...
ATTRIB:
  @7f913826c870 02 LISTSXP g0c0 [REF(1)]
    TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(461)] "data"
    @7f9120580850 16 STRSXP g0c1 [REF(3)] (len=1, tl=0)
      @7f91205808c0 09 CHARSXP g0c1 [REF(3),gp=0x60] [ASCII] [cached] "small"

So we can see that the assignment of attr(x2b, "data") IS doing something, but it isn't doing the right thing. The fact that the above code assigned null instead of a value was hiding this.

I will dig into this more if someone doesn't get it fixed before me, but it won't be until after useR, because I'm preparing multiple talks for that and it is this coming week.

Best,
~G
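The inspect output in the message above marks the top-level object with the word "wrapper", which suggests a quick way to check whether an object is affected. This is a rough sketch only: the helper name is_wrapper is hypothetical, .Internal(inspect()) is undocumented, and its output format is an assumption that may not hold across R versions.

```r
# Hypothetical helper (not part of base R): report whether the top-level
# object is shown as an ALTREP "wrapper" by the internal inspect routine.
# Only the first output line is checked, since it describes the object itself.
is_wrapper <- function(x) {
  out <- utils::capture.output(.Internal(inspect(x)))
  length(out) > 0L && grepl("wrapper", out[1L], fixed = TRUE)
}

x <- rnorm(100)
attr(x, "data") <- rnorm(1000)
y <- x
attr(y, "data") <- NULL

plain_result <- is_wrapper(c(1, 2, 3))  # small fresh vector, not expected to be a wrapper
y_result     <- is_wrapper(y)           # result depends on the R version
print(c(plain = plain_result, y = y_result))
```

A TRUE result for y would indicate that clearing the attribute left a wrapper around the original vector, as in the inspect output shown in this message.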
luke-tierney at uiowa.edu
2021-Jul-03 13:40 UTC
[Rd] [External] Clearing attributes returns ALTREP, serialize still saves them
Please do not cross-post. You have already raised this on bugzilla. I will follow up there later today.

luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu