Hi,

Please see below for a post on r-help regarding data.frame() and the possibility of dropping rownames, for space and time reasons. I've made some changes, attached, and they seem to be working well. I see the expected space (90% saved) and time (10 times faster) savings. There are no doubt some bugs, and it needs more work and testing, but I thought I would post first at this stage.

Could some changes along these lines be made to R? I'm happy to help with testing and further work if required. In the meantime I can work with overloaded functions, which fixes the problems in my case.

Functions affected:

    dim.data.frame
    format.data.frame
    print.data.frame
    data.frame
    [.data.frame
    as.matrix.data.frame

Modified source code attached.

Regards,
Matthew


-----Original Message-----
From: Matthew Dowle
Sent: 09 December 2005 09:44
To: 'Peter Dalgaard'
Cc: 'r-help at stat.math.ethz.ch'
Subject: RE: [R] data.frame() size

That explains it. Thanks. I don't need rownames though, as I'll only ever use integer subscripts. Is there any way to drop them, or even better not create them in the first place? The memory saved (90%) by not having them and the 10 times speed-up would be very useful. I think I need a data.frame rather than a matrix because I have columns of different types in real life.

> rownames(d) = NULL
Error in "dimnames<-.data.frame"(`*tmp*`, value = list(NULL, c("a", "b" :
        invalid 'dimnames' given for data frame


-----Original Message-----
From: pd at pubhealth.ku.dk [mailto:pd at pubhealth.ku.dk] On Behalf Of Peter Dalgaard
Sent: 08 December 2005 18:57
To: Matthew Dowle
Cc: 'r-help at stat.math.ethz.ch'
Subject: Re: [R] data.frame() size

Matthew Dowle <mdowle at concordiafunds.com> writes:

> Hi,
>
> In the example below why is d 10 times bigger than m, according to
> object.size ? It also takes around 10 times as long to create, which
> fits with object.size() being truthful. gcinfo(TRUE) also indicates a
> great deal more garbage collector activity caused by data.frame() than
> matrix().
>
> $ R --vanilla
> ....
> > nr = 1000000
> > system.time(m<<-matrix(integer(1), nrow=nr, ncol=2))
> [1] 0.22 0.01 0.23 0.00 0.00
> > system.time(d<<-data.frame(a=integer(nr), b=integer(nr)))
> [1] 2.81 0.20 3.01 0.00 0.00        # 10 times longer
>
> > dim(m)
> [1] 1000000       2
> > dim(d)
> [1] 1000000       2                 # same dimensions
>
> > storage.mode(m)
> [1] "integer"
> > sapply(d, storage.mode)
>         a         b
> "integer" "integer"                 # same storage.mode
>
> > object.size(m)/1024^2
> [1] 7.629616
> > object.size(d)/1024^2
> [1] 76.29482                        # but 10 times bigger
>
> > sum(sapply(d, object.size))/1024^2
> [1] 7.629501                        # or is it ? If it's not
> really 10 times bigger, why 10 times longer above ?

Row names!!

> r <- as.character(1:1e6)
> object.size(r)
[1] 72000056
> object.size(r)/1024^2
[1] 68.6646

'nuff said?

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
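
Peter's point can be checked with a small sketch. It compares the storage needed for n integers with the storage needed for the same values held as character strings, which is how the row names were being stored in the session above. The variable names are illustrative only, n is reduced to 1e5 to keep it quick, and the exact byte counts depend on the R version and platform, so none are shown here.

```r
# Sketch: size of n integers versus the same values as character strings.
# Assumption: n is smaller than in the thread, purely to keep this fast.
n <- 1e5

ints      <- integer(n)                 # one integer column of length n
names_chr <- as.character(seq_len(n))   # "1", "2", ..., as row names were stored

size_ints <- as.numeric(object.size(ints))
size_chr  <- as.numeric(object.size(names_chr))

cat("integer column  :", size_ints, "bytes\n")
cat("character names :", size_chr, "bytes\n")
cat("ratio           :", round(size_chr / size_ints, 1), "\n")
```

If the ratio printed is well above 1, the character form of 1:n is what dominates, which is consistent with the 68.6646 MB reported for r above.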
There was nothing attached in the copy that came through to me.

By the way, there was some discussion earlier this year on a light-weight data.frame class, but I don't think anyone ever posted any code.

On 12/9/05, Matthew Dowle <mdowle at concordiafunds.com> wrote:
> Hi,
>
> Please see below for post on r-help regarding data.frame() and the
> possibility of dropping rownames, for space and time reasons.
> I've made some changes, attached, and it seems to be working well.
I believe Gabor was referring to this:

http://tolstoy.newcastle.edu.au/R/devel/05/05/0837.html

Andy

From: Hin-Tak Leung
>
> Gabor Grothendieck wrote:
> > There was nothing attached in the copy that came through
> > to me.
>
> I would like to see that patch also.
>
> > By the way, there was some discussion earlier this year
> > on a light-weight data.frame class but I don't think anyone
> > ever posted any code.
>
> It may have been me. I am working on a bit-packed data.frame
> which only uses 2 bits per unit of data, so it is 4 units per RAWSXP
> (work in progress, nothing to show).
>
> So I am very interested to see the patch.
>
> Yes, I took a couple of weeks reading/learning where all the
> memory has gone in data.frame. The rowname/column name allocation is
> a bit stupid. Each rowname and each column name is a full
> R object, so there is a 32 (or 28) byte overhead just from managing
> that, before the STRSXP for the actual string, which is
> another X bytes.
> So for a 1 x N data.frame with integers for content, the
> content is 4 bytes * N, but the rowname/columnname is 32 * N -ish
> (a 9x increase). A word is 32 bits on most people's machines, and
> I am counting the extra one needed because you have to keep the address
> of each SEXPREC somewhere, so it is 7+1 = 8, if I understand
> it correctly.
>
> Here is the relevant comment, quoted verbatim from around line 225 of
> "src/include/Rinternals.h":
>
> /* The generational collector uses a reduced version of SEXPREC as a
>    header in vector nodes.  The layout MUST be kept consistent with
>    the SEXPREC definition.  The standard SEXPREC takes up 7 words on
>    most hardware; this reduced version should take up only 6 words.
>    In addition to slightly reducing memory use, this can lead to more
>    favorable data alignment on 32-bit architectures like the Intel
>    Pentium III where odd word alignment of doubles is allowed but much
>    less efficient than even word alignment. */
>
> Hin-Tak Leung
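
The arithmetic in Hin-Tak's message can be written out as a short worked example. This is only a sketch of the estimate as stated there: it assumes a 32-bit word, 7 words per SEXPREC plus 1 word for the pointer to it, and it leaves out the "another X bytes" for the string itself. All variable names are made up for illustration.

```r
# Worked version of the estimate quoted above (assumptions as in the message).
word_bytes    <- 4        # 32-bit word
header_words  <- 7 + 1    # 7-word SEXPREC plus 1 word to hold its address
n             <- 1e6      # number of rows

content_bytes <- 4 * n                          # one integer per row
rowname_bytes <- header_words * word_bytes * n  # 32 bytes per row name, string excluded
total_bytes   <- content_bytes + rowname_bytes

ratio <- total_bytes / content_bytes
cat("per-row overhead :", header_words * word_bytes, "bytes\n")
cat("total / content  :", ratio, "\n")
```

With these numbers the per-row overhead is 32 bytes and the total comes to 9 times the content, which matches the "9x increase" in the message. The string bytes would push the real figure higher.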
I guess the mailing list precludes attachments then, which makes sense. I have sent the modified source directly to anyone who has asked.

I had a look at the light-weight data.frame class post (http://tolstoy.newcastle.edu.au/R/devel/05/05/0837.html):

> Now the transcript itself:
> # the motivation: subscription of a data.frame is *much* (almost 20 times) slower than that of a list
> # compare
> n = 1e6
> i = seq(n)
> x = data.frame(a=seq(n), b=seq(n))
> system.time(x[i,], gcFirst=TRUE)
[1] 1.01 0.14 1.14 0.00 0.00
>
> x = list(a=seq(n), b=seq(n))
> system.time(lapply(x, function(col) col[i]), gcFirst=TRUE)
[1] 0.06 0.00 0.06 0.00 0.00
>
> # the solution: define methods for the light-weight data.frame class
> lwdf = function(...) structure(list(...), class = "lwdf")
> ...

But if I have understood correctly, I think the time difference here is just down to the rownames. The rownames are 1:n stored in character form. They take the most time and space in this example, but are never used. In fact, I'm not sure why 1:n in character form would ever be useful. Running the example above with my modifications appears to fix the problem, i.e. the time difference becomes negligible. I needed to make a one-line change to [.data.frame, and I've sent that to anyone who requested the code.

I can see the problem:

> apropos("data.frame")
 [1] "[.data.frame"                  "as.matrix.data.frame"          "data.frame"                    "dim.data.frame"
 [5] "format.data.frame"             "print.data.frame"              ".__C__data.frame"              "aggregate.data.frame"
 [9] "$<-.data.frame"                "Math.data.frame"               "Ops.data.frame"                "Summary.data.frame"
[13] "[.data.frame"                  "[<-.data.frame"                "[[.data.frame"                 "[[<-.data.frame"
[17] "as.data.frame"                 "as.data.frame.AsIs"            "as.data.frame.Date"            "as.data.frame.POSIXct"
[21] "as.data.frame.POSIXlt"         "as.data.frame.array"           "as.data.frame.character"       "as.data.frame.complex"
[25] "as.data.frame.data.frame"      "as.data.frame.default"         "as.data.frame.factor"          "as.data.frame.integer"
[29] "as.data.frame.list"            "as.data.frame.logical"         "as.data.frame.matrix"          "as.data.frame.model.matrix"
[33] "as.data.frame.numeric"         "as.data.frame.ordered"         "as.data.frame.package_version" "as.data.frame.raw"
[37] "as.data.frame.table"           "as.data.frame.ts"              "as.data.frame.vector"          "as.list.data.frame"
[41] "as.matrix.data.frame"          "by.data.frame"                 "cbind.data.frame"              "data.frame"
[45] "dim.data.frame"                "dimnames.data.frame"           "dimnames<-.data.frame"         "duplicated.data.frame"
[49] "format.data.frame"             "is.data.frame"                 "is.na.data.frame"              "mean.data.frame"
[53] "merge.data.frame"              "print.data.frame"              "rbind.data.frame"              "row.names.data.frame"
[57] "row.names<-.data.frame"        "rowsum.data.frame"             "split.data.frame"              "split<-.data.frame"
[61] "stack.data.frame"              "subset.data.frame"             "summary.data.frame"            "t.data.frame"
[65] "transform.data.frame"          "unique.data.frame"             "unstack.data.frame"            "xpdrows.data.frame"
>

But I think the changes would be quick to make. Is anything else affected? Do any test suites exist to confirm R hasn't broken?

On the face of it, allowing data frames to have null row names seems a small change, and would make them consistent with matrices, with large time and space benefits. However, I can see the argument for a new class instead, for safety. What's the consensus?
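
To make the "new class" option concrete, here is a minimal sketch of what a light-weight data frame without row names might look like. Only the lwdf constructor comes from the quoted post. The dim.lwdf and "[.lwdf" methods are hypothetical illustrations written for this example, not the code from that post and not the modifications described above.

```r
# Constructor as given in the quoted post: a plain list of columns with a class.
lwdf <- function(...) structure(list(...), class = "lwdf")

# Hypothetical method: dimensions taken from the columns, no row names needed.
dim.lwdf <- function(x) {
  cols <- unclass(x)
  c(if (length(cols)) length(cols[[1L]]) else 0L, length(cols))
}

# Hypothetical method: x[i, j] subsets each column directly, as in the
# lapply() timing above. Note that x[i] with a single index selects rows
# here, which differs from data.frame behaviour.
"[.lwdf" <- function(x, i, j) {
  cols <- unclass(x)
  if (!missing(j)) cols <- cols[j]
  if (!missing(i)) cols <- lapply(cols, function(col) col[i])
  structure(cols, class = "lwdf")
}

# Columns of different types, which is the reason a matrix will not do.
x <- lwdf(a = seq_len(10), b = letters[1:10])
y <- x[c(2, 4), ]
print(dim(y))
print(unclass(y)$b)
```

The design choice is the one discussed in the thread: no row names are ever created, so integer subscripts are the only way to address rows, and the existing data.frame methods are left untouched.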

-----Original Message-----
From: Hin-Tak Leung [mailto:hin-tak.leung at cimr.cam.ac.uk]
Sent: 09 December 2005 18:41
To: Gabor Grothendieck
Cc: Matthew Dowle; r-devel at r-project.org; Peter Dalgaard
Subject: Re: [Rd] [R] data.frame() size