C W
2017-Oct-20 19:51 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
Thank you for your responses. I guess I don't feel alone. I don't find that the documentation goes into any detail.

I also find it surprising that

> object.size(train$data)
1730904 bytes

> object.size(as.matrix(train$data))
6575016 bytes

the dgCMatrix actually takes less memory, though it *looks* like the opposite.

Cheers!

On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsemius at comcast.net> wrote:

>> On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
>>
>> Dear R list,
>>
>> I came across dgCMatrix. I believe this class is associated with sparse
>> matrices.
>
> Yes. See:
>
>     help('dgCMatrix-class', pack=Matrix)
>
> If Martin Maechler happens to respond to this you should listen to him
> rather than anything I write. Much of what the Matrix package does appears
> to be magical to one such as I.
>
>> I see there are 8 attributes to train$data. I am confused why there are
>> so many; some are vectors. What do they do?
>>
>> Here's the R code:
>>
>> library(xgboost)
>> data(agaricus.train, package='xgboost')
>> data(agaricus.test, package='xgboost')
>> train <- agaricus.train
>> test <- agaricus.test
>> attributes(train$data)
>
> I got a bit of an annoying surprise when I did something similar. It
> appeared to me that I did not need to load the xgboost library, since all
> that was being asked was "where is the data" in an object that could be
> loaded from that package using the `data` function. The last command, asking
> for the attributes, filled up my console with a 100K-length vector (actually
> 2 such vectors). The `str` function returns a more useful result.
>
> > data(agaricus.train, package='xgboost')
> > train <- agaricus.train
> > names( attributes(train$data) )
> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"  "class"
> > str(train$data)
> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>   ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>   ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
>   ..@ Dim     : int [1:2] 6513 126
>   ..@ Dimnames:List of 2
>   .. ..$ : NULL
>   .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>   ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>   ..@ factors : list()
>
>> Where is the data: is it in $p, $i, or $x?
>
> So the "data" (meaning the values of the sparse matrix) are in the @x
> leaf. The values all appear to be the number 1. The @i leaf is the sequence
> of row locations for the value entries, while the @p items are somehow
> connected with the columns (I think, since 127 and 126 = number of columns
> from the @Dim leaf are only off by 1).
>
> Doing this:
>
> > colSums(as.matrix(train$data))
>      cap-shape=bell   cap-shape=conical
>                 369                   3
>    cap-shape=convex      cap-shape=flat
>                2934                2539
>   cap-shape=knobbed    cap-shape=sunken
>                 644                  24
> cap-surface=fibrous cap-surface=grooves
>                1867                   4
>   cap-surface=scaly  cap-surface=smooth
>                2607                2035
>     cap-color=brown      cap-color=buff
>                1816
> # now snipping the rest of that output.
>
> Now this makes me think that the @p vector gives you the cumulative sum of
> the number of items per column:
>
> > all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
> [1] TRUE
>
>> Thank you very much!
>
> Please read the Posting Guide. Your code was not mangled in this instance,
> but HTML code often arrives in an unreadable mess.
>
> David Winsemius
> Alameda, CA, USA
>
> 'Any technology distinguishable from magic is insufficiently advanced.'
>   -Gehm's Corollary to Clarke's Third Law
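[A small worked example of the compressed-sparse-column (CSC) layout that David infers above. The tiny matrix and its values are made up for illustration; only the Matrix package is assumed. Note that his cumsum(colSums(...)) check matches @p only because every stored value in the agaricus data happens to be 1; in general @p counts stored entries rather than summing them.]

    library(Matrix)

    m <- sparseMatrix(i = c(1, 3, 1, 2),      # row positions (1-based here)
                      j = c(1, 1, 2, 3),      # column positions
                      x = c(10, 20, 30, 40),  # the stored values
                      dims = c(3, 3))
    class(m)   # "dgCMatrix"

    m@x        # the nonzero values:  10 20 30 40
    m@i        # 0-based row indices:  0  2  0  1
    m@p        # column "pointers": cumulative count of stored entries
               # per column:           0  2  3  4

    ## so the number of stored entries in column k is diff(m@p)[k]:
    all(diff(m@p) == colSums(m != 0))   # TRUE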
C W
2017-Oct-20 20:01 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
Subsetting with [] vs. head() gives different results. R code:

> head(train$data, 5)
[1] 0 0 1 0 0

> train$data[1:5, 1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
     cap-shape=bell cap-shape=conical cap-shape=convex
[1,]              .                 .                1
[2,]              .                 .                1
[3,]              1                 .                .
[4,]              .                 .                1
[5,]              .                 .                1
     cap-shape=flat cap-shape=knobbed
[1,]              .                 .
[2,]              .                 .
[3,]              .                 .
[4,]              .                 .
[5,]              .                 .

On Fri, Oct 20, 2017 at 3:51 PM, C W <tmrsg11 at gmail.com> wrote:

> [...]
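[A minimal sketch of how "[" behaves on a dgCMatrix once Matrix is attached, assuming the agaricus data loaded as above. Two-index subsetting keeps the sparse representation; extracting a single column drops to a plain numeric vector unless drop = FALSE is given.]

    library(Matrix)
    data(agaricus.train, package = "xgboost")
    train <- agaricus.train

    sub <- train$data[1:5, 1:5]
    class(sub)                              # still "dgCMatrix"

    col1  <- train$data[, 1]                # drops to a plain numeric vector
    col1s <- train$data[, 1, drop = FALSE]  # stays a 1-column sparse matrix
    c(class(col1), class(col1s))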
Martin Maechler
2017-Oct-21 14:50 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
>>>>> C W <tmrsg11 at gmail.com>
>>>>>     on Fri, 20 Oct 2017 15:51:16 -0400 writes:

    > Thank you for your responses. I guess I don't feel
    > alone. I don't find the documentation go into any detail.

    > I also find it surprising that,

    >> object.size(train$data)
    > 1730904 bytes

    >> object.size(as.matrix(train$data))
    > 6575016 bytes

    > the dgCMatrix actually takes less memory, though it
    > *looks* like the opposite.

to whom?

The whole idea of these sparse matrix classes in the 'Matrix'
package (and everywhere else in applied math, CS, ...) is that

1. they need much less memory, and
2. matrix arithmetic with them can be much faster, because it is based on
   sophisticated sparse matrix linear algebra, notably the sparse Cholesky
   decomposition for solve() etc.

Of course the efficiency only applies if most of the matrix entries _are_ 0.
You can measure the "sparsity", or rather the "density", of a matrix by

    nnzero(A) / length(A)

where length(A) == nrow(A) * ncol(A), as for regular matrices
(but it does *not* integer overflow), and nnzero(.) is a simple utility
from Matrix which -- very efficiently for sparseMatrix objects -- gives
the number of nonzero entries of the matrix.

All of these classes are formally defined classes and therefore have help
pages: here ?dgCMatrix-class, which then points to ?CsparseMatrix-class
(and I forget if RStudio really helps you find these; in emacs ESS they
are found nicely via the usual key).

To get started, you may further look at ?Matrix _and_ ?sparseMatrix
(and possibly the Matrix package vignettes --- though they need work --
I'm happy for collaborators there!)

Bill Dunlap's comment applies indeed:
in principle all these matrices should work like regular numeric matrices,
just faster and with a smaller memory footprint if they are really sparse
(and not just formally of a sparseMatrix class)
((and there are quite a few more niceties in the package)).

Martin Maechler
(here, maintainer of 'Matrix')

    >> [...]
    >> So the "data" (meaning the values of the sparse matrix) are in the @x
    >> leaf. The values all appear to be the number 1. The @i leaf is the
    >> sequence of row locations for the value entries, while the @p items
    >> are somehow connected with the columns (I think, since 127 and
    >> 126 = number of columns from the @Dim leaf are only off by 1).

You are right, David.
Well, they follow sparse matrix standards, which (like C) start counting at 0.

    >> Doing this
    >> > colSums(as.matrix(train$data))

The above colSums() again is "very" inefficient:
all such R functions have smartly defined Matrix methods that directly
work on sparse matrices.

Note that as.matrix(M) can "blow up" your R, when the matrix M is really
large and sparse, such that its dense version does not even fit in your
computer's RAM.

    >> [...]
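[To make the density measure and the memory argument concrete, a small sketch using a random sparse matrix from Matrix::rsparsematrix; the matrix and the exact byte counts are illustrative, not from the thread.]

    library(Matrix)
    set.seed(1)
    A <- rsparsematrix(1000, 1000, density = 0.01)  # ~1% of entries nonzero

    nnzero(A) / length(A)      # the "density" described above, about 0.01

    object.size(A)             # sparse (dgCMatrix) storage
    object.size(as.matrix(A))  # dense storage: roughly 8 MB for 10^6 doubles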
David Winsemius
2017-Oct-21 16:05 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
> On Oct 21, 2017, at 7:50 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> [...]
>
> The above colSums() again is "very" inefficient:
> all such R functions have smartly defined Matrix methods that directly
> work on sparse matrices.

I did get an error with colSums(train$data):

  > colSums(train$data)
  Error in colSums(train$data) :
    'x' must be an array of at least two dimensions

Which, as it turned out, was due to my having not yet loaded pkg:Matrix.
Perhaps the xgboost package only imports certain functions from pkg:Matrix,
and colSums is not one of them. This resembles the errors I get when I try
to use grid package functions on ggplot2 objects. Since ggplot2 is built on
top of grid, I am always surprised when this happens, and after a headslap
and explicitly loading pkg:grid I continue on my stumbling way.

  library(Matrix)
  colSums(train$data)  # no error

> Note that as.matrix(M) can "blow up" your R, when the matrix M is really
> large and sparse, such that its dense version does not even fit in your
> computer's RAM.

I did know that, so I first calculated whether the dense matrix version of
that object would fit in my RAM space; it fit easily, so I proceeded.

I find the TsparseMatrix indexing easier for my more naive notion of
sparsity, although thinking about it now, I think I can see that the
CsparseMatrix more closely resembles the "folded vector" design of dense
R matrices. I will sometimes coerce CMatrix objects to TMatrix objects if
I am working on the "inner" indices. I should probably stop doing that.

I sincerely hope my stumbling efforts have not caused any delays.

-- 
David.

> [...]

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'
  -Gehm's Corollary to Clarke's Third Law
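[A small sketch of the triplet (Tsparse) versus compressed-column (Csparse) representations David mentions; the tiny matrix is made up for illustration and is not from the thread.]

    library(Matrix)
    m <- sparseMatrix(i = c(1, 3, 1, 2), j = c(1, 1, 2, 3),
                      x = c(10, 20, 30, 40), dims = c(3, 3))  # a dgCMatrix

    mT <- as(m, "TsparseMatrix")   # triplet form (class dgTMatrix)
    mT@i                           # 0-based row indices
    mT@j                           # 0-based column indices, stored explicitly
    mT@x                           # the same stored values

    mC <- as(mT, "CsparseMatrix")  # back to compressed-column form
    identical(as.matrix(m), as.matrix(mC))   # TRUE: same matrix either way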
Martin Maechler
2017-Oct-21 16:27 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
>>>>> C W <tmrsg11 at gmail.com>
>>>>>     on Fri, 20 Oct 2017 16:01:06 -0400 writes:

    > Subsetting with [] vs. head() gives different results. R code:

    >> head(train$data, 5)
    > [1] 0 0 1 0 0

The above is surprising ... and points to a bug somewhere.
It is different (and correct) after you do

    require(Matrix)

but I think something like that should happen semi-automatically.

As I just see, it is even worse if you get the data from xgboost without
loading the xgboost package, which you can do (and is also more
efficient!): if you start R and then do

    data(agaricus.train, package='xgboost')
    loadedNamespaces()  # does not contain "xgboost" nor "Matrix"

so, no wonder head(agaricus.train$data) does not find head()'s "Matrix"
method [which _is_ exported by Matrix via exportMethods(.)].

But even more curiously, even after I do

    loadNamespace("Matrix")

methods(head) now does show the "Matrix" method, but then head() *still*
does not call it.

There's a bug somewhere, and I suspect it's in R's data() or the methods
package or ?? rather than in 'Matrix'. But that will be another thread on
R-devel or R's bugzilla.

Martin

    > [...]
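[A minimal sketch of the dispatch behaviour described above, assuming the agaricus data from xgboost; the result of the first head() call is as reported earlier in the thread, and attaching Matrix is what makes the sparse-matrix method visible to head().]

    data(agaricus.train, package = "xgboost")
    train <- agaricus.train

    head(train$data, 5)   # without Matrix attached: the default method is
                          # used and plain numbers are printed

    library(Matrix)
    head(train$data, 5)   # now head() dispatches to the "Matrix" method and
                          # returns the first 5 rows as a sparse matrix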