C W
2017-Oct-20 19:51 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
Thank you for your responses. I guess I don't feel alone. I don't find that the documentation goes into any detail.

I also find it surprising that

> object.size(train$data)
1730904 bytes

> object.size(as.matrix(train$data))
6575016 bytes

the dgCMatrix actually takes less memory, though it *looks* like the opposite.

Cheers!

On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsemius at comcast.net> wrote:

>> On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
>>
>> Dear R list,
>>
>> I came across dgCMatrix. I believe this class is associated with sparse
>> matrices.
>
> Yes. See:
>
>     help('dgCMatrix-class', pack=Matrix)
>
> If Martin Maechler happens to respond to this you should listen to him
> rather than anything I write. Much of what the Matrix package does appears
> to be magical to one such as I.
>
>> I see there are 8 attributes to train$data. I am confused why there are
>> so many; some are vectors. What do they do?
>>
>> Here's the R code:
>>
>> library(xgboost)
>> data(agaricus.train, package='xgboost')
>> data(agaricus.test, package='xgboost')
>> train <- agaricus.train
>> test <- agaricus.test
>> attributes(train$data)
>
> I got a bit of an annoying surprise when I did something similar. It
> appeared to me that I did not need to load the xgboost library, since all
> that was being asked was "where is the data" in an object that could be
> loaded from that package using the `data` function. The last command, asking
> for the attributes, filled up my console with a 100K-length vector (actually
> 2 such vectors). The `str` function returns a more useful result.
>
> > data(agaricus.train, package='xgboost')
> > train <- agaricus.train
> > names( attributes(train$data) )
> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"  "class"
> > str(train$data)
> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>   ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>   ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
>   ..@ Dim     : int [1:2] 6513 126
>   ..@ Dimnames:List of 2
>   .. ..$ : NULL
>   .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>   ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>   ..@ factors : list()
>
>> Where is the data: is it in $p, $i, or $x?
>
> So the "data" (meaning the values of the sparse matrix) are in the @x
> leaf. The values all appear to be the number 1. The @i leaf is the sequence
> of row locations for the value entries, while the @p items are somehow
> connected with the columns (I think, since 127 and 126 = number of columns
> from the @Dim leaf are only off by 1).
>
> Doing this:
>
> > colSums(as.matrix(train$data))
>      cap-shape=bell   cap-shape=conical
>                 369                   3
>    cap-shape=convex      cap-shape=flat
>                2934                2539
>   cap-shape=knobbed    cap-shape=sunken
>                 644                  24
> cap-surface=fibrous cap-surface=grooves
>                1867                   4
>   cap-surface=scaly  cap-surface=smooth
>                2607                2035
>     cap-color=brown      cap-color=buff
>                1816
> # now snipping the rest of that output.
>
> Now this makes me think that the @p vector gives you the cumulative sum of
> the number of items per column:
>
> > all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
> [1] TRUE
>
>> Thank you very much!
>
> Please read the Posting Guide. Your code was not mangled in this instance,
> but HTML code often arrives in an unreadable mess.
>
> David Winsemius
> Alameda, CA, USA
>
> 'Any technology distinguishable from magic is insufficiently advanced.'
>   -Gehm's Corollary to Clarke's Third Law
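[A small worked example of the compressed-sparse-column (CSC) layout that David infers above. The tiny matrix and its values are made up for illustration; only the Matrix package is assumed. Note that his cumsum(colSums(...)) check matches @p only because every stored value in the agaricus data happens to be 1; in general @p counts stored entries rather than summing them.]

    library(Matrix)

    m <- sparseMatrix(i = c(1, 3, 1, 2),      # row positions (1-based here)
                      j = c(1, 1, 2, 3),      # column positions
                      x = c(10, 20, 30, 40),  # the stored values
                      dims = c(3, 3))
    class(m)   # "dgCMatrix"

    m@x        # the nonzero values:  10 20 30 40
    m@i        # 0-based row indices:  0  2  0  1
    m@p        # column "pointers": cumulative count of stored entries
               # per column:           0  2  3  4

    ## so the number of stored entries in column k is diff(m@p)[k]:
    all(diff(m@p) == colSums(m != 0))   # TRUE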
C W
2017-Oct-20 20:01 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
Subsetting with [] vs. head() gives different results. R code:

> head(train$data, 5)
[1] 0 0 1 0 0

> train$data[1:5, 1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
     cap-shape=bell cap-shape=conical cap-shape=convex
[1,]              .                 .                1
[2,]              .                 .                1
[3,]              1                 .                .
[4,]              .                 .                1
[5,]              .                 .                1
     cap-shape=flat cap-shape=knobbed
[1,]              .                 .
[2,]              .                 .
[3,]              .                 .
[4,]              .                 .
[5,]              .                 .

On Fri, Oct 20, 2017 at 3:51 PM, C W <tmrsg11 at gmail.com> wrote:

> [...]
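[A minimal sketch of how "[" behaves on a dgCMatrix once Matrix is attached, assuming the agaricus data loaded as above. Two-index subsetting keeps the sparse representation; extracting a single column drops to a plain numeric vector unless drop = FALSE is given.]

    library(Matrix)
    data(agaricus.train, package = "xgboost")
    train <- agaricus.train

    sub <- train$data[1:5, 1:5]
    class(sub)                              # still "dgCMatrix"

    col1  <- train$data[, 1]                # drops to a plain numeric vector
    col1s <- train$data[, 1, drop = FALSE]  # stays a 1-column sparse matrix
    c(class(col1), class(col1s))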
Martin Maechler
2017-Oct-21 14:50 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
>>>>> C W <tmrsg11 at gmail.com>
>>>>>     on Fri, 20 Oct 2017 15:51:16 -0400 writes:

    > Thank you for your responses. I guess I don't feel
    > alone. I don't find the documentation go into any detail.

    > I also find it surprising that,

    >> object.size(train$data)
    > 1730904 bytes

    >> object.size(as.matrix(train$data))
    > 6575016 bytes

    > the dgCMatrix actually takes less memory, though it
    > *looks* like the opposite.

to whom?

The whole idea of these sparse matrix classes in the 'Matrix'
package (and everywhere else in applied math, CS, ...) is that

1. they need much less memory, and
2. matrix arithmetic with them can be much faster, because it is based on
   sophisticated sparse matrix linear algebra, notably the sparse Cholesky
   decomposition for solve() etc.

Of course the efficiency only applies if most of the matrix entries _are_ 0.
You can measure the "sparsity", or rather the "density", of a matrix by

    nnzero(A) / length(A)

where length(A) == nrow(A) * ncol(A), as for regular matrices
(but it does *not* integer overflow), and nnzero(.) is a simple utility
from Matrix which -- very efficiently for sparseMatrix objects -- gives
the number of nonzero entries of the matrix.

All of these classes are formally defined classes and therefore have help
pages: here ?dgCMatrix-class, which then points to ?CsparseMatrix-class
(and I forget if RStudio really helps you find these; in emacs ESS they
are found nicely via the usual key).

To get started, you may further look at ?Matrix _and_ ?sparseMatrix
(and possibly the Matrix package vignettes --- though they need work --
I'm happy for collaborators there!)

Bill Dunlap's comment applies indeed:
in principle all these matrices should work like regular numeric matrices,
just faster and with a smaller memory footprint if they are really sparse
(and not just formally of a sparseMatrix class)
((and there are quite a few more niceties in the package)).

Martin Maechler
(here, maintainer of 'Matrix')

    >> [...]
    >> So the "data" (meaning the values of the sparse matrix) are in the @x
    >> leaf. The values all appear to be the number 1. The @i leaf is the
    >> sequence of row locations for the value entries, while the @p items
    >> are somehow connected with the columns (I think, since 127 and
    >> 126 = number of columns from the @Dim leaf are only off by 1).

You are right, David.
Well, they follow sparse matrix standards, which (like C) start counting at 0.

    >> Doing this
    >> > colSums(as.matrix(train$data))

The above colSums() again is "very" inefficient:
all such R functions have smartly defined Matrix methods that directly
work on sparse matrices.

Note that as.matrix(M) can "blow up" your R, when the matrix M is really
large and sparse, such that its dense version does not even fit in your
computer's RAM.

    >> [...]
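[To make the density measure and the memory argument concrete, a small sketch using a random sparse matrix from Matrix::rsparsematrix; the matrix and the exact byte counts are illustrative, not from the thread.]

    library(Matrix)
    set.seed(1)
    A <- rsparsematrix(1000, 1000, density = 0.01)  # ~1% of entries nonzero

    nnzero(A) / length(A)      # the "density" described above, about 0.01

    object.size(A)             # sparse (dgCMatrix) storage
    object.size(as.matrix(A))  # dense storage: roughly 8 MB for 10^6 doubles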
David Winsemius
2017-Oct-21 16:05 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
> On Oct 21, 2017, at 7:50 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> [...]
>
> The above colSums() again is "very" inefficient:
> all such R functions have smartly defined Matrix methods that directly
> work on sparse matrices.

I did get an error with colSums(train$data):

  > colSums(train$data)
  Error in colSums(train$data) :
    'x' must be an array of at least two dimensions

Which, as it turned out, was due to my having not yet loaded pkg:Matrix.
Perhaps the xgboost package only imports certain functions from pkg:Matrix,
and colSums is not one of them. This resembles the errors I get when I try
to use grid package functions on ggplot2 objects. Since ggplot2 is built on
top of grid, I am always surprised when this happens, and after a headslap
and explicitly loading pkg:grid I continue on my stumbling way.

  library(Matrix)
  colSums(train$data)  # no error

> Note that as.matrix(M) can "blow up" your R, when the matrix M is really
> large and sparse, such that its dense version does not even fit in your
> computer's RAM.

I did know that, so I first calculated whether the dense matrix version of
that object would fit in my RAM space; it fit easily, so I proceeded.

I find the TsparseMatrix indexing easier for my more naive notion of
sparsity, although thinking about it now, I think I can see that the
CsparseMatrix more closely resembles the "folded vector" design of dense
R matrices. I will sometimes coerce CMatrix objects to TMatrix objects if
I am working on the "inner" indices. I should probably stop doing that.

I sincerely hope my stumbling efforts have not caused any delays.

-- 
David.

> [...]

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'
  -Gehm's Corollary to Clarke's Third Law
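[A small sketch of the triplet (Tsparse) versus compressed-column (Csparse) representations David mentions; the tiny matrix is made up for illustration and is not from the thread.]

    library(Matrix)
    m <- sparseMatrix(i = c(1, 3, 1, 2), j = c(1, 1, 2, 3),
                      x = c(10, 20, 30, 40), dims = c(3, 3))  # a dgCMatrix

    mT <- as(m, "TsparseMatrix")   # triplet form (class dgTMatrix)
    mT@i                           # 0-based row indices
    mT@j                           # 0-based column indices, stored explicitly
    mT@x                           # the same stored values

    mC <- as(mT, "CsparseMatrix")  # back to compressed-column form
    identical(as.matrix(m), as.matrix(mC))   # TRUE: same matrix either way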
Martin Maechler
2017-Oct-21 16:27 UTC
[R] What exactly is an dgCMatrix-class. There are so many attributes.
>>>>> C W <tmrsg11 at gmail.com>
>>>>>     on Fri, 20 Oct 2017 16:01:06 -0400 writes:

    > Subsetting with [] vs. head() gives different results. R code:

    >> head(train$data, 5)
    > [1] 0 0 1 0 0

The above is surprising ... and points to a bug somewhere.
It is different (and correct) after you do

    require(Matrix)

but I think something like that should happen semi-automatically.

As I just see, it is even worse if you get the data from xgboost without
loading the xgboost package, which you can do (and is also more
efficient!): if you start R and then do

    data(agaricus.train, package='xgboost')
    loadedNamespaces()  # does not contain "xgboost" nor "Matrix"

so, no wonder head(agaricus.train$data) does not find head()'s "Matrix"
method [which _is_ exported by Matrix via exportMethods(.)].

But even more curiously, even after I do

    loadNamespace("Matrix")

methods(head) now does show the "Matrix" method, but then head() *still*
does not call it.

There's a bug somewhere, and I suspect it's in R's data() or the methods
package or ?? rather than in 'Matrix'. But that will be another thread on
R-devel or R's bugzilla.

Martin

    > [...]
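[A minimal sketch of the dispatch behaviour described above, assuming the agaricus data from xgboost; the result of the first head() call is as reported earlier in the thread, and attaching Matrix is what makes the sparse-matrix method visible to head().]

    data(agaricus.train, package = "xgboost")
    train <- agaricus.train

    head(train$data, 5)   # without Matrix attached: the default method is
                          # used and plain numbers are printed

    library(Matrix)
    head(train$data, 5)   # now head() dispatches to the "Matrix" method and
                          # returns the first 5 rows as a sparse matrix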