thr3ads.net - R devel - [Rd] head.matrix can return 1000s of columns -- limit to n or add new argument? [Oct 2019]

Gabriel Becker

2019-Oct-29 19:43 UTC

[Rd] head.matrix can return 1000s of columns -- limit to n or add new argument?

Hi all,

So I've started working on this and I ran into something that I didn't
know, namely that for x a multi-dimensional (2+) array, head(x) and tail(x)
ignore dimension completely, treat x as an atomic vector, and return an
(unclassed) atomic vector:
> x = array(100, c(4, 5, 5))
> dim(x)
[1] 4 5 5
> head(x, 1)
[1] 100
> class(head(x))
[1] "numeric"


(For a 1d array, it does return another 1d array).

When extending head/tail to understand multiple dimensions as discussed in
this thread, then, should the behavior for 2+d arrays be explicitly
retained, or should head and tail do the analogous thing (with a head(<2d
array>) behaving the same as head(<matrix>), which honestly is what I
expected to already be happening)?

Are people using/relying on this behavior in their code, and if so, why/for
what?

Even more generally, one way forward is to have the default methods check
for dimensions, and use length if it is null:

tail.default <- tail.data.frame <- function(x, n = 6L, ...)
{
    if(any(n == 0))
        stop("n must be non-zero or unspecified for all dimensions")
    if(!is.null(dim(x)))
        dimsx <- dim(x)
    else
        dimsx <- length(x)

    ## this returns a list of vectors of indices in each
    ## dimension, regardless of the length of the n
    ## argument
    sel <- lapply(seq_along(dimsx), function(i) {
        dxi <- dimsx[i]
        ## select all indices (full dim) if not specified
        ni <- if(length(n) >= i) n[i] else dxi
        ## handle negative ns
        ni <- if (ni < 0L) max(dxi + ni, 0L) else min(ni, dxi)
        seq.int(to = dxi, length.out = ni)
    })
    args <- c(list(x), sel, drop = FALSE)
    do.call("[", args)
}


I think this precludes the need for a separate data.frame method at all,
actually, though (I would think) tail.data.frame would still be defined and
exported for backwards compatibility. (the matrix method has some extra
bits so my current conception of it is still separate, though it might not
NEED to be).
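
As a rough illustration of that claim, the sketch below copies the function above under a hypothetical name, tail_nd (chosen only so that it does not mask utils::tail), and calls it on a plain vector, a data.frame and a 3-d array. It is a sketch under those assumptions, not the final utils code.

```r
# Hypothetical helper: same logic as the tail.default sketch above,
# renamed so that it does not mask utils::tail.
tail_nd <- function(x, n = 6L, ...) {
    if (any(n == 0))
        stop("n must be non-zero or unspecified for all dimensions")
    # fall back to length() for objects without a dim attribute
    dimsx <- if (!is.null(dim(x))) dim(x) else length(x)
    sel <- lapply(seq_along(dimsx), function(i) {
        dxi <- dimsx[i]
        # dimensions not covered by n are kept whole
        ni <- if (length(n) >= i) n[i] else dxi
        # negative n means "all but the first |n|"
        ni <- if (ni < 0L) max(dxi + ni, 0L) else min(ni, dxi)
        seq.int(to = dxi, length.out = ni)
    })
    do.call("[", c(list(x), sel, drop = FALSE))
}

v  <- 1:10
df <- data.frame(a = 1:10, b = letters[1:10])
a3 <- array(seq_len(4 * 5 * 5), c(4, 5, 5))

tail_nd(v, 3)         # last three elements of the vector
tail_nd(df, 3)        # last three rows, both columns
dim(tail_nd(a3, 2))   # last two indices along the first dimension
```

The same body handles all three inputs because `[` with drop = FALSE accepts one index per dimension in each case.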

The question then becomes, should head/tail always return something with
the same dimensionality (number of dims) as it got, or should data.frame and
matrix be special-cased in this regard, as they are now?

What are people's thoughts?
~G


Jan Gorecki

2019-Oct-30 05:31 UTC

Gabriel,
My view is rather radical.

- head/tail should return an object with the same number of dimensions as its input
- data.frame should be a special case
- matrix should be handled as 2D array

P.S. The idea of accepting `n` as a vector with one entry per
dimension is a brilliant one.
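
To make the vector-valued `n` idea concrete, here is a minimal sketch of a head analogue. The name head_nd is hypothetical and the body simply mirrors Gabriel's tail sketch with seq_len() in place of seq.int(to = ...); it is an assumption about how such a function could look, not existing utils code.

```r
# Hypothetical helper (not part of utils): head() with one entry of n
# per dimension; dimensions beyond length(n) are kept whole.
head_nd <- function(x, n = 6L) {
    dimsx <- if (!is.null(dim(x))) dim(x) else length(x)
    sel <- lapply(seq_along(dimsx), function(i) {
        dxi <- dimsx[i]
        ni <- if (length(n) >= i) n[i] else dxi
        # negative n means "all but the last |n|"
        ni <- if (ni < 0L) max(dxi + ni, 0L) else min(ni, dxi)
        seq_len(ni)
    })
    do.call("[", c(list(x), sel, drop = FALSE))
}

m <- matrix(seq_len(10 * 2000), nrow = 10, ncol = 2000)
dim(head_nd(m))            # 6 rows, still all 2000 columns
dim(head_nd(m, c(6, 4)))   # 6 rows, 4 columns: a printable corner
```

With a scalar `n` the result still has thousands of columns, which is the complaint in the subject line; a second entry in `n` limits the columns as well.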


Martin Maechler

2019-Oct-30 11:29 UTC

>>>>> Gabriel Becker
>>>>>     on Tue, 29 Oct 2019 12:43:15 -0700 writes:

    > Hi all,
    > So I've started working on this and I ran into something that I didn't
    > know, namely that for x a multi-dimensional (2+) array, head(x) and tail(x)
    > ignore dimension completely, treat x as an atomic vector, and return an
    > (unclassed) atomic vector:

Well, that's  (3+), not "2+" .

But I did write (on Sep 17 in this thread!)

  > The current source for head() and tail() and all their methods
  > in utils is just 83 lines of code  {file utils/R/head.R minus
  > the initial mostly copyright comments}.

and if you've ever looked at these few dozen lines of R code, you'll
have seen that we just added two simple utilities with a few
reasonably simple methods.  To treat non-matrix (i.e. non-2d)
arrays as vectors, is typically not unreasonable in R, but
indeed with your proposals (in this thread), such non-2d arrays
should be treated differently either via new  head.array() /
tail.array() methods ((or -- only if it can be done more nicely -- by
the default method)).

Note however the following  historical quirk :
> sapply(setNames(,1:5), function(K) inherits(array(pi, dim=1:K), "array"))
    1     2     3     4     5 
 TRUE FALSE  TRUE  TRUE  TRUE 

(Is this something we should consider changing for R 4.0.0 -- to
 have it TRUE also for 2d-arrays aka matrix objects ??)

The consequence of that is that
currently, "often"   foo.matrix is just a copy of foo.array  in
the case the latter exists:
"base" examples: foo in {unique, duplicated, anyDuplicated}.

So I propose you change current  head.matrix and tail.matrix  to
head.array and tail.array
(and then have   head.matrix <- head.array  etc, at least if the
 above quirk must remain, or remains (which I currently guess to
 be the case)).
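
A minimal sketch of that aliasing pattern, using a hypothetical generic named corner rather than head or tail so that nothing in utils is masked. The method body is illustrative only; the point is the explicit corner.matrix <- corner.array assignment.

```r
# Hypothetical generic; not part of base R or utils.
corner <- function(x, ...) UseMethod("corner")

corner.array <- function(x, ...) {
    # keep at most two indices along every dimension
    idx <- lapply(dim(x), function(d) seq_len(min(d, 2L)))
    do.call("[", c(list(x), idx, drop = FALSE))
}

# Explicit alias, so that 2-d arrays reach the same code even where
# a matrix does not inherit from "array".
corner.matrix <- corner.array

dim(corner(matrix(1:20, 4, 5)))         # 2 2
dim(corner(array(1:100, c(4, 5, 5))))   # 2 2 2
```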



Gabriel Becker

2019-Oct-31 19:46 UTC

Hi Martin,


On Wed, Oct 30, 2019 at 4:30 AM Martin Maechler <maechler at stat.math.ethz.ch> wrote:

> >>>>> Gabriel Becker
> >>>>>     on Tue, 29 Oct 2019 12:43:15 -0700 writes:
>
>     > Hi all,
>     > So I've started working on this and I ran into something that I didn't
>     > know, namely that for x a multi-dimensional (2+) array, head(x) and tail(x)
>     > ignore dimension completely, treat x as an atomic vector, and return an
>     > (unclassed) atomic vector:
>
> Well, that's  (3+), not "2+" .
>
You're correct, of course. Apologies for that.
>
> But I did write (on Sep 17 in this thread!)
>
>   > The current source for head() and tail() and all their methods
>   > in utils is just 83 lines of code  {file utils/R/head.R minus
>   > the initial mostly copyright comments}.
>
> and if you've ever looked at these few dozen lines of R code, you'll
> have seen that we just added two simple utilities with a few
> reasonably simple methods.  To treat non-matrix (i.e. non-2d)
> arrays as vectors, is typically not unreasonable in R, but
> indeed with your proposals (in this thread), such non-2d arrays
> should be treated differently either via new  head.array() /
> tail.array() methods ((or -- only if it can be done more nicely -- by
> the default method)).
>
I hope you didn't construe my expression of surprise (which was honest) as a
criticism. It is just quite literally not what I thought head(array(100, c(25,
2, 2))) would do, based on what head.matrix does, is all.

>
> Note however the following  historical quirk :
>
> > sapply(setNames(,1:5), function(K) inherits(array(pi, dim=1:K), "array"))
>     1     2     3     4     5
>  TRUE FALSE  TRUE  TRUE  TRUE
>
> (Is this something we should consider changing for R 4.0.0 -- to
>  have it TRUE also for 2d-arrays aka matrix objects ??)
>
That is pretty odd. IMHO it would be quite nice from a design perspective
to fix that, but I do wonder, as I infer you do as well, how much code it
would break.

Changing this would cause problems in any case where a generic has an array
method but no matrix method, as well as any code that explicitly checks
inherits(x, "array") assuming matrices won't return TRUE, correct? My
intuition is that the former would be pretty rare, though it might be a fun
little problem to figure it out.  The latter is ...probably also fairly
rare? My intuition on that one is less strong though.
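
The first case can be sketched with a hypothetical generic, describe, that has an array method and a default method but no matrix method. Which method a matrix reaches depends on whether inherits(m, "array") is TRUE in the running R; my understanding is that it is FALSE in the R current at the time of this thread and TRUE from R 4.0.0 on, so the sketch prints the result rather than assuming one.

```r
# Hypothetical generic with an array method but no matrix method.
describe <- function(x) UseMethod("describe")
describe.array   <- function(x) "array method"
describe.default <- function(x) "default method"

m <- matrix(1:4, 2, 2)

inherits(m, "array")               # depends on the R version
describe(m)                        # "array method" only if the line above is TRUE
describe(array(1:8, c(2, 2, 2)))   # "array method" either way
```

Any package relying on the default method being reached for matrices would see its behaviour change if the quirk were fixed.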

>
> The consequence of that is that
> currently, "often"   foo.matrix is just a copy of foo.array  in
> the case the latter exists:
> "base" examples: foo in {unique, duplicated, anyDuplicated}.
>
> So I propose you change current  head.matrix and tail.matrix  to
> head.array and tail.array
> (and then have   head.matrix <- head.array  etc, at least if the
>  above quirk must remain, or remains (which I currently guess to
>  be the case)).
>
>
Absolutely, will do. I'm gratified we're going after the more general
approach. Thanks for working with us on this.

Best,
~G


Pages, Herve

2019-Oct-31 21:02 UTC

On 10/30/19 04:29, Martin Maechler wrote:
>>>>>> Gabriel Becker
>>>>>>      on Tue, 29 Oct 2019 12:43:15 -0700 writes:
> 
>      > Hi all,
>      > So I've started working on this and I ran into something that I didn't
>      > know, namely that for x a multi-dimensional (2+) array, head(x) and tail(x)
>      > ignore dimension completely, treat x as an atomic vector, and return an
>      > (unclassed) atomic vector:
> 
> Well, that's  (3+), not "2+" .
> 
> But I did write (on Sep 17 in this thread!)
> 
>    > The current source for head() and tail() and all their methods
>    > in utils is just 83 lines of code  {file utils/R/head.R minus
>    > the initial mostly copyright comments}.
> 
> and if you've ever looked at these few dozen lines of R code, you'll
> have seen that we just added two simple utilities with a few
> reasonably simple methods.  To treat non-matrix (i.e. non-2d)
> arrays as vectors, is typically not unreasonable in R, but
> indeed with your proposals (in this thread), such non-2d arrays
> should be treated differently either via new  head.array() /
> tail.array() methods ((or -- only if it can be done more nicely -- by
> the default method)).
> 
> Note however the following  historical quirk :
> 
>> sapply(setNames(,1:5), function(K) inherits(array(pi, dim=1:K), "array"))
>      1     2     3     4     5
>   TRUE FALSE  TRUE  TRUE  TRUE
> 
> (Is this something we should consider changing for R 4.0.0 -- to
>   have it TRUE also for 2d-arrays aka matrix objects ??)

That would be awesome! More generally I wonder how feasible it would be
to fix all these inheritance quirks where inherits(x, "something"),
is(x, "something"), and is.something(x) disagree. They've been such a
nuisance for so many years...

Thanks,
H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

peter dalgaard

2019-Oct-31 22:04 UTC

Hmm, the problem I see here is that these implied classes are all inherently
one-off. We also have
> inherits(matrix(1,1,1),"numeric")
[1] FALSE
> is.numeric(matrix(1,1,1))
[1] TRUE
> inherits(1L,"numeric")
[1] FALSE
> is.numeric(1L)
[1] TRUE

and if we start fixing one, we might need to fix all. 

For method dispatch, we do have inheritance, e.g.
> foo.numeric <- function(x) x + 1
> foo <- function(x) UseMethod("foo")
> foo(1)
[1] 2
> foo(1L)
[1] 2
> foo(matrix(1,1,1))
     [,1]
[1,]    2
> foo.integer <- function(x) x + 2
> foo(1)
[1] 2
> foo(1L)
[1] 3
> foo(matrix(1,1,1))
     [,1]
[1,]    2
> foo(matrix(1L,1,1))
     [,1]
[1,]    3

but these are not all automatic: "integer" implies "numeric", but "matrix"
does not imply "numeric", much less "integer".

Also, we seem to have a rule that inherits(x, c) iff c %in% class(x), which
would break -- unless we change class(x) to return the whole set of inherited
classes, which I sense that we'd rather not do....
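
The rule can be checked directly. In the sketch below rule_holds is a hypothetical helper written for this illustration, not an existing function; it compares inherits() against membership in class(x) for a few objects, next to what is.numeric() reports for the same objects.

```r
# Hypothetical helper: does "inherits(x, cl) iff cl %in% class(x)" hold for x?
rule_holds <- function(x, cl) identical(inherits(x, cl), cl %in% class(x))

xs <- list(double = 1, integer = 1L, matrix = matrix(1, 1, 1))

vapply(xs, rule_holds, logical(1), cl = "numeric")   # TRUE for all three
vapply(xs, is.numeric, logical(1))                   # TRUE for all three
vapply(xs, inherits, logical(1), what = "numeric")   # TRUE only for the double
```

So inherits() stays consistent with class(x) precisely because it ignores the implied classes that is.numeric() and method dispatch take into account.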

-pd

> On 30 Oct 2019, at 12:29 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> 
> Note however the following  historical quirk :
> 
>> sapply(setNames(,1:5), function(K) inherits(array(pi, dim=1:K), "array"))
>    1     2     3     4     5 
> TRUE FALSE  TRUE  TRUE  TRUE 
> 
> (Is this something we should consider changing for R 4.0.0 -- to
> have it TRUE also for 2d-arrays aka matrix objects ??)
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
