thr3ads.net - R devel - [Rd] Subsetting the "ROW"s of an object [Jun 2018]

Berry, Charles

2018-Jun-08 18:38 UTC

[Rd] Subsetting the "ROW"s of an object

> On Jun 8, 2018, at 10:37 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
> 
> Also the TRUEs cause problems if some dimensions are 0:
> 
>  > matrix(raw(0), nrow=5, ncol=0)[1:3 , TRUE]
>  Error in matrix(raw(0), nrow = 5, ncol = 0)[1:3, TRUE] :
>    (subscript) logical subscript too long
OK. But this is easy enough to handle. 
> 
> H.
> 
> On 06/08/2018 10:29 AM, Hadley Wickham wrote:
>> I suspect this will have suboptimal performance since the TRUEs will
>> get recycled. (Maybe there is, or could be, ALTREP, support for
>> recycling)
>> Hadley

AFAICS, it is not an issue. Taking

arr <- array(rnorm(2^22),c(2^10,4,4,4))

as a test case 

and using a function that will either use the literal code
`x[i,,,,drop=FALSE]' or `eval(mc)':

subset_ROW4 <-
     function(x, i, useLiteral=FALSE)
{
    literal <- quote(x[i,,,,drop=FALSE])
    mc <- quote(x[i])
    nd <- max(1L, length(dim(x)))
    mc[seq(4,length=nd-1L)] <- rep(TRUE, nd-1L)
    mc[["drop"]] <- FALSE
    if (useLiteral)
        eval(literal)
    else
        eval(mc)
 }

I get identical times with

system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),TRUE))

and with 

system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),FALSE))

Changing the dimensions to c(2^5, 2^7, 4, 4 ) and running something similar also
shows equal times.
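
To make the `eval(mc)' branch concrete, here is a minimal sketch that builds the same kind of call for a small 3-d array and checks it against the literal form. The array `x` and index `i` below are made up for illustration and are not the benchmark data used above.

```r
# Small 3-d array, for illustration only (not the benchmark array)
x <- array(seq_len(2 * 3 * 4), c(2, 3, 4))
i <- 2L

# Start from x[i] and append one TRUE per trailing dimension
mc <- quote(x[i])
nd <- max(1L, length(dim(x)))
mc[seq(4, length.out = nd - 1L)] <- rep(list(TRUE), nd - 1L)
mc[["drop"]] <- FALSE

# Inspect the constructed call; it stands for x[i, TRUE, TRUE, drop = FALSE]
print(mc)

# Evaluate it and compare with the hand-written form using missing subscripts
res_mc <- eval(mc)
res_literal <- x[i, , , drop = FALSE]
same <- identical(res_mc, res_literal)
```

Both forms keep all dimensions because of drop = FALSE, so the comparison is on full arrays rather than on dropped vectors.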

Chuck
>> On Fri, Jun 8, 2018 at 10:16 AM, Berry, Charles <ccberry at ucsd.edu> wrote:
>>>
>>>
>>>> On Jun 8, 2018, at 8:45 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> Is there a better way to subset the ROWs (in the sense of NROW) of
>>>> a vector, matrix, data frame or array than this?
>>> 
>>> 
>>> You can use TRUE to fill the subscripts for dimensions 2:nd
>>> 
>>>> 
>>>> subset_ROW <- function(x, i) {
>>>>  nd <- length(dim(x))
>>>>  if (nd <= 1L) {
>>>>    x[i]
>>>>  } else {
>>>>    dims <- rep(list(quote(expr = )), nd - 1L)
>>>>    do.call(`[`, c(list(quote(x), quote(i)), dims, list(drop = FALSE)))
>>>>  }
>>>> }
>>> 
>>> 
>>> subset_ROW <-
>>>     function(x,i)
>>> {
>>>     mc <- quote(x[i])
>>>     nd <- max(1L, length(dim(x)))
>>>     mc[seq(4, length=nd-1L)] <- rep(list(TRUE), nd - 1L)
>>>     mc[["drop"]] <- FALSE
>>>     eval(mc)
>>> 
>>> }
>>> 
>>>> 
>>>> subset_ROW(1:10, 4:6)
>>>> #> [1] 4 5 6
>>>> 
>>>> str(subset_ROW(array(1:10, c(10)), 2:4))
>>>> #>  int [1:3(1d)] 2 3 4
>>>> str(subset_ROW(array(1:10, c(10, 1)), 2:4))
>>>> #>  int [1:3, 1] 2 3 4
>>>> str(subset_ROW(array(1:10, c(5, 2)), 2:4))
>>>> #>  int [1:3, 1:2] 2 3 4 7 8 9
>>>> str(subset_ROW(array(1:10, c(10, 1, 1)), 2:4))
>>>> #>  int [1:3, 1, 1] 2 3 4
>>>> 
>>>> subset_ROW(data.frame(x = 1:10, y = 10:1), 2:4)
>>>> #>   x y
>>>> #> 2 2 9
>>>> #> 3 3 8
>>>> #> 4 4 7
>>>> 
>>> 
>>> HTH,
>>> 
>>> Chuck
>>> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319

Hadley Wickham

2018-Jun-08 18:52 UTC

[Rd] Subsetting the "ROW"s of an object

On Fri, Jun 8, 2018 at 11:38 AM, Berry, Charles <ccberry at ucsd.edu> wrote:
>
>
>> On Jun 8, 2018, at 10:37 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>
>> Also the TRUEs cause problems if some dimensions are 0:
>>
>>  > matrix(raw(0), nrow=5, ncol=0)[1:3 , TRUE]
>>  Error in matrix(raw(0), nrow = 5, ncol = 0)[1:3, TRUE] :
>>    (subscript) logical subscript too long
>
> OK. But this is easy enough to handle.
>
>>
>> H.
>>
>> On 06/08/2018 10:29 AM, Hadley Wickham wrote:
>>> I suspect this will have suboptimal performance since the TRUEs will
>>> get recycled. (Maybe there is, or could be, ALTREP, support for
>>> recycling)
>>> Hadley
>
>
> AFAICS, it is not an issue. Taking
>
> arr <- array(rnorm(2^22),c(2^10,4,4,4))
>
> as a test case
>
> and using a function that will either use the literal code `x[i,,,,drop=FALSE]' or `eval(mc)':
>
> subset_ROW4 <-
>      function(x, i, useLiteral=FALSE)
> {
>     literal <- quote(x[i,,,,drop=FALSE])
>     mc <- quote(x[i])
>     nd <- max(1L, length(dim(x)))
>     mc[seq(4,length=nd-1L)] <- rep(TRUE, nd-1L)
>     mc[["drop"]] <- FALSE
>     if (useLiteral)
>         eval(literal)
>     else
>         eval(mc)
>  }
>
> I get identical times with
>
> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),TRUE))
>
> and with
>
> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),FALSE))

I think that's because you used a relatively low-precision timing
mechanism, and included the index generation in the timing. I see:

arr <- array(rnorm(2^22),c(2^10,4,4,4))
i <- seq(1,length = 10, by = 100)

bench::mark(
  arr[i, TRUE, TRUE, TRUE],
  arr[i, , , ]
)
#> # A tibble: 2 x 1
#>   expression        min    mean   median      max  n_gc
#>   <chr>         <bch:t> <bch:t> <bch:tm> <bch:tm> <dbl>
#> 1 arr[i, TRUE,…   7.4µs  10.9µs  10.66µs   1.22ms     2
#> 2 arr[i, , , ]   7.06µs   8.8µs   7.85µs 538.09µs     2

So not a huge difference, but it's there.
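
For anyone without the bench package, a rough base-R version of the same comparison is sketched below. It assumes that hoisting the index out of the timed loop and repeating the subset many times is enough to show the trend. The repetition count `reps` and the names `t_true` and `t_missing` are arbitrary choices for this sketch, and system.time() is still a coarse timer, so the absolute numbers should not be read too closely.

```r
arr <- array(rnorm(2^22), c(2^10, 4, 4, 4))
i <- seq(1, length.out = 10, by = 100)   # index built once, outside the timed code

reps <- 1e4   # arbitrary repetition count for this sketch

# Time only the subsetting itself, once with TRUE subscripts ...
t_true <- system.time(
  for (k in seq_len(reps)) arr[i, TRUE, TRUE, TRUE]
)[["elapsed"]]

# ... and once with missing subscripts
t_missing <- system.time(
  for (k in seq_len(reps)) arr[i, , , ]
)[["elapsed"]]

# Elapsed seconds for each form; rerun a few times, since the values fluctuate
c(true_subscripts = t_true, missing_subscripts = t_missing)
```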

Hadley


-- 
http://hadley.nz

Hervé Pagès

2018-Jun-08 19:13 UTC

[Rd] Subsetting the "ROW"s of an object

A missing subscript is still preferable to a TRUE though because it
carries the meaning "take it all". A TRUE also achieves this but via
implicit recycling. For example x[ , , ] and x[TRUE, TRUE, TRUE]
achieve the same thing (if length(x) != 0) and are both no-ops but
the subsetting code gets a chance to immediately and easily detect
the former as a no-op whereas it will probably not be able to do it
so easily for the latter. So in this case it will most likely generate
a copy of 'x' and fill the new array by taking a full walk on it.
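
Missing subscripts can also be generated programmatically, and they sidestep the zero-extent problem reported earlier in the thread. The sketch below is one possible way to do it, not a proposal for base R: the helper name `subset_ROW_missing` is hypothetical, and the call is assembled with as.call() from a list that holds one empty argument per trailing dimension.

```r
# Hypothetical helper (name not from the thread): build
# x[i, , ..., drop = FALSE] with genuinely missing subscripts.
subset_ROW_missing <- function(x, i) {
  nd <- max(1L, length(dim(x)))
  # quote(expr = ) yields the empty symbol, i.e. a missing argument
  empty <- rep(list(quote(expr = )), nd - 1L)
  mc <- as.call(c(list(as.name("["), quote(x), quote(i)),
                  empty, list(drop = FALSE)))
  eval(mc)   # evaluated in this function's frame, where x and i live
}

# Zero-column matrix from the earlier report
m0 <- matrix(raw(0), nrow = 5, ncol = 0)

# The TRUE form was reported above to fail with "logical subscript too long";
# tryCatch records the message, or NULL if no error is raised
true_err <- tryCatch({ m0[1:3, TRUE]; NULL },
                     error = function(e) conditionMessage(e))

# The missing-subscript form keeps the zero-extent dimension
res0 <- subset_ROW_missing(m0, 1:3)
dim(res0)
```

The same helper also covers the plain vector, array, and data frame cases from the original post, since each reduces to x[i, drop = FALSE] or x[i, , drop = FALSE].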

H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Berry, Charles

2018-Jun-08 19:31 UTC

[Rd] Subsetting the "ROW"s of an object

> On Jun 8, 2018, at 11:52 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
> 
> On Fri, Jun 8, 2018 at 11:38 AM, Berry, Charles <ccberry at ucsd.edu> wrote:
>> 
>> 
>>> On Jun 8, 2018, at 10:37 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>> 
>>> Also the TRUEs cause problems if some dimensions are 0:
>>> 
>>>  > matrix(raw(0), nrow=5, ncol=0)[1:3 , TRUE]
>>> Error in matrix(raw(0), nrow = 5, ncol = 0)[1:3, TRUE] :
>>>   (subscript) logical subscript too long
>> 
>> OK. But this is easy enough to handle.
>> 
>>> 
>>> H.
>>> 
>>> On 06/08/2018 10:29 AM, Hadley Wickham wrote:
>>>> I suspect this will have suboptimal performance since the TRUEs will
>>>> get recycled. (Maybe there is, or could be, ALTREP, support for
>>>> recycling)
>>>> Hadley
>> 
>> 
>> AFAICS, it is not an issue. Taking
>> 
>> arr <- array(rnorm(2^22),c(2^10,4,4,4))
>> 
>> as a test case
>> 
>> and using a function that will either use the literal code `x[i,,,,drop=FALSE]' or `eval(mc)':
>> 
>> subset_ROW4 <-
>>     function(x, i, useLiteral=FALSE)
>> {
>>    literal <- quote(x[i,,,,drop=FALSE])
>>    mc <- quote(x[i])
>>    nd <- max(1L, length(dim(x)))
>>    mc[seq(4,length=nd-1L)] <- rep(TRUE, nd-1L)
>>    mc[["drop"]] <- FALSE
>>    if (useLiteral)
>>        eval(literal)
>>    else
>>        eval(mc)
>> }
>> 
>> I get identical times with
>> 
>> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),TRUE))
>> 
>> and with
>> 
>> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),FALSE))
> 
> I think that's because you used a relatively low-precision timing
> mechanism, and included the index generation in the timing. I see:
> 
> arr <- array(rnorm(2^22),c(2^10,4,4,4))
> i <- seq(1,length = 10, by = 100)
> 
> bench::mark(
>  arr[i, TRUE, TRUE, TRUE],
>  arr[i, , , ]
> )
> #> # A tibble: 2 x 1
> #>   expression        min    mean   median      max  n_gc
> #>   <chr>         <bch:t> <bch:t> <bch:tm> <bch:tm> <dbl>
> #> 1 arr[i, TRUE,…   7.4µs  10.9µs  10.66µs   1.22ms     2
> #> 2 arr[i, , , ]   7.06µs   8.8µs   7.85µs 538.09µs     2
> 
> So not a huge difference, but it's there.

Funny. I get similar results to yours above albeit with smaller differences.
Usually < 5 percent.

But with subset_ROW4 I see no consistent difference.

In this example, it runs faster on average using `eval(mc)' to return the
result:
> arr <- array(rnorm(2^22),c(2^10,4,4,4))
> i <- seq(1,length=10,by=100)
> bench::mark(subset_ROW4(arr,i,FALSE), subset_ROW4(arr,i,TRUE))[,1:8]
# A tibble: 2 x 8
  expression                      min     mean   median      max `itr/sec` mem_alloc  n_gc
  <chr>                      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
1 subset_ROW4(arr, i, FALSE)   28.9µs   34.9µs   32.1µs   1.36ms    28686.    5.05KB     5
2 subset_ROW4(arr, i, TRUE)    28.9µs     35µs   32.4µs 875.11µs    28572.    5.05KB     5

And on subsequent reps the lead switches back and forth.


Chuck
