thr3ads.net - R devel - [Rd] Subsetting the "ROW"s of an object [Jun 2018]

Berry, Charles

2018-Jun-08 19:31 UTC

[Rd] Subsetting the "ROW"s of an object

> On Jun 8, 2018, at 11:52 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
> 
> On Fri, Jun 8, 2018 at 11:38 AM, Berry, Charles <ccberry at ucsd.edu> wrote:
>> 
>> 
>>> On Jun 8, 2018, at 10:37 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>> 
>>> Also the TRUEs cause problems if some dimensions are 0:
>>> 
>>>> matrix(raw(0), nrow=5, ncol=0)[1:3 , TRUE]
>>> Error in matrix(raw(0), nrow = 5, ncol = 0)[1:3, TRUE] :
>>>   (subscript) logical subscript too long
>> 
>> OK. But this is easy enough to handle.
>> 
>>> 
>>> H.
>>> 
>>> On 06/08/2018 10:29 AM, Hadley Wickham wrote:
>>>> I suspect this will have suboptimal performance since the TRUEs will
>>>> get recycled. (Maybe there is, or could be, ALTREP support for
>>>> recycling)
>>>> Hadley
>> 
>> 
>> AFAICS, it is not an issue. Taking
>> 
>> arr <- array(rnorm(2^22),c(2^10,4,4,4))
>> 
>> as a test case
>> 
>> and using a function that will either use the literal code `x[i,,,,drop=FALSE]' or `eval(mc)':
>> 
>> subset_ROW4 <-
>>     function(x, i, useLiteral=FALSE)
>> {
>>    literal <- quote(x[i,,,,drop=FALSE])
>>    mc <- quote(x[i])
>>    nd <- max(1L, length(dim(x)))
>>    mc[seq(4,length=nd-1L)] <- rep(TRUE, nd-1L)
>>    mc[["drop"]] <- FALSE
>>    if (useLiteral)
>>        eval(literal)
>>    else
>>        eval(mc)
>> }
>> 
>> I get identical times with
>> 
>> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),TRUE))
>> 
>> and with
>> 
>> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),FALSE))
> 
> I think that's because you used a relatively low-precision timing
> mechanism, and included the index generation in the timing. I see:
> 
> arr <- array(rnorm(2^22),c(2^10,4,4,4))
> i <- seq(1,length = 10, by = 100)
> 
> bench::mark(
>  arr[i, TRUE, TRUE, TRUE],
>  arr[i, , , ]
> )
> #> # A tibble: 2 x 1
> #>   expression        min    mean   median      max  n_gc
> #>   <chr>         <bch:t> <bch:t> <bch:tm> <bch:tm> <dbl>
> #> 1 arr[i, TRUE,…   7.4µs  10.9µs  10.66µs   1.22ms     2
> #> 2 arr[i, , , ]   7.06µs   8.8µs   7.85µs 538.09µs     2
> 
> So not a huge difference, but it's there.

Funny. I get similar results to yours above albeit with smaller differences.
Usually < 5 percent.

But with subset_ROW4 I see no consistent difference.

In this example, it runs faster on average using `eval(mc)' to return the
result:

> arr <- array(rnorm(2^22),c(2^10,4,4,4))
> i <- seq(1,length=10,by=100)
> bench::mark(subset_ROW4(arr,i,FALSE), subset_ROW4(arr,i,TRUE))[,1:8]
# A tibble: 2 x 8
  expression                      min     mean   median      max `itr/sec` mem_alloc  n_gc
  <chr>                      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
1 subset_ROW4(arr, i, FALSE)   28.9µs   34.9µs   32.1µs   1.36ms    28686.    5.05KB     5
2 subset_ROW4(arr, i, TRUE)    28.9µs     35µs   32.4µs 875.11µs    28572.    5.05KB     5

And on subsequent reps the lead switches back and forth.


Chuck

Hadley Wickham

2018-Jun-08 20:49 UTC

Hmmm, yes, there must be some special case in the C code to avoid
recycling a length-1 logical vector:

dims <- c(4, 4, 4, 1e5)

arr <- array(rnorm(prod(dims)), dims)
dim(arr)
#> [1]      4      4      4 100000
i <- c(1, 3)

bench::mark(
  arr[i, TRUE, TRUE, TRUE],
  arr[i, , , ]
)[c("expression", "min", "mean", "max")]
#> # A tibble: 2 x 4
#>   expression                    min     mean      max
#>   <chr>                    <bch:tm> <bch:tm> <bch:tm>
#> 1 arr[i, TRUE, TRUE, TRUE]   41.8ms   43.6ms   46.5ms
#> 2 arr[i, , , ]               41.7ms   43.1ms   46.3ms


-- 
http://hadley.nz

Michael Lawrence

2018-Jun-08 20:56 UTC

Actually, it's sort of the opposite. Everything becomes a sequence of
integers internally, even when the argument is missing. So the same
amount of work is done, basically. ALTREP will let us improve this
sort of thing.

Michael

> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Hervé Pagès

2018-Jun-08 21:01 UTC

The C code for subsetting doesn't need to recycle a logical subscript.
It only needs to walk on it and start again at the beginning of the
vector when it reaches the end. Not exactly the same as detecting the
"take everything along that dimension" situation though.
x[TRUE, TRUE, TRUE] triggers the full subsetting machinery when x[]
and x[ , , ] could (and should) easily avoid it.
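
To make the comparison concrete, here is a minimal sketch (not part of the original message) that checks the three spellings on a small array and revisits the zero-extent case raised earlier in the thread. The object names `x`, `m`, and `res` are illustrative only. The sketch compares results; it says nothing about which internal code path R takes.

```r
# Illustrative objects; the names are assumptions, not from the thread.
x <- array(1:24, dim = c(2, 3, 4))

# All three spellings select every element along every dimension.
same_true    <- identical(x[TRUE, TRUE, TRUE], x)
same_missing <- identical(x[, , ], x)
same_empty   <- identical(x[], x)

# With a zero-extent dimension, a missing subscript still works ...
m <- matrix(raw(0), nrow = 5, ncol = 0)
dim_missing <- dim(m[1:3, , drop = FALSE])

# ... whereas a TRUE subscript was reported earlier in the thread to raise
# an error, so capture the outcome instead of assuming it.
res <- tryCatch(m[1:3, TRUE, drop = FALSE], error = function(e) e)
true_fails <- inherits(res, "error")
```

Inspecting `true_fails` on a given R version shows whether the "logical subscript too long" error quoted in the thread still applies there.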

H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Berry, Charles

2018-Jun-08 21:09 UTC

> On Jun 8, 2018, at 1:49 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
> 
> Hmmm, yes, there must be some special case in the C code to avoid
> recycling a length-1 logical vector:

Here is a version that (I think) handles Herve's issue of arrays having one
or more 0 dimensions.

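subset_ROW <-
    function(x,i)
{
    dims <- dim(x)
    ## positions in the call of the trailing subscripts with non-zero extent
    index_list <- which(dims[-1] != 0L) + 3
    mc <- quote(x[i])
    nd <- max(1L, length(dims))
    ## use TRUE ("take everything") for those subscripts
    mc[ index_list ] <- list(TRUE)
    ## append drop=FALSE after the last subscript
    mc[[ nd + 3L ]] <- FALSE
    names( mc )[ nd+3L ] <- "drop"
    eval(mc)
}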

Curiously enough the timing is *much* better for this implementation than for
the first version I sent.

Constructing a version of `mc' that looks like `x[i,,,,drop=FALSE]' can
be done with `alist(a=)' in place of `list(TRUE)' in the earlier version
but seems to slow things down noticeably. It requires almost twice (!!) as much
time as the version above.
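
For reference, a minimal sketch (not from the original message) of the missing-argument variant, built with `alist()` and `do.call()` instead of editing a quoted call. The function name `subset_rows_missing` is hypothetical, and no timing claim is made for it.

```r
# Hypothetical helper: select rows `i` of `x`, leaving every other
# subscript missing, i.e. the equivalent of x[i, , , drop = FALSE].
subset_rows_missing <- function(x, i) {
  nd <- max(1L, length(dim(x)))
  # alist() keeps the empty argument; rep() repeats it once per trailing
  # dimension, giving the argument list (x, i, <nd - 1 missing>, drop = FALSE).
  args <- rep(alist(x, i, , drop = FALSE), c(1L, 1L, nd - 1L, 1L))
  # The symbols x and i are resolved in this function's frame by do.call().
  do.call("[", args)
}

arr <- array(seq_len(2 * 3 * 4), dim = c(2, 3, 4))
out <- subset_rows_missing(arr, 2L)

# A zero-extent dimension needs no special handling with missing subscripts.
m <- matrix(raw(0), nrow = 5, ncol = 0)
out0 <- subset_rows_missing(m, 1:3)
```

Because the trailing subscripts are missing, not `TRUE`, this form does not need the `dims[-1] != 0L` check used above.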

Best,

Chuck
