> On Jun 8, 2018, at 11:52 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
>
> On Fri, Jun 8, 2018 at 11:38 AM, Berry, Charles <ccberry at ucsd.edu> wrote:
>>
>>
>>> On Jun 8, 2018, at 10:37 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>>
>>> Also the TRUEs cause problems if some dimensions are 0:
>>>
>>>> matrix(raw(0), nrow=5, ncol=0)[1:3 , TRUE]
>>> Error in matrix(raw(0), nrow = 5, ncol = 0)[1:3, TRUE] :
>>>   (subscript) logical subscript too long
>>
>> OK. But this is easy enough to handle.
>>
>>>
>>> H.
>>>
>>> On 06/08/2018 10:29 AM, Hadley Wickham wrote:
>>>> I suspect this will have suboptimal performance since the TRUEs will
>>>> get recycled. (Maybe there is, or could be, ALTREP, support for
>>>> recycling)
>>>> Hadley
>>
>>
>> AFAICS, it is not an issue. Taking
>>
>> arr <- array(rnorm(2^22),c(2^10,4,4,4))
>>
>> as a test case
>>
>> and using a function that will either use the literal code `x[i,,,,drop=FALSE]' or `eval(mc)':
>>
>> subset_ROW4 <-
>>     function(x, i, useLiteral=FALSE)
>> {
>>     literal <- quote(x[i,,,,drop=FALSE])
>>     mc <- quote(x[i])
>>     nd <- max(1L, length(dim(x)))
>>     mc[seq(4,length=nd-1L)] <- rep(TRUE, nd-1L)
>>     mc[["drop"]] <- FALSE
>>     if (useLiteral)
>>         eval(literal)
>>     else
>>         eval(mc)
>> }
>>
>> I get identical times with
>>
>> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),TRUE))
>>
>> and with
>>
>> system.time(for (i in 1:10000) subset_ROW4(arr,seq(1,length=10,by=100),FALSE))
>
> I think that's because you used a relatively low precision timing
> mechanism, and included the index generation in the timing. I see:
>
> arr <- array(rnorm(2^22),c(2^10,4,4,4))
> i <- seq(1,length = 10, by = 100)
>
> bench::mark(
>   arr[i, TRUE, TRUE, TRUE],
>   arr[i, , , ]
> )
> #> # A tibble: 2 x 1
> #>   expression        min    mean   median      max  n_gc
> #>   <chr>         <bch:t> <bch:t> <bch:tm> <bch:tm> <dbl>
> #> 1 arr[i, TRUE,…   7.4µs  10.9µs  10.66µs   1.22ms     2
> #> 2 arr[i, , , ]   7.06µs   8.8µs   7.85µs 538.09µs     2
>
> So not a huge difference, but it's there.

Funny. I get similar results to yours above albeit with smaller differences. Usually < 5 percent.

But with subset_ROW4 I see no consistent difference.

In this example, it runs faster on average using `eval(mc)' to return the result:

> arr <- array(rnorm(2^22),c(2^10,4,4,4))
> i <- seq(1,length=10,by=100)
> bench::mark(subset_ROW4(arr,i,FALSE), subset_ROW4(arr,i,TRUE))[,1:8]
# A tibble: 2 x 8
  expression                      min     mean   median      max `itr/sec` mem_alloc  n_gc
  <chr>                      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
1 subset_ROW4(arr, i, FALSE)   28.9µs   34.9µs   32.1µs   1.36ms    28686.    5.05KB     5
2 subset_ROW4(arr, i, TRUE)    28.9µs     35µs   32.4µs 875.11µs    28572.    5.05KB     5
>

And on subsequent reps the lead switches back and forth.

Chuck
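
For readers following along, here is a minimal sketch of what the call built inside subset_ROW4 looks like before it is evaluated. The helper name `build_row_call` and the small test array are made up for illustration and do not appear in the thread; the construction steps themselves are copied from subset_ROW4.

```r
# Hypothetical helper (not from the thread): builds the same call that
# subset_ROW4 evaluates, but returns it unevaluated for inspection.
build_row_call <- function(nd) {
  mc <- quote(x[i])                                 # call of length 3: `[`, x, i
  mc[3L + seq_len(nd - 1L)] <- rep(TRUE, nd - 1L)   # one TRUE per trailing dimension
  mc[["drop"]] <- FALSE                             # named argument appended at the end
  mc
}

mc4 <- build_row_call(4L)
print(mc4)

# Evaluate the constructed call on a small array and compare it with the
# literal form that uses missing indices.
x <- array(seq_len(2 * 3 * 4 * 5), c(2, 3, 4, 5))
i <- 2L
same_result <- identical(eval(mc4), x[i, , , , drop = FALSE])
print(same_result)
```

The two forms select the same elements whenever every trailing dimension has non-zero extent, so any difference between them is one of speed, not of result.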
Hmmm, yes, there must be some special case in the C code to avoid
recycling a length-1 logical vector:

dims <- c(4, 4, 4, 1e5)

arr <- array(rnorm(prod(dims)), dims)
dim(arr)
#> [1]      4      4      4 100000
i <- c(1, 3)

bench::mark(
  arr[i, TRUE, TRUE, TRUE],
  arr[i, , , ]
)[c("expression", "min", "mean", "max")]
#> # A tibble: 2 x 4
#>   expression                    min     mean      max
#>   <chr>                    <bch:tm> <bch:tm> <bch:tm>
#> 1 arr[i, TRUE, TRUE, TRUE]   41.8ms   43.6ms   46.5ms
#> 2 arr[i, , , ]               41.7ms   43.1ms   46.3ms

On Fri, Jun 8, 2018 at 12:31 PM, Berry, Charles <ccberry at ucsd.edu> wrote:
> [...]

--
http://hadley.nz
Actually, it's sort of the opposite. Everything becomes a sequence of
integers internally, even when the argument is missing. So the same
amount of work is done, basically. ALTREP will let us improve this
sort of thing.

Michael

On Fri, Jun 8, 2018 at 1:49 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
> Hmmm, yes, there must be some special case in the C code to avoid
> recycling a length-1 logical vector:
> [...]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
The C code for subsetting doesn't need to recycle a logical subscript.
It only needs to walk on it and start again at the beginning of the
vector when it reaches the end. Not exactly the same as detecting the
"take everything along that dimension" situation though.
x[TRUE, TRUE, TRUE] triggers the full subsetting machinery when x[]
and x[ , , ] could (and should) easily avoid it.

H.

On 06/08/2018 01:49 PM, Hadley Wickham wrote:
> Hmmm, yes, there must be some special case in the C code to avoid
> recycling a length-1 logical vector:
> [...]

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
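
The difference between a TRUE subscript and a missing one can be checked with a short sketch like the following. The object names are made up for illustration. The error for the zero-extent case is the one reported earlier in the thread; since behaviour may differ across R versions, the code records the outcome with tryCatch() and does not assume it.

```r
# Case 1: all extents are non-zero, so TRUE and a missing index agree.
arr <- array(seq_len(24), c(2, 3, 4))
agree <- identical(arr[2, TRUE, TRUE], arr[2, , ])
print(agree)

# Case 2: a zero-extent dimension. A missing index is accepted and
# yields a 3 x 0 matrix.
m <- matrix(raw(0), nrow = 5, ncol = 0)
by_missing <- m[1:3, , drop = FALSE]
print(dim(by_missing))

# With TRUE, the thread reports "(subscript) logical subscript too long".
# tryCatch() captures the message if the running R version raises it.
by_true <- tryCatch(m[1:3, TRUE, drop = FALSE],
                    error = function(e) conditionMessage(e))
print(by_true)
```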
> On Jun 8, 2018, at 1:49 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
>
> Hmmm, yes, there must be some special case in the C code to avoid
> recycling a length-1 logical vector:

Here is a version that (I think) handles Herve's issue of arrays having one or more 0 dimensions.

subset_ROW <-
    function(x,i)
{
    dims <- dim(x)
    index_list <- which(dims[-1] != 0L) + 3
    mc <- quote(x[i])
    nd <- max(1L, length(dims))
    mc[ index_list ] <- list(TRUE)
    mc[[ nd + 3L ]] <- FALSE
    names( mc )[ nd+3L ] <- "drop"
    eval(mc)
}

Curiously enough the timing is *much* better for this implementation than for the first version I sent.

Constructing a version of `mc' that looks like `x[i,,,,drop=FALSE]' can be done with `alist(a=)' in place of `list(TRUE)' in the earlier version but seems to slow things down noticeably. It requires almost twice (!!) as much time as the version above.

Best,

Chuck
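
One way the missing-argument variant mentioned above might be written is sketched below. This is not the code that was timed in the thread: the function name `subset_ROW_missing` is hypothetical, and it uses `quote(expr = )` to produce the empty argument where the message mentions `alist(a=)`. No timing claim is attached to it.

```r
# Hypothetical sketch: build x[i, , ..., drop = FALSE] with one empty
# (missing) argument per trailing dimension, then evaluate it.
subset_ROW_missing <- function(x, i) {
  nd <- max(1L, length(dim(x)))
  # quote(expr = ) yields the empty symbol that stands for a missing index.
  empty_args <- rep(list(quote(expr = )), nd - 1L)
  mc <- as.call(c(list(as.name("["), quote(x), quote(i)),
                  empty_args,
                  list(drop = FALSE)))
  eval(mc)   # x and i are found in this function's frame
}

# Ordinary 3-d array: same result as the literal form.
arr <- array(seq_len(24), c(2, 3, 4))
res <- subset_ROW_missing(arr, 2L)
print(dim(res))

# Zero-extent dimension: a missing index needs no special handling.
m <- matrix(raw(0), nrow = 5, ncol = 0)
print(dim(subset_ROW_missing(m, 1:3)))
```

Because a missing index is valid for any extent, this form needs no separate branch for zero-length dimensions, at the cost (according to the message above) of slower call construction.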