Suharto Anggono
2017-Oct-01 16:39 UTC
[Rd] Revert to R 3.2.x code of logicalSubscript in subscript.c?
Currently, in function 'logicalSubscript' in subscript.c, the case of no recycling is handled like the implementation of R function 'which'. It passes through the data only once, but uses more memory. This has been so since R 3.3.0. For the case of recycling, two passes are done, the first to get the number of elements in the result.

Also since R 3.3.0, function 'makeSubscript' in subscript.c doesn't call 'duplicate' on a logical index vector.

A side note: I guess that it is safe not to call 'duplicate' on a logical index vector, even if it is the one being modified in subassignment, because it is converted to positive indices before use in extraction or replacement. If so, isn't that true for a character index vector as well?

Here are examples of subsetting a numeric vector of length 10^8 with a logical index vector, inspired by Hong Ooi's answer in https://stackoverflow.com/questions/17510778/why-is-subsetting-on-a-logical-type-slower-than-subsetting-on-numeric-type . I present two extreme cases, each with no-recycling and recycling versions that convert to the same positive indices. The difference between the two versions can be attributed to function 'logicalSubscript'.

Example 1: select none

x <- numeric(1e8)
i <- rep(FALSE, length(x))  # no recycling
system.time(x[i])
system.time(x[i])
i <- FALSE  # recycling
system.time(x[i])
system.time(x[i])

Output:
   user  system elapsed
  0.083   0.000   0.083
   user  system elapsed
  0.085   0.000   0.085
   user  system elapsed
  0.144   0.000   0.144
   user  system elapsed
  0.143   0.000   0.144

Example 2: select all

x <- numeric(1e8)
i <- rep(TRUE, length(x))  # no recycling
system.time(x[i])
system.time(x[i])
i <- TRUE  # recycling
system.time(x[i])
system.time(x[i])

Output:
   user  system elapsed
  0.538   0.741   1.292
   user  system elapsed
  0.506   0.668   1.175
   user  system elapsed
  0.448   0.534   0.986
   user  system elapsed
  0.431   0.528   0.960

The results were from R 3.3.2 on http://rextester.com/l/r_online_compiler .
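To make the two strategies described above concrete, here is a minimal C sketch. It is not the code in subscript.c: the names which_one_pass and which_two_pass are made up for illustration, the index vector is a plain int array of 0/1 values, NA handling and long vectors are left out, and malloc stands in for R's own allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: one pass over the mask, in the style of 'which'.
   Selected 1-based positions go into a scratch buffer as long as the mask;
   the used prefix is then copied into an exactly sized result.
   Returns a malloc'ed array (caller frees); its length is stored in *count. */
int *which_one_pass(const int *mask, size_t n, size_t *count)
{
    int *buf = malloc((n ? n : 1) * sizeof *buf);
    if (buf == NULL) { *count = 0; return NULL; }

    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i]) buf[k++] = (int)(i + 1);

    int *out = malloc((k ? k : 1) * sizeof *out);
    if (out != NULL) memcpy(out, buf, k * sizeof *out);
    free(buf);
    *count = (out != NULL) ? k : 0;
    return out;
}

/* Hypothetical helper: two passes, with the mask of length ns recycled
   to length nx. The first pass only counts, the second fills the result,
   so no scratch buffer is needed. */
int *which_two_pass(const int *mask, size_t ns, size_t nx, size_t *count)
{
    assert(ns > 0);

    size_t k = 0;
    for (size_t i = 0; i < nx; i++)          /* pass 1: count selected */
        if (mask[i % ns]) k++;

    int *out = malloc((k ? k : 1) * sizeof *out);
    if (out == NULL) { *count = 0; return NULL; }

    size_t j = 0;
    for (size_t i = 0; i < nx; i++)          /* pass 2: fill positions */
        if (mask[i % ns]) out[j++] = (int)(i + 1);

    *count = k;
    return out;
}
```

In this sketch, when everything is selected the one-pass variant holds a scratch buffer and a result of the same length and copies one into the other, while the two-pass variant allocates only the result. That might be related to the timings in example 2, though this is only a guess and has not been measured in isolation.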
In example 2, where both versions took more time than in example 1, the no-recycling version took longer than the recycling version.

Function 'logicalSubscript' in subscript.c in R 3.2.x also uses faster code for the case of no recycling, but does two passes in all cases. Its treatment of the case of recycling is identical to the current code.

Function 'logicalSubscript' in subscript.c affects subsetting with negative indices, because negative indices are first converted to a logical index vector with the same length as the vector (no recycling).

Example, comparing times of x[-1] and its equivalent, x[2:length(x)] :

x <- numeric(1e8)
system.time(x[-1])
system.time(x[-1])
system.time(x[2:length(x)])
system.time(x[2:length(x)])

Output from R 3.3.2 on http://rextester.com/l/r_online_compiler :
   user  system elapsed
  0.591   0.903   1.515
   user  system elapsed
  0.558   0.822   1.384
   user  system elapsed
  0.620   0.659   1.285
   user  system elapsed
  0.607   0.663   1.274

Output from R 3.2.2 in Zeppelin Notebook, https://my.datascientistworkbench.com/tools/zeppelin-notebook/ :
   user  system elapsed
  1.156   1.636   2.794
   user  system elapsed
  0.884   1.528   2.413
   user  system elapsed
  0.932   1.544   2.476
   user  system elapsed
  0.932   1.584   2.519

From the above, x[-1] apparently took consistently longer than x[2:length(x)] with R 3.3.2, but not with R 3.2.2.

So, how about reverting to the R 3.2.x code of function 'logicalSubscript' in subscript.c?
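For readers unfamiliar with the conversion of negative indices mentioned above, here is a minimal C sketch of the idea. It is an illustration only, not the code in subscript.c: the function name negative_to_mask is hypothetical, the mask is a plain int array, and malloc stands in for R's allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: build a full-length 0/1 mask for x[neg], where neg
   holds non-positive 1-based indices such as {-1}. Every position starts
   out selected; each excluded position is then cleared. Zeros and
   out-of-range indices are ignored. Returns a malloc'ed array of length nx
   (caller frees), or NULL on allocation failure. */
int *negative_to_mask(const int *neg, size_t nneg, size_t nx)
{
    int *mask = malloc((nx ? nx : 1) * sizeof *mask);
    if (mask == NULL) return NULL;

    for (size_t i = 0; i < nx; i++)
        mask[i] = 1;                          /* select everything */

    for (size_t j = 0; j < nneg; j++) {
        assert(neg[j] <= 0);
        size_t pos = (size_t)(-(long long)neg[j]);
        if (pos >= 1 && pos <= nx)
            mask[pos - 1] = 0;                /* drop excluded position */
    }
    return mask;
}
```

The resulting mask always has the same length as the vector being subset, so it takes the no-recycling path, and for x[-1] nearly every element is selected, which is close to the "select all" case of example 2.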