I often need to work with large vectors whose distribution I want to
summarize by Q-Q plots. Since the vectors are large, I use a subset
of quantiles, e.g.
quantile(x, probs = ppoints(1000))
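For instance, an approximate normal Q-Q plot of a large sample can be
drawn with something like (a sketch; x is any large numeric vector):

  p <- ppoints(1000)
  plot(qnorm(p), quantile(x, probs = p))  # theoretical vs sample quantiles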
Unfortunately, this seemed to be taking too long for large x (much
longer than 'sort' itself). I initially thought quantile might be doing
something sophisticated (which I don't really need with a large data
set), so I decided to write something simple myself. I did, but then
noticed that 'quantile' was doing essentially the same thing, except
that it was calling 'sort' with a non-NULL 'partial' argument.
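(One can check this by printing the function, e.g. with

  getAnywhere("quantile.default")

which, at least for the default type, ends up calling 'sort' with a
'partial' argument of unique(c(lo, hi)).)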
?sort says:
     If 'partial' is not 'NULL', it is taken to contain indices of
     elements of 'x' which are to be placed in their correct positions
     by partial sorting. After the sort, the values specified in
     'partial' are in their correct position in the sorted array. Any
     values smaller than these values are guaranteed to have a smaller
     index in the sorted array and any values which are greater are
     guaranteed to have a bigger index in the sorted array. This is
     included for efficiency, and many of the options are not available
     for partial sorting.
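The contract is easy to see on a small example:

  x <- c(5, 2, 9, 1, 7)
  sort(x, partial = 3)
  ## the 3rd element of the result is 5, the 3rd order statistic;
  ## elements before it are <= 5, elements after it are >= 5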
However, rather than being efficient, this seems to slow things down
considerably (I haven't checked memory efficiency). Consider the
following code (imitating what 'quantile' does by default):
probs <- ppoints(1000)
for (i in seq(5000, 50000, 5000))
{
    x <- rnorm(i)
    n <- length(x)
    ## type-7 quantile positions, as in quantile()
    index <- 1 + (n - 1) * probs
    lo <- floor(index)
    hi <- ceiling(index)
    keep <- as.integer(unique(c(lo, hi)))
    cat(system.time(y1 <- sort(x, partial = keep))[1])  # user time, partial sort
    cat("\t")
    cat(system.time(y2 <- sort(x))[1])                  # user time, full sort
    cat("\t")
    cat(round(max(abs(y1 - y2)), digits = 3))           # y1 is only partially sorted
    cat("\t")
    cat(max(abs(y1[keep] - y2[keep])))                  # should be 0 at 'keep'
    cat("\n")
}
The first two columns of the output are user times for 'sort' with and
without 'partial'; the last two columns are a rough check that partial
sorting does what it claims. With R-2.2:
0.78    0       0.031   0
1.64    0.01    0.565   0
2.59    0.01    0.646   0
3.44    0.01    0.487   0
4.4     0.01    0.569   0
5.26    0.01    0.642   0
6.29    0.01    1.071   0
7.18    0.02    0.566   0
8.18    0.02    1.094   0
9.01    0.03    0.89    0

(Output from R 2.2.0 Patched, 2005-10-16, svn rev 35911, on
i686-pc-linux-gnu.)
The same behaviour holds with r-devel on a faster machine:
0.39    0       0.176   0
0.85    0.01    0.62    0
1.32    0       0.193   0
1.8     0       0.949   0
2.29    0       0.73    0
2.77    0       1.185   0
3.25    0.01    0.813   0
3.75    0.01    1.171   0
4.21    0.01    0.827   0
4.74    0.01    0.372   0

(Output from R 2.3.0 Under development (unstable), 2005-11-25, svn rev
36468, on x86_64-unknown-linux-gnu.)
The timings improve when the number of 'partial' indices is small, but
even with only 10 indices plain 'sort' is faster.
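For instance, repeating the comparison with just 10 arbitrary indices
(a sketch along the lines of the loop above):

  x <- rnorm(50000)
  keep <- sort(sample(length(x), 10))   # 10 arbitrary positions
  system.time(sort(x, partial = keep))  # still slower than ...
  system.time(sort(x))                  # ... a plain full sort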
Am I missing something? I haven't checked whether NAs make a
difference, or whether there are memory usage issues, but even so,
could 'quantile' at least have an option to disable partial sorting?
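In the meantime, a plain full sort plus the usual type-7 interpolation
works as a stopgap (a sketch; 'quantile2' is just a made-up name):

  quantile2 <- function(x, probs)
  {
      n <- length(x)
      xs <- sort(x)                  # full sort, no 'partial'
      index <- 1 + (n - 1) * probs   # type-7 positions
      lo <- floor(index)
      hi <- ceiling(index)
      xs[lo] + (index - lo) * (xs[hi] - xs[lo])
  }

  ## quantile2(x, ppoints(1000)) should agree with
  ## quantile(x, probs = ppoints(1000), names = FALSE)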
Deepayan