thr3ads.net - R devel - [Rd] ecdf with lots of ties is inefficient (PR#7292) [Oct 2004]

martin@gsc.riken.jp

2004-Oct-17 08:50 UTC

[Rd] ecdf with lots of ties is inefficient (PR#7292)

Full_Name: Martin Frith
Version: R-2.0.0
OS: linux-gnu
Submission from: (NULL) (134.160.83.73)


I have large vectors containing 100,000 to 20,000,000 numbers. However, they
only contain a few hundred *distinct* numbers (e.g. positive integers < 200).
When I do ecdf(v), it either runs out of memory, or it succeeds, but when I plot
the ecdf with postscript, the output is unnecessarily bloated because the same
lines get redrawn many times. The complexity of ecdf should depend on how many
distinct numbers there are, not how many total numbers.

This is my first bug report, so forgive me if I've done something stupid!

stefano iacus

2004-Oct-17 09:06 UTC

I would add that some action has to be taken in the presence of missing
values, e.g.

 > x <- c(1,2,2,4,7, NA, 10,12, 15,20)
 > ecdf(x)
Error in xy.coords(x, y) : x and y lengths differ
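
One possible workaround until ecdf itself decides what to do with missing values is to drop them before the call. This is only a sketch under that assumption; the thread does not settle whether ecdf should drop NAs, warn, or stop. The names x_ok and Fn are illustrative only.

```r
x <- c(1, 2, 2, 4, 7, NA, 10, 12, 15, 20)

# Drop missing values explicitly, so the step heights are computed
# from the non-missing observations only.
x_ok <- x[!is.na(x)]

Fn <- ecdf(x_ok)

# Proportion of the non-missing values that are <= 7
# (5 of the 9 non-missing values).
Fn(7)
```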

stefano


Prof Brian Ripley

2004-Oct-17 09:24 UTC

This seems a _very_ unusual use of ecdf -- what are you using it for that
a sample of size 10,000 would not do equally well?

If you have a need for a more efficient version of ecdf, please develop 
one and submit a patch.  I don't think it would be hard as ecdf does

    x <- sort(x)
    rval <- approxfun(x, (1:n)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

_but_ it might be hard to recognize the situation you are in without much
computation.  Something along the lines of

    vals <- sort(unique(x))
    y <- tabulate(match(x, vals))
    rval <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

should work better for you and may be a little slower if there are no ties,
but will use more memory.
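
The fragment above can be wrapped into a self-contained function for experimenting. This is a minimal sketch, not a patch to ecdf: the name ecdf_ties is hypothetical, n is assumed to be length(x) as in the ecdf source quoted earlier, and missing values are assumed to be absent.

```r
# Hypothetical helper (not part of R): build the step function from the
# distinct values and their counts instead of from every observation.
ecdf_ties <- function(x) {
  n <- length(x)
  if (n < 1) stop("'x' must have at least one value")
  vals <- sort(unique(x))            # distinct values, ascending
  y <- tabulate(match(x, vals))      # number of observations at each value
  approxfun(vals, cumsum(y) / n, method = "constant",
            yleft = 0, yright = 1, f = 0, ties = "ordered")
}

set.seed(42)
x <- sample(1:200, 1e5, replace = TRUE)   # many ties, few distinct values

F1 <- ecdf(x)
F2 <- ecdf_ties(x)

# Compare the two at a few points, including points outside the data range.
q <- c(0, 1, 57.5, 100, 200, 250)
all.equal(F1(q), F2(q))
```

The returned function has one knot per distinct value, so its size depends on the number of distinct values rather than on length(x).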

A quick play suggests that the real problem is not with ecdf (at least not
for me with x <- sample(1:200, 2e7, replace=TRUE)), but with plotting the
result.  Please investigate what might be a reasonable compromise.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

p.dalgaard@biostat.ku.dk

2004-Oct-17 11:27 UTC

Prof Brian Ripley <ripley@stats.ox.ac.uk> writes:
>     vals <- sort(unique(x))
>     y <- tabulate(match(x, vals))
>     rval <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
>                       yright = 1, f = 0, ties = "ordered")
> 
> should work better for you and may be a little slower if there are no ties,
> but will use more memory.
...and if all you need is the plot, continue Brian's code with

  Fv <- c(0,cumsum(y))/sum(y)
  xx <- c(vals[1],vals)
  plot(xx, Fv, type="s")

which might well be close enough for your purposes. Or, of course,

  Fs <- stepfun(vals,c(0,cumsum(y)/sum(y)))
  plot(Fs,verticals=FALSE)
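
To see the effect on the postscript output, the collapsed plot can be compared with a plot that draws one step per observation. This is a rough sketch: ps_size is a hypothetical helper written for this illustration, the data are simulated, and the uncollapsed plot only stands in for what a per-observation step function would draw.

```r
set.seed(1)
v <- sample(1:200, 1e5, replace = TRUE)   # many ties, few distinct values
n <- length(v)

# Collapse to distinct values and counts, as in the code above.
vals <- sort(unique(v))
y <- tabulate(match(v, vals))
Fv <- c(0, cumsum(y)) / sum(y)
xx <- c(vals[1], vals)

# Hypothetical helper: run draw() on a postscript device and return
# the size of the resulting file in bytes.
ps_size <- function(draw) {
  f <- tempfile(fileext = ".ps")
  postscript(f)
  draw()
  dev.off()
  size <- file.info(f)$size
  unlink(f)
  size
}

# One step per observation versus one step per distinct value.
size_full <- ps_size(function() plot(sort(v), (1:n) / n, type = "s"))
size_ties <- ps_size(function() plot(xx, Fv, type = "s"))

c(full = size_full, ties = size_ties)
```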

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
