martin@gsc.riken.jp
2004-Oct-17 08:50 UTC
[Rd] ecdf with lots of ties is inefficient (PR#7292)
Full_Name: Martin Frith
Version: R-2.0.0
OS: linux-gnu
Submission from: (NULL) (134.160.83.73)

I have large vectors containing 100,000 to 20,000,000 numbers. However, they only contain a few hundred *distinct* numbers (e.g. positive integers < 200). When I do ecdf(v), it either runs out of memory, or it succeeds, but when I plot the ecdf with postscript, the output is unnecessarily bloated because the same lines get redrawn many times. The complexity of ecdf should depend on how many distinct numbers there are, not how many total numbers.

This is my first bug report, so forgive me if I've done something stupid!
I would add that some action has to be taken in the presence of missing values, i.e.

> x <- c(1,2,2,4,7, NA, 10,12, 15,20)
> ecdf(x)
Error in xy.coords(x, y) : x and y lengths differ

stefano
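One possible workaround for the missing-value error, as a minimal sketch only: the thread does not say which action should be taken, so this assumes that simply discarding the NAs before calling ecdf() is acceptable. The variable name x_ok is made up for the illustration.

```r
# Same vector as in the report above, including one missing value.
x <- c(1, 2, 2, 4, 7, NA, 10, 12, 15, 20)

# Assumption: missing values are dropped rather than treated as an error.
x_ok <- x[!is.na(x)]

# Build the ECDF from the 9 remaining observations.
Fn <- ecdf(x_ok)

# Proportion of non-missing observations that are <= 4.
Fn(4)
```

Note that the denominator becomes the number of non-missing values, which may or may not be what a given analysis wants.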
Prof Brian Ripley
2004-Oct-17 09:24 UTC
[Rd] ecdf with lots of ties is inefficient (PR#7292)
This seems a _very_ unusual use of ecdf -- what are you using it for that a sample of size 10,000 would not do equally well?

If you have a need for a more efficient version of ecdf, please develop one and submit a patch. I don't think it would be hard as ecdf does

    x <- sort(x)
    rval <- approxfun(x, (1:n)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

_but_ it might be hard to recognize the situation you are in without much computation. Something along the lines of

    vals <- sort(unique(x))
    y <- tabulate(match(x, vals))
    rval <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

should work better for you and may be a little slower if there are no ties, but will use more memory.

A quick play suggests that the real problem is not with ecdf (at least not for me with x <- sample(1:200, 2e7, replace=TRUE)), but with plotting the result. Please investigate what might be a reasonable compromise.

-- 
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
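The tabulate-based suggestion in the message above can be tried out roughly as follows. This is a sketch, not a patch: the function name ecdf_ties is hypothetical, n is assumed to be length(x) as in ecdf itself, and dropping NAs is an added assumption. It only checks that the result agrees with ecdf() on a smaller sample than the one mentioned in the thread.

```r
# Hypothetical helper (not part of R): an ECDF built from distinct values only.
ecdf_ties <- function(x) {
  x <- x[!is.na(x)]                 # assumption: missing values are dropped
  n <- length(x)                    # total number of observations
  vals <- sort(unique(x))           # distinct values, in increasing order
  # Count how many observations fall on each distinct value.
  counts <- tabulate(match(x, vals), nbins = length(vals))
  # Right-continuous step function with one knot per distinct value.
  approxfun(vals, cumsum(counts) / n, method = "constant",
            yleft = 0, yright = 1, f = 0, ties = "ordered")
}

set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)   # many ties, few distinct values

Fn_ties <- ecdf_ties(x)
Fn_ref  <- ecdf(x)

# Compare the two at points below, inside, between and above the data.
q <- c(0, 1, 57.5, 200, 250)
all.equal(Fn_ties(q), Fn_ref(q))
```

The interpolation table here has one entry per distinct value rather than one per observation, which is the property the original report asks for.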
p.dalgaard@biostat.ku.dk
2004-Oct-17 11:27 UTC
[Rd] ecdf with lots of ties is inefficient (PR#7292)
Prof Brian Ripley <ripley@stats.ox.ac.uk> writes:

> vals <- sort(unique(x))
> y <- tabulate(match(x, vals))
> rval <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
>                   yright = 1, f = 0, ties = "ordered")
>
> should work better for you and may be little slower if there are no ties,
> but will use more memory.

...and if all you need is the plot, continue Brian's code with

    Fv <- c(0,cumsum(y))/sum(y)
    xx <- c(vals[1],vals)
    plot(xx, Fv, type="s")

which might well be close enough for your purposes. Or, of course,

    Fs <- stepfun(vals,c(0,cumsum(y)/sum(y)))
    plot(Fs,verticals=FALSE)

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)              FAX: (+45) 35327907
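For anyone wanting to check the plotting side, here is a self-contained sketch of the stepfun variant written to a PostScript file. The sample size, the temporary file name and the do.points = FALSE setting are choices made for this illustration, not something stated in the thread.

```r
set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)   # stand-in for the large tied vector

vals <- sort(unique(x))                   # distinct values
y <- tabulate(match(x, vals))             # count per distinct value

# Step function with one jump per distinct value.
Fs <- stepfun(vals, c(0, cumsum(y) / sum(y)))

# Write the plot to a temporary PostScript file (file name is arbitrary).
ps_file <- tempfile(fileext = ".ps")
postscript(ps_file)
plot(Fs, verticals = FALSE, do.points = FALSE)
dev.off()

# The step function carries one knot per distinct value, not per observation.
length(knots(Fs))
```

Since the step function is built from the distinct values, the number of segments drawn should scale with length(vals) rather than with length(x).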