Dear R users, I am looking for a reference to an algorithm for estimation of sample quantiles which does not require bringing the whole data into memory (more precisely its memory complexity should be much less than linear, ideally constant). I realize that such an algorithm can only be approximate and actually quite wrong for some samples, but that's fine with me. Thank you, Vadim -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Prof Brian D Ripley
2001-Apr-03 18:54 UTC
[R] single-pass algorithm for quantile calculation
On Tue, 3 Apr 2001, Vadim Ogranovich wrote:> Dear R users, I am looking for a reference to an algorithm for estimation of > sample quantiles which does not require bringing the whole data into memory > (more precisely its memory complexity should be much less than linear, > ideally constant). I realize that such an algorithm can only be approximate > and actually quite wrong for some samples, but that's fine with me.There is no approximately accurate algorithm (consider a sorted dataset) that does not look at the whole file. Try random subsampling (in whatever database holds your file, or by bringing in chunks and sampling while in memory). -- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272860 (secr) Oxford OX1 3TG, UK Fax: +44 1865 272595 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
On Tue, 3 Apr 2001 13:31:32 -0500 , you wrote in message <AFD78192EC49D311BFAE00902798AB8F0BDEA0 at jupiter.sc.arbitrade.com>:>Dear R users, I am looking for a reference to an algorithm for estimation of >sample quantiles which does not require bringing the whole data into memory >(more precisely its memory complexity should be much less than linear, >ideally constant). I realize that such an algorithm can only be approximate >and actually quite wrong for some samples, but that's fine with me.What about a single pass SRS algorithm to get a fixed size subsample, then approximate the quantiles of the real sample with the quantiles of the subsample? Duncan Murdoch -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
As it happens, Gib Bassett recently mentioned to me a rather extreme version of a procedure like this. Consider the following estimator of the median: Given iid observations arranged as a matrix {X_ij: i=1,...r, j=1,...,c} with rc=n, let betahat_n = min_j {max_i {X_ij}}. If you let r = r_n = [log_2 n] you can show that that betahat is a median unbiased estimator of the median and you need only make one pass through the data, with two storage locations (one for the current column max and the other for the current row min and you need only n + [log_2 n] comparisons. Of course from a statistical standpoint the procedure is awful, converging in distribution at rate 1/log_2 n. A somewhat less extreme version of this was suggested in Pearl, J. (1981) A space efficient on-line method of computing quantile estimates, J. of Algs, 2, 164-177. url: http://www.econ.uiuc.edu Roger Koenker email roger at ysidro.econ.uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820 On Tue, 3 Apr 2001, Vadim Ogranovich wrote:> Dear R users, I am looking for a reference to an algorithm for estimation of > sample quantiles which does not require bringing the whole data into memory > (more precisely its memory complexity should be much less than linear, > ideally constant). I realize that such an algorithm can only be approximate > and actually quite wrong for some samples, but that's fine with me. > > Thank you, > Vadim > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- > r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html > Send "info", "help", or "[un]subscribe" > (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._ >-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._