Zheng, Xin (NIH) [C]
2009-Apr-14 04:29 UTC
[R] any other fast method for median calculation
Hi there,

I have a data frame with more than 200k columns. How can I compute the median of each column quickly? mapply is the fastest function I know of for this, but it is still not fast enough.

It seems that the function "median" in R calculates the median using "sort" and "mean". I am wondering whether there is another function with a better algorithm.

Any hints?

Thanks,
Xin Zheng
Sorting with an appropriate algorithm is O(n log n), so it's very hard to get the 'exact' median any faster. However, if you can cope with a less precise median, you could use a binary search between min(x) and max(x) with a coarse tolerance or comparatively few iterations. In native R, though, that isn't going to be fast; interpreter overhead will likely more than wipe out any reduction in the number of comparisons.

In any case, it looks like you are constrained not by the median algorithm but by the number of calls. You might do a lot better with apply, though:

> apply(df, 2, median)

On my system 200k columns were processed in negligible time by apply, and I'm still waiting for mapply.

S

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
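For readers who want to try the binary-search idea from the reply above, here is a minimal sketch. The function name approx_median and its arguments tol and max_iter are made up for illustration and do not come from any package. It assumes a numeric vector with no missing values, and for an even number of values it converges to the lower of the two middle values rather than their average, so it is an approximation in that sense too.

```r
# Approximate median by bisection on the value range.
# approx_median, tol and max_iter are hypothetical names, not a library API.
approx_median <- function(x, tol = 1e-6, max_iter = 100L) {
  lo <- min(x)
  hi <- max(x)
  half <- length(x) / 2
  for (i in seq_len(max_iter)) {
    if (hi - lo <= tol) break
    mid <- (lo + hi) / 2
    if (sum(x <= mid) >= half) {
      hi <- mid  # at least half of the values are <= mid, so the target is <= mid
    } else {
      lo <- mid  # fewer than half of the values are <= mid, so the target is > mid
    }
  }
  (lo + hi) / 2
}

# Made-up data: 101 rows, 50 columns. apply() loops over the columns.
set.seed(1)
m <- matrix(rnorm(101 * 50), nrow = 101)
med_approx <- apply(m, 2, approx_median)
med_exact <- apply(m, 2, median)
max(abs(med_approx - med_exact))  # largest discrepancy across columns
```

Each iteration makes a full pass over x inside sum(), so, as the reply warns, this is unlikely to beat median() in plain R; it is mainly useful for seeing how the tolerance trades accuracy for iterations.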
There is a slightly faster algorithm in my quantreg package, see kuantile(), but the difference is only significant when sample sizes are very large. In your case you really need a wrapper that keeps the loop over columns within some lower-level language.

Roger Koenker
Department of Economics
University of Illinois
Champaign, IL 61820
url: www.econ.uiuc.edu/~roger
email: rkoenker at uiuc.edu
vox: 217-333-4558
fax: 217-244-6678
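One way to approximate the "keep the loop in lower-level code" advice without leaving R is to sort all columns with a single order() call, so that the per-column work happens inside R's compiled internals. The sketch below is an illustration under stated assumptions, not the wrapper Roger has in mind: col_medians is a hypothetical name, and the input is assumed to be numeric, free of NAs, and convertible to a matrix.

```r
# Column medians with one order() call instead of one median() call per column.
# col_medians is a hypothetical helper, not part of base R or quantreg.
col_medians <- function(m) {
  m <- as.matrix(m)
  stopifnot(is.numeric(m), !anyNA(m), nrow(m) > 0)
  n <- nrow(m)
  # Sort by column index first and by value second, so each block of n
  # consecutive entries holds one column in ascending order.
  o <- order(col(m), m)
  sorted <- matrix(m[o], nrow = n)
  lo <- (n + 1) %/% 2  # lower middle row
  hi <- (n + 2) %/% 2  # upper middle row (same as lo when n is odd)
  (sorted[lo, ] + sorted[hi, ]) / 2
}

# Made-up data for a quick comparison against apply().
set.seed(2)
m_even <- matrix(rnorm(6 * 40), nrow = 6)
m_odd <- matrix(rnorm(7 * 40), nrow = 7)
all.equal(col_medians(m_even), apply(m_even, 2, median))
all.equal(col_medians(m_odd), apply(m_odd, 2, median))
```

Whether this is actually faster than apply(df, 2, median) depends on the shape of the data, since it sorts everything in one large pass and builds a full sorted copy in memory; timing both with system.time() on the real data is the only reliable check.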