You can time it yourself on increasingly large subsets of your data. E.g.,
> library(quantreg)
> dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
+                   x3=sample(c("A","B","C"), size=1e6, replace=TRUE))
> dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
> t <- vapply(n <- 4^(3:10), FUN=function(n) {
+     d <- dat[seq_len(n), ]
+     print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))
+ }, FUN.VALUE=numeric(5))
user system elapsed
0 0 0
user system elapsed
0 0 0
user system elapsed
0.02 0.00 0.01
user system elapsed
0.01 0.00 0.02
user system elapsed
0.10 0.00 0.11
user system elapsed
1.09 0.00 1.10
user system elapsed
13.05 0.02 13.07
user system elapsed
273.30 0.11 273.74
> t
           [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
user.child   NA   NA   NA   NA   NA   NA    NA     NA
sys.child    NA   NA   NA   NA   NA   NA    NA     NA
Do some regressions on t["elapsed",] as a function of n and predict up to
n=10^7. E.g.,
> summary(lm(t["elapsed",] ~ poly(n,4)))
Call:
lm(formula = t["elapsed", ] ~ poly(n, 4))
Residuals:
         1          2          3          4          5          6          7          8
-2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.003565 on 3 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.273e+09 on 4 and 3 DF, p-value: 3.575e-14
It does not look good for n=10^7: elapsed time already jumps from about 13
seconds at n=4^9 to about 274 seconds at n=4^10.
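One rough way to do the prediction is a log-log fit, sketched below. This is only an illustration, not the only reasonable model: it assumes the run time follows a power law in n, and it uses just the three largest sizes because the smaller timings are at or near the resolution of system.time(). The names timings, big, exponent and pred_seconds are made up for this example, and the numbers are copied from the run above.

```r
# Elapsed seconds copied from the timing run above, for n = 4^(3:10).
timings <- data.frame(
  n       = 4^(3:10),
  elapsed = c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)
)

# Assumption: only the three largest sizes carry usable signal.
big <- subset(timings, n >= 4^8)

# Assumption: elapsed ~ a * n^b, so log(elapsed) is linear in log(n).
fit <- lm(log(elapsed) ~ log(n), data = big)

# Estimated growth exponent b (slope on the log-log scale).
exponent <- unname(coef(fit)[2])

# Extrapolated elapsed time in seconds at n = 10^7; treat as a rough guess,
# since it comes from three points and goes well beyond the observed range.
pred_seconds <- unname(exp(predict(fit, newdata = data.frame(n = 1e7))))

print(exponent)
print(pred_seconds / 3600)  # same prediction expressed in hours
```

An exponent well above 1 means doubling n more than doubles the run time, so timing a few larger subsets on your own machine before committing to the full data is worthwhile.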
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
> Hi all,
>
> I'm using quantreg rq() to perform quantile regression on a large data set.
> Each record has 4 fields and there are about 18 million records in total. I
> wonder if anyone has tried rq() on a large dataset and how long I should
> expect it to finish. Or it is simply too large and I should subsample the
> data. I would like to have an idea before I start to run and wait forever.
>
> In addition, I will appreciate if anyone could give me an idea how long it
> takes for rq() to run approximately for certain dataset size.
>
> Yunqi