Camilo Mora
2015-Feb-05 05:59 UTC
[R] Fastest way to calculate quantile in large data.table
Hi everyone, I have a data.table with 200 columns and few million rows and am trying to calculate the .1 and .9 quantiles for each row across all 200 columns. I have found different ways to do this, all with different performances. The examples I used are below. I wonder whether there is a faster way to do this? Thanks and best, Camilo library(data.table) v <- data.table(x=runif(10000),x2 = runif(10000), x3=runif(10000),x4=runif(10000)) v[,Names:=rownames(v)] #test 1 using .SD but not .SDcols Sys.time()->StartTEST1 v[, as.list(quantile(.SD,c(.1,.90),na.rm=TRUE)), by=Names] Sys.time()->EndTEST1 #test 2 using .SD and .SDcols Sys.time()->StartTEST2 v[, as.list(quantile(.SD,c(.1,.90),na.rm=TRUE)), by=Names,.SDcols=1:4] Sys.time()->EndTEST2 #test 3 using colnames directly. This is the fastest I found Sys.time()->StartTEST3 v[, as.list(quantile(c(x , x2, x3, x4 ),c(.1,.90),na.rm=TRUE)), by=Names] Sys.time()->EndTEST3 # melting the database and doing quantile by summary. This is the second fastest, which is ironic given that the database has to be melted first library(reshape2) Sys.time()->StartTEST4 vs<-melt(v) vs[, as.list(quantile(value,c(.1,.90),na.rm=TRUE)), by=Names] Sys.time()->EndTEST4 EndTEST1-StartTEST1 EndTEST2-StartTEST2 EndTEST3-StartTEST3 EndTEST4-StartTEST4 [[alternative HTML version deleted]]