thr3ads.net - R help - [R] removing data look-ahead, something faster. [Mar 2012]

Ben quant

2012-Mar-03 21:30 UTC

[R] removing data look-ahead, something faster.

Hello,

Thank you for your help/advice!

The issue here is speed/efficiency. I can do what I want, but it's really
slow.

The goal is to have the ability to do calculations on my data and have it
adjusted for look-ahead. I see two ways to do this:
(I'm open to more ideas. My terminology: Unadjusted = values not adjusted
for look-ahead bias; adjusted = values adjusted for look-ahead bias.)

1) I could a) do calculations on unadjusted values then b) adjust the
resulting values for look-ahead bias. Here is what I mean:
a) I could say the following using the time series of val1: [(val1 - val1 4
periods ago) / val1 4 periods ago] = resultval. ("Periods" correspond to
the z.dates in my example below.)
b) Then I would adjust the resultval for look-ahead based on val1's
associated report date.
Note: I don't think this will be the fastest.
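
A minimal sketch of step a), assuming the unadjusted values sit in a numeric matrix with one row per period and one column per series (as `ua` does below). The helper name `pct_change_lag` and the small example matrix are made up for illustration; this only shows the [(val1 - val1 4 periods ago) / val1 4 periods ago] calculation, not the look-ahead adjustment of step b).

```r
# Hypothetical helper: relative change versus the value `lag` periods earlier,
# computed for every column of a numeric matrix with periods in rows.
pct_change_lag = function(m, lag = 4) {
  n = nrow(m)
  stopifnot(lag >= 1, lag < n)
  # shift the matrix down by `lag` rows, padding the top with NA
  prev = rbind(matrix(NA_real_, lag, ncol(m)),
               m[seq_len(n - lag), , drop = FALSE])
  out = (m - prev) / prev
  dimnames(out) = dimnames(m)
  out
}

# Made-up example: 6 periods, 2 series
m = matrix(c(10, 11, 12, 13, 15, 22,
             100, 100, 100, 100, 50, 200),
           nrow = 6, ncol = 2,
           dimnames = list(paste0("p", 1:6), c("A", "B")))
res = pct_change_lag(m, lag = 4)
```

The first four rows of `res` are NA because there is no value four periods earlier to compare against.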

2) I could do the same calculation [(val1 - val1 4 periods ago) / val1 4
periods ago] = resultval, but my calculation function would get the 'right'
values that would have no look-ahead bias. I'm not sure how I would do
this, but maybe a query starting with the date that I want, indexed to
appropriate report date indexed to the correct value to return. But how do
I do this in R? I think I would have to put this in our database and do a
query. The data comes to me in RData format. I could put it all in our
database via PpgSQL which we already use.
Note: I think this will be fastest.
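
One way the lookup in idea 2 might be done inside R, without a database, is an "as of" search with findInterval(): for each date wanted, find the latest report date on or before it and return the value attached to that report date. This is only a sketch for a single series; the helper name `as_of` is hypothetical, and it assumes report dates have already been converted to Date with missing ones as NA.

```r
# Hypothetical helper: for each query date, return the most recent value whose
# report date is on or before that date (NA if nothing had been reported yet).
as_of = function(query_dates, report_dates, values) {
  ok = !is.na(report_dates)
  rdts = report_dates[ok]
  vals = values[ok]
  if (length(rdts) == 0) return(rep(NA_real_, length(query_dates)))
  o = order(rdts)                 # findInterval() needs sorted break points
  rdts = rdts[o]
  vals = vals[o]
  # idx = number of report dates <= query date (0 means none yet)
  idx = findInterval(as.numeric(query_dates), as.numeric(rdts))
  c(NA, vals)[idx + 1]
}

# First three report dates and values of column "A" from the mock data below
rdA = as.Date(c("20070514", "20070814", "20071115"), format = "%Y%m%d")
vA = c(640.35, 636.16, 655.91)
q = as.Date(c("2007-05-13", "2007-05-14", "2007-08-20"))
got = as_of(q, rdA, vA)

# A series with no report dates at all (like column "D") gives all NA
none = as_of(q, as.Date(c(NA, NA, NA)), c(NaN, NaN, NaN))
```

Here `got[1]` is NA because nothing had been reported by 2007-05-13, while the later two dates pick up the values reported on 2007-05-14 and 2007-08-14. A value that is NaN but has a valid report date would be returned as NaN.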

Anyway, my first attempt at this was to solve part b of #1 above. Here is
how my data looks and my first attempt at solving part b of idea #1 above.
It only takes 0.14 seconds for my mock data, but that is way too slow. The
major things slowing it down are A) the loop and B) the merge statement.

# mock data: this is how it comes to me (raw)
# in practice I have over 10,000 columns

# the starting 'periods' for my data
z.dates = c("2007-03-31","2007-06-30","2007-09-30","2007-12-31",
            "2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 = matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",
               "20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
# rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
              2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

################################# change anything below. I can't change
# anything above this line.

# My first attempt at this was to solve part b of #1 above.
fix = function(x)
{
  year = substring(x, 1, 4);
  mo = substring(x, 5, 6);
  day = substring(x, 7, 8);
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}

rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd1)  # was dimnames(eps_rd); eps_rd is not defined in this mock example

dt1 <- seq(from = as.Date(z.dates[1]), to = as.Date("2009-03-25"), by = "day")
dt = sapply(dt1, as.character)

fin = dt
ck_rows = length(dt)
bad = character(0)
start_t_all = Sys.time()
for(cn in 1:ncol(ua)){
  uac = ua[,cn]
  tkr = colnames(ua)[cn]
  rdc = rd[,cn]
  ua_rd = cbind(uac,rdc)
  colnames(ua_rd) = c(tkr,'rt_date')
  xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
  xx = as.character(xx1[,2])
  values <- c(NA, xx[!is.na(xx)])
  ind = cumsum(!is.na(xx)) + 1
  y <- values[ind]
  if(ck_rows == length(y)){
    fin  = data.frame(fin,y)
  }else{
    bad = c(bad,tkr)
  }
}

colnames(fin) = c('daily_dates',nms)

# after this I would slice and dice the data into weekly, monthly, etc.
# periodicity as needed, but this leaves it in daily format which is as
# granular as I will get.

print("over all time for loop")
print(Sys.time()-start_t_all)

Regards,

Ben
