Hello,
Thank you for your help/advice!
The issue here is speed/efficiency. I can do what I want, but it's really
slow.
The goal is to have the ability to do calculations on my data and have it
adjusted for look-ahead. I see two ways to do this:
(I'm open to more ideas. My terminology: Unadjusted = values not adjusted
for look-ahead bias; adjusted = values adjusted for look-ahead bias.)
1) I could a) do calculations on unadjusted values then b) adjust the
resulting values for look-ahead bias. Here is what I mean:
a) Using the time series of val1, I could compute
resultval = (val1 - val1 4 periods ago) / (val1 4 periods ago).
("Periods" correspond to the z.dates in my example below.)
b) Then I would adjust the resultval for look-ahead based on val1's
associated report date.
Note: I don't think this will be the fastest.
2) I could do the same calculation,
resultval = (val1 - val1 4 periods ago) / (val1 4 periods ago), but have my
calculation function fetch the 'right' values, that is, values with no
look-ahead bias. I'm not sure how I would do this. Maybe a lookup that
starts from the date I want, maps it to the appropriate report date, and
then maps that report date to the value to return. But how do I do this in
R? I think I would have to put the data in our database and run a query.
The data comes to me in RData format. I could load it all into our
database via PpgSQL, which we already use.
Note: I think this will be fastest.
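To make idea #2 concrete, here is a minimal sketch of how the lookup might
be done in plain R without a database. It is only an illustration under
assumptions: growth_as_of is a hypothetical helper name, and it assumes the
report dates for a series are in period order, strictly increasing, and
contain no missing entries. findInterval() counts how many report dates fall
on or before each query date, which gives the index of the latest period
that was actually known on that date.

```r
# Hypothetical helper: look-ahead-free 4-period growth for one series.
# Assumes report_dates are in period order, increasing, with no NAs.
growth_as_of <- function(query_dates, report_dates, values, lag = 4) {
  # k = number of reports published on or before each query date
  k <- findInterval(as.numeric(query_dates), as.numeric(report_dates))
  padded <- c(NA, values)                 # position 1 means "nothing known yet"
  cur  <- padded[k + 1]                   # latest value known on the query date
  prev <- padded[pmax(k - lag, 0) + 1]    # value from `lag` periods before that
  (cur - prev) / prev
}

# Column "A" of the mock data
report_dates <- as.Date(c("20070514", "20070814", "20071115", "20080213",
                          "20080514", "20080814", "20081114", "20090217"),
                        format = "%Y%m%d")
values <- c(640.35, 636.16, 655.91, 657.41, 682.06, 702.90, 736.15, 667.65)

query <- as.Date(c("2008-05-13", "2008-05-14", "2009-03-25"))
resultval <- growth_as_of(query, report_dates, values)
```

On 2008-05-13 only four reports are out, so there is no value from four
periods earlier and the result is NA. One day later the fifth report is
available and the result compares 682.06 with 640.35.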
Anyway, my first attempt at this was to solve part b of #1 above. Here is
how my data looks and my first attempt at solving part b of idea #1 above.
It only takes 0.14 seconds for my mock data, but that is far too slow for
the real data. The major things slowing it down are A) the loop and B) the
merge statement.
# mock data: this is how it comes to me (raw)
# in practice I have over 10,000 columns
# the starting 'periods' for my data
z.dates = c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 = matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",
"20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
"20070426","--","--","--","--","--","--","20090319",
"--","--","--","--","--","--","--","--"),
nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)
# this is the unadjusted raw data, that always has the same dimensions,
# rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)
################################# change anything below. I can't change
# anything above this line.
# My first attempt at this was to solve part b of #1 above.
# convert "YYYYMMDD" strings to "YYYY-MM-DD"; "--" (no report) becomes "NA"
fix = function(x)
{
  year = substring(x, 1, 4)
  mo = substring(x, 5, 6)
  day = substring(x, 7, 8)
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}
rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd1)
# every calendar day from the first period through the end of the sample
dt1 <- seq(from = as.Date(z.dates[1]), to = as.Date("2009-03-25"),
           by = "day")
dt = as.character(dt1)
fin = dt
ck_rows = length(dt)
bad = character(0)
start_t_all = Sys.time()
for(cn in 1:ncol(ua)){
  uac = ua[,cn]                      # unadjusted values for this column
  tkr = colnames(ua)[cn]
  rdc = rd[,cn]                      # matching report dates
  ua_rd = cbind(uac,rdc)
  colnames(ua_rd) = c(tkr,'rt_date')
  # line each value up with the day it was reported
  xx1 = merge(dt,ua_rd,by.x=1,by.y='rt_date',all.x=TRUE)
  xx = as.character(xx1[,2])
  # carry the last reported value forward to the following days
  values <- c(NA, xx[!is.na(xx)])
  ind = cumsum(!is.na(xx)) + 1
  y <- values[ind]
  if(ck_rows == length(y)){
    fin = data.frame(fin,y)
  }else{
    bad = c(bad,tkr)
  }
}
colnames(fin) = c('daily_dates',nms)
# after this I would slice and dice the data into weekly, monthly, etc.
# periodicity as needed, but this leaves it in daily format which is as
# granular as I will get.
print("over all time for loop")
print(Sys.time()-start_t_all)
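
One possible way to avoid the merge is sketched below. It is not a tested
replacement for the loop above, just an illustration of the idea on a small
stand-in for the mock data. The names fill_daily, rd_raw, vals, days and
daily are made up for this example. It converts the report dates once with
as.Date() (which turns "--" into NA) and then uses findInterval() to find,
for every day, the most recent report on or before that day.

```r
# Small stand-in for the mock data (two tickers, three periods).
z <- c("2007-03-31", "2007-06-30", "2007-09-30")
rd_raw <- matrix(c("20070514", "20070814", "20071115",
                   "20070426", "--",       "--"),
                 nrow = 3, dimnames = list(z, c("A", "C")))
vals <- matrix(c(640.35, 636.16, 655.91,
                 NaN,    NaN,    -98.426),
               nrow = 3, dimnames = list(z, c("A", "C")))

days <- seq(as.Date(z[1]), as.Date("2007-12-31"), by = "day")

# Hypothetical helper: carry one column's values forward on the daily grid.
fill_daily <- function(v, r, days) {
  r   <- as.Date(r, format = "%Y%m%d")    # "--" becomes NA
  ok  <- !is.na(r)                        # keep periods that have a report date
  o   <- order(r[ok])                     # findInterval() needs sorted dates
  idx <- findInterval(as.numeric(days), as.numeric(r[ok][o]))
  unname(c(NA, v[ok][o])[idx + 1])        # idx 0 means "nothing reported yet"
}

# One vectorised pass per column; the result is a days x tickers matrix.
daily <- vapply(colnames(vals),
                function(nm) fill_daily(vals[, nm], rd_raw[, nm], days),
                numeric(length(days)))
rownames(daily) <- as.character(days)
```

Unlike the loop above, this keeps the values numeric instead of converting
them to character, and a value counts as available on its report date
itself, which matches what the merge does.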
Regards,
Ben