Hello,
Does anyone know of a way I can speed this up? Basically I'm attempting to
get the data item on the same row as the report date for each report date
available. In reality, I have over 11k of columns, not just A, B, C, D and
I have to do that over 100 times. My solution is slow, but it works. The
loop is slow because of merge.
# create sample data
z.dates
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1
matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",
"20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
"20070426","--","--","--","--","--","--","20090319",
"--","--","--","--","--","--","--","--"),
nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)
# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)
################################# change anything below.
# My first attempt at this
fix = function(x)
{
year = substring(x, 1, 4);
mo = substring(x, 5, 6);
day = substring(x, 7, 8);
ifelse(year=="--", "NA", paste(year, mo, day, sep =
"-"))
}
rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)
dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"),
by "day")
dt = sapply(dt1, as.character)
fin = dt
ck_rows = length(dt)
bad = character(0)
start_t_all = Sys.time()
for(cn in 1:ncol(ua)){
uac = ua[,cn]
tkr = colnames(ua)[cn]
rdc = rd[,cn]
ua_rd = cbind(uac,rdc)
colnames(ua_rd) = c(tkr,'rt_date')
xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
xx = as.character(xx1[,2])
values <- c(NA, xx[!is.na(xx)])
ind = cumsum(!is.na(xx)) + 1
y <- values[ind]
if(ck_rows == length(y)){
fin = data.frame(fin,y)
}else{
bad = c(bad,tkr)
}
}
colnames(fin) = c('daily_dates',nms)
print("over all time for loop")
print(Sys.time()-start_t_all)
print(fin)
Thanks,
Ben
PS - the real/over-all issue is below, but it is probably too involved to
follow.
On Sat, Mar 3, 2012 at 2:30 PM, Ben quant <ccquant@gmail.com> wrote:
> Hello,
>
> Thank you for your help/advice!
>
> The issue here is speed/efficiency. I can do what I want, but its really
> slow.
>
> The goal is to have the ability to do calculations on my data and have it
> adjusted for look-ahead. I see two ways to do this:
> (I'm open to more ideas. My terminology: Unadjusted = values not
adjusted
> for look-ahead bias; adjusted = values adjusted for look-ahead bias.)
>
> 1) I could a) do calculations on unadjusted values then b) adjust the
> resulting values for look-ahead bias. Here is what I mean:
> a) I could say the following using time series of val1: [(val1 - val1 4
> periods ago) / val1 4 periods ago] = resultval. ("Periods"
correspond to
> the z.dates in my example below.)
> b) Then I would adjust the resultval for look-ahead based on val1's
> associated report date.
> Note: I don't think this will be the fastest.
>
> 2) I could do the same calculation [(val1 - val1 4 periods ago) / val1 4
> periods ago] = resultval, but my calculation function would get the
'right'
> values that would have no look-ahead bias. I'm not sure how I would do
> this, but maybe a query starting with the date that I want, indexed to
> appropriate report date indexed to the correct value to return. But how do
> I do this in R? I think I would have to put this in our database and do a
> query. The data comes to me in RData format. I could put it all in our
> database via PpgSQL which we already use.
> Note: I think this will be fastest.
>
> Anyway, my first attempt at this was to solve part b of #1 above. Here is
> how my data looks and my first attempt at solving part b of idea #1 above.
> It only takes 0.14 seconds for my mock data, but that is way too slow. The
> major things slowing it down A) the loop, B) the merge statement.
>
> # mock data: this is how it comes to me (raw)
> # in practice I have over 10,000 columns
>
> # the starting 'periods' for my data
> z.dates >
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
>
> nms = c("A","B","C","D")
> # these are the report dates that are the real days the data was available
> rd1 >
matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",
>
>
"20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
>
"20070426","--","--","--","--","--","--","20090319",
>
"--","--","--","--","--","--","--","--"),
> nrow=8,ncol=4)
> dimnames(rd1) = list(z.dates,nms)
>
> # this is the unadjusted raw data, that always has the same dimensions,
> rownames, and colnames as the report dates
> ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
>
> 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
> NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
> NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
> nrow=8,ncol=4)
> dimnames(ua) = list(z.dates,nms)
>
> ################################# change anything below. I can't change
> anything above this line.
>
> # My first attempt at this was to solve part b of #1 above.
> fix = function(x)
> {
> year = substring(x, 1, 4);
> mo = substring(x, 5, 6);
> day = substring(x, 7, 8);
> ifelse(year=="--", "NA", paste(year, mo, day, sep =
"-"))
> }
>
> rd = apply(rd1, 2, fix)
> dimnames(rd) = dimnames(rd)
>
> dt1 <- seq(from =as.Date(z.dates[1]), to =
as.Date("2009-03-25"), by > "day")
> dt = sapply(dt1, as.character)
>
> fin = dt
> ck_rows = length(dt)
> bad = character(0)
> start_t_all = Sys.time()
> for(cn in 1:ncol(ua)){
> uac = ua[,cn]
> tkr = colnames(ua)[cn]
> rdc = rd[,cn]
> ua_rd = cbind(uac,rdc)
> colnames(ua_rd) = c(tkr,'rt_date')
> xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
> xx = as.character(xx1[,2])
> values <- c(NA, xx[!is.na(xx)])
> ind = cumsum(!is.na(xx)) + 1
> y <- values[ind]
> if(ck_rows == length(y)){
> fin = data.frame(fin,y)
> }else{
> bad = c(bad,tkr)
> }
> }
>
> colnames(fin) = c('daily_dates',nms)
>
> # after this I would slice and dice the data into weekly, monthly, etc.
> periodicity as needed, but this leaves it in daily format which is as
> granular as I will get.
>
> print("over all time for loop")
> print(Sys.time()-start_t_all)
>
> Regards,
>
> Ben
>
>
>
[[alternative HTML version deleted]]