thr3ads.net - R help - [R] index instead of loop? [Mar 2012]

Ben quant

2012-Mar-05 20:53 UTC

[R] index instead of loop?

Hello,

Does anyone know of a way I can speed this up? Basically, I'm attempting to
get the data item on the same row as the report date, for each available
report date. In reality I have over 11k columns, not just A, B, C, D, and
I have to do this over 100 times. My solution works, but it is slow. The
loop is slow because of merge.

# create sample data
z.dates = c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 = matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",
               "20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
# rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

################################# change anything below.

# My first attempt at this
fix = function(x)
{
  year = substring(x, 1, 4);
  mo = substring(x, 5, 6);
  day = substring(x, 7, 8);
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}

rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd1)

dt1 <- seq(from = as.Date(z.dates[1]), to = as.Date("2009-03-25"), by = "day")
dt = sapply(dt1, as.character)

fin = dt
ck_rows = length(dt)
bad = character(0)
start_t_all = Sys.time()
for(cn in 1:ncol(ua)){
  uac = ua[,cn]
  tkr = colnames(ua)[cn]
  rdc = rd[,cn]
  ua_rd = cbind(uac,rdc)
  colnames(ua_rd) = c(tkr,'rt_date')
  xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
  xx = as.character(xx1[,2])
  values <- c(NA, xx[!is.na(xx)])
  ind = cumsum(!is.na(xx)) + 1
  y <- values[ind]
  if(ck_rows == length(y)){
    fin  = data.frame(fin,y)
  }else{
    bad = c(bad,tkr)
  }
}

colnames(fin) = c('daily_dates',nms)

print("over all time for loop")
print(Sys.time()-start_t_all)

print(fin)
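
The fill-forward step inside the loop (values / ind / y) may be easier to
follow in isolation. Below is a minimal sketch on a made-up vector; the
numbers in xx are only for illustration and are not output from the code
above.

```r
# Toy vector: NA marks a day on which no new report arrived (illustrative values).
xx <- c(NA, "640.35", NA, NA, "636.16", NA)

# Prepend NA so that days before the first report map to NA.
values <- c(NA, xx[!is.na(xx)])

# cumsum() counts the reports seen so far; +1 skips past the leading NA.
ind <- cumsum(!is.na(xx)) + 1

# Each day picks up the most recent reported value.
y <- values[ind]
print(y)
```

Each position in y holds the last non-NA value at or before it, so no
explicit loop over days is needed for this step.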


Thanks,

Ben

PS - the real/over-all issue is below, but it is probably too involved to
follow.

On Sat, Mar 3, 2012 at 2:30 PM, Ben quant <ccquant@gmail.com> wrote:
> Hello,
>
> Thank you for your help/advice!
>
> The issue here is speed/efficiency. I can do what I want, but it's really
> slow.
>
> The goal is to have the ability to do calculations on my data and have it
> adjusted for look-ahead. I see two ways to do this:
> (I'm open to more ideas. My terminology: Unadjusted = values not adjusted
> for look-ahead bias; adjusted = values adjusted for look-ahead bias.)
>
> 1) I could a) do calculations on unadjusted values then b) adjust the
> resulting values for look-ahead bias. Here is what I mean:
>  a) I could say the following using time series of val1: [(val1 - val1 4
> periods ago) / val1 4 periods ago] = resultval. ("Periods" correspond to
> the z.dates in my example below.)
> b) Then I would adjust the resultval for look-ahead based on val1's
> associated report date.
> Note: I don't think this will be the fastest.
>
> 2) I could do the same calculation [(val1 - val1 4 periods ago) / val1 4
> periods ago] = resultval, but my calculation function would get the 'right'
> values that would have no look-ahead bias. I'm not sure how I would do
> this, but maybe a query starting with the date that I want, indexed to
> appropriate report date indexed to the correct value to return. But how do
> I do this in R? I think I would have to put this in our database and do a
> query. The data comes to me in RData format. I could put it all in our
> database via PpgSQL which we already use.
> Note: I think this will be fastest.
>
> Anyway, my first attempt at this was to solve part b of #1 above. Here is
> how my data looks and my first attempt at solving part b of idea #1 above.
> It only takes 0.14 seconds for my mock data, but that is way too slow. The
> major things slowing it down are A) the loop and B) the merge statement.
>
> # mock data: this is how it comes to me (raw)
> # in practice I have over 10,000 columns
>
> # the starting 'periods' for my data
> z.dates = c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
>
> nms = c("A","B","C","D")
> # these are the report dates that are the real days the data was available
> rd1 = matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",
>                "20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
>                "20070426","--","--","--","--","--","--","20090319",
>                "--","--","--","--","--","--","--","--"),
>              nrow=8,ncol=4)
> dimnames(rd1) = list(z.dates,nms)
>
> # this is the unadjusted raw data, that always has the same dimensions,
> # rownames, and colnames as the report dates
> ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
>
> 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
>               NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
>               NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
>             nrow=8,ncol=4)
> dimnames(ua) = list(z.dates,nms)
>
> ################################# change anything below. I can't change
> # anything above this line.
>
> # My first attempt at this was to solve part b of #1 above.
> fix = function(x)
> {
>   year = substring(x, 1, 4);
>   mo = substring(x, 5, 6);
>   day = substring(x, 7, 8);
>   ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
> }
>
> rd = apply(rd1, 2, fix)
> dimnames(rd) = dimnames(rd1)
>
> dt1 <- seq(from = as.Date(z.dates[1]), to = as.Date("2009-03-25"), by = "day")
> dt = sapply(dt1, as.character)
>
> fin = dt
> ck_rows = length(dt)
> bad = character(0)
> start_t_all = Sys.time()
> for(cn in 1:ncol(ua)){
>   uac = ua[,cn]
>   tkr = colnames(ua)[cn]
>   rdc = rd[,cn]
>   ua_rd = cbind(uac,rdc)
>   colnames(ua_rd) = c(tkr,'rt_date')
>   xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
>   xx = as.character(xx1[,2])
>   values <- c(NA, xx[!is.na(xx)])
>   ind = cumsum(!is.na(xx)) + 1
>   y <- values[ind]
>   if(ck_rows == length(y)){
>     fin  = data.frame(fin,y)
>   }else{
>     bad = c(bad,tkr)
>   }
> }
>
> colnames(fin) = c('daily_dates',nms)
>
> # after this I would slice and dice the data into weekly, monthly, etc.
> # periodicity as needed, but this leaves it in daily format which is as
> # granular as I will get.
>
> print("over all time for loop")
> print(Sys.time()-start_t_all)
>
> Regards,
>
> Ben
>
>
>
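
For reference, the calculation described in idea 1a of the quoted message,
[(val1 - val1 4 periods ago) / val1 4 periods ago], can be sketched on the
unadjusted matrix without a loop. The function name pct_change and its lag
argument are hypothetical, introduced only for this illustration; the
result would still need the look-ahead adjustment from part b.

```r
# Hypothetical helper: (val - val `lag` periods ago) / val `lag` periods ago,
# computed for every column at once. Rows without enough history stay NA.
pct_change <- function(m, lag = 4) {
  n <- nrow(m)
  out <- matrix(NA_real_, nrow = n, ncol = ncol(m), dimnames = dimnames(m))
  if (n > lag) {
    cur <- m[(lag + 1):n, , drop = FALSE]   # current periods
    old <- m[1:(n - lag), , drop = FALSE]   # same periods, `lag` rows earlier
    out[(lag + 1):n, ] <- (cur - old) / old
  }
  out
}

# Column A of the mock data.
ua_a <- matrix(c(640.35, 636.16, 655.91, 657.41, 682.06, 702.90, 736.15, 667.65),
               ncol = 1, dimnames = list(NULL, "A"))
res_a <- pct_change(ua_a)
print(res_a)
```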

Rui Barradas

2012-Mar-06 02:48 UTC

[R] index instead of loop?

Hello,

> Mar 05, 2012; 8:53pm, by Ben quant
> Hello,
>
> Does anyone know of a way I can speed this up?

Maybe, let's see.
>
> ################################# change anything below.
>
# Yes.
# First, start by using dates, not characters

fdate <- function(x, format="%Y%m%d"){
	DF <- data.frame(x)
	for(i in colnames(DF)){
		DF[, i] <- as.Date(DF[, i], format=format)
		class(DF[, i]) <- "Date"
	}
	DF
}

rd1 <- fdate(rd1)
# This is yours, use it.
dt1 <- seq(from = as.Date(z.dates[1]), to = as.Date("2009-03-25"), by = "day")
# Set up the result, no time expensive 'cbind' inside a loop
fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
fin1[, 1] <- dt1
nr <- nrow(rd1)

# And vectorize
for(tkr in 1:ncol(ua)){
	x  <- c(rd1[, tkr], as.Date("9999-12-31"))
	# ix[[i]]: rows of dt1 falling between report date i and the next one
	ix <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
	sapply(1:length(ix), function(i) if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
}
colnames(fin1) <- c("daily_dates", colnames(ua))

# Check results
str(fin)
str(fin1)
head(fin)
head(fin1)
tail(fin)
tail(fin1)


Note that 'fin' has factors, 'fin1' numerics.
I haven't timed it but I believe it should be faster.
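
A further possibility, sketched here and not timed either, is to replace
the inner lapply/sapply with base R's findInterval(), which maps every
daily date to the latest report date on or before it in one call. The
helper name fill_by_report_date is hypothetical. Note one assumed
difference in behaviour: missing report dates are dropped first, so the
last known value is carried forward across them, which is closer to the
original merge-based loop than to the interval test above.

```r
# Hypothetical helper, not from the thread: fill one column forward by report date.
fill_by_report_date <- function(values, report_dates, daily_dates) {
  ok <- !is.na(report_dates)               # drop periods with no report date
  if (!any(ok)) return(rep(NA_real_, length(daily_dates)))
  rd <- report_dates[ok]
  v <- values[ok]
  ord <- order(rd)                         # findInterval() needs sorted breakpoints
  rd <- rd[ord]
  v <- v[ord]
  # For each day: index of the latest report date <= that day (0 if none yet).
  idx <- findInterval(as.numeric(daily_dates), as.numeric(rd))
  out <- rep(NA_real_, length(daily_dates))
  out[idx > 0] <- v[idx[idx > 0]]
  out
}

# Column A of the sample data, first three periods only.
rd_a <- as.Date(c("20070514", "20070814", "20071115"), format = "%Y%m%d")
ua_a <- c(640.35, 636.16, 655.91)
days <- seq(as.Date("2007-03-31"), as.Date("2007-12-31"), by = "day")
res <- fill_by_report_date(ua_a, rd_a, days)
print(res[days == as.Date("2007-08-14")])
```

Applied per column, for example inside the existing for loop over
1:ncol(ua), this would fill fin1[, tkr + 1] with a single vectorized
lookup per ticker.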

Hope this helps,

Rui Barradas





--
View this message in context:
http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4448567.html
Sent from the R help mailing list archive at Nabble.com.
