Vikas N Kumar
2008-Feb-22 21:15 UTC
[R] Corrected : Efficient writing of calculation involving each element of 2 data frames
Hi
I have 2 data.frames, each with the same number of rows (approximately 30000
or more entries).
They also have the same number of columns, let's say 2.
One column has the date, the other has a double-precision number. Let
the column names be V1, V2.
Now I want to calculate the correlation of the 2 sets of data, for the last
100 days for every day available in the data.frames.
My code looks like this:
# Let df1 and df2 be the 2 data frames with the required data
## begin code snippet
my_corr <- c()
for (i_start in 100:nrow(df1))
  my_corr[i_start - 99] <-
    cor(x = df1[(i_start-99):i_start, "V2"], y = df2[(i_start-99):i_start, "V2"])
## end of code snippet
This runs very slowly. When I calculate correlations between 10 data sets, I
am left with 45 pairwise runs of this snippet, and the whole job takes from
more than 30 minutes to over an hour.
Is there a more efficient way to write this piece of code so that it runs
faster?
If I do something similar in Excel, it is much faster. But I have to use R,
since this is a part of a bigger program.
Any help will be appreciated.
Thanks and Regards
Vikas
--
http://www.vikaskumar.org/
[[alternative HTML version deleted]]
jim holtman
2008-Feb-23 00:06 UTC
[R] Corrected : Efficient writing of calculation involving each element of 2 data frames
Take a look at the 'embed' function. With that you can create a matrix with the data shifted in each column. You would want to do embed(your.data, 100).

On Fri, Feb 22, 2008 at 4:15 PM, Vikas N Kumar <vikasnkumar at users.sourceforge.net> wrote:
> [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
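As a quick illustration of what embed does (my example, not from the original message): each row of the result holds one window of the series, most recent value first.

```r
# embed(x, d): row i contains x[i+d-1], ..., x[i+1], x[i]
# (the window ending at position i+d-1, most recent value first)
x <- 1:6
embed(x, 3)
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]    4    3    2
# [3,]    5    4    3
# [4,]    6    5    4
```

So for the rolling correlation, embed(a, 100) and embed(b, 100) give two (n-99) x 100 matrices whose i-th rows are the aligned 100-day windows, and a row-wise cor() over them yields one correlation per day.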
Felix Andrews
2008-Feb-24 07:17 UTC
[R] Corrected : Efficient writing of calculation involving each element of 2 data frames
Vikas,
Please provide reproducible code when posting to this list (read the
posting guide).
You need to vectorise your code to make it run faster.
Here are some timing results for 3 different vectorised methods. All 3
methods take under 30 seconds to run on a pair of 30,000-element vectors.
The results suggest that using rollapply (from the zoo package) will work
well. You might get it to go a bit faster with embed, but as I understand
it that requires explicitly forming a large matrix, which may cause
problems with big datasets. It is also not as intuitive as the zoo
function.
df.a <- data.frame(Date=as.Date(1:30000, origin="1970-01-01"), Value=rnorm(30000))
df.b <- data.frame(Date=as.Date(1:30000, origin="1970-01-01"), Value=rnorm(30000))
# METHOD 1: sapply
starts <- seq(1, nrow(df.a)-99)
system.time(
cors <- sapply(starts, function(i) {
if (i %% 1000 == 0) print(i)
subset <- i + 0:99
cor(df.a[subset, 2], df.b[subset, 2])
})
)
# user system elapsed
# 29.97 0.31 31.22
# METHOD 2: zoo::rollapply
library(zoo)
z.ab <- zoo(cbind(a=df.a$Value, b=df.b$Value), order.by=df.a$Date)
system.time(
zcors <- unlist(rollapply(z.ab, width=100, FUN=function(z)
cor(z[,1], z[,2]), by.column = FALSE, align="right")
)
)
# user system elapsed
# 14.86 0.39 16.02
all.equal(cors, coredata(zcors)) # TRUE
# METHOD 3: embed / sapply
mat.a <- embed(df.a$Value, 100)
mat.b <- embed(df.b$Value, 100)
system.time(
ecors <- sapply(1:nrow(mat.a), function(i)
cor(mat.a[i,], mat.b[i,]))
)
# user system elapsed
# 12.30 0.04 12.73
all.equal(cors, ecors) # TRUE
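For completeness, a fourth approach (my addition, not from the original thread): the rolling Pearson correlation can be computed entirely from running sums via cumsum, with no per-window loop at all. This is a sketch assuming NA-free data; roll_cor is a hypothetical helper name, and I have not benchmarked it against the methods above.

```r
# Rolling Pearson correlation from running sums: for each window of
# width w, cor = (w*Sxy - Sx*Sy) / sqrt((w*Sxx - Sx^2) * (w*Syy - Sy^2)).
# Assumes no NAs; cumsum can lose some precision on very long series.
roll_cor <- function(x, y, w) {
  cs <- function(v) c(0, cumsum(v))           # zero-padded cumulative sum
  sx  <- cs(x);   sy  <- cs(y)
  sxx <- cs(x*x); syy <- cs(y*y); sxy <- cs(x*y)
  i <- w:length(x)                            # window end positions
  win <- function(s) s[i + 1] - s[i - w + 1]  # sum over each length-w window
  num <- w * win(sxy) - win(sx) * win(sy)
  den <- sqrt((w * win(sxx) - win(sx)^2) * (w * win(syy) - win(sy)^2))
  num / den
}

# small check against cor() on the first window
set.seed(1)
x <- rnorm(300); y <- rnorm(300)
rc <- roll_cor(x, y, 100)
all.equal(rc[1], cor(x[1:100], y[1:100]))  # TRUE (up to floating-point error)
```

On the thread's data this would be roll_cor(df.a$Value, df.b$Value, 100), comparable against cors from the sapply method.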
On Sat, Feb 23, 2008 at 8:15 AM, Vikas N Kumar
<vikasnkumar at users.sourceforge.net> wrote:
> [...]
--
Felix Andrews / 安福立
PhD candidate
Integrated Catchment Assessment and Management Centre
The Fenner School of Environment and Society
The Australian National University (Building 48A), ACT 0200
Beijing Bag, Locked Bag 40, Kingston ACT 2604
http://www.neurofractal.org/felix/
3358 543D AAC6 22C2 D336 80D9 360B 72DD 3E4C F5D8