Hello,

I do not understand the correct way to approach the following problem in R.

I have observations of pairs of variables, v1, o1, v2, o2, etc., observed every 30 seconds. What I would like to do is compute the correlation matrix, but not for all my data, just for, say, 5-minute or 1-hour chunks.

In SQL, what I would say is

    select id, date_trunc('hour'::text, ts) as tshour, corr(n1,o1) as corr1
    from raw30s
    where id = 1201087 and
          (ts between 'Mar 1, 2007' and 'Apr 1, 2007')
    group by id,tshour order by id,tshour;

I've pulled data from PostgreSQL into R, and have a data frame containing a timestamp column, v, and o (both numeric).

I created a grouping index for every 5 minutes along these lines:

    obsfivemin <- trunc(obsts, units="hours") +
                  (floor(obsts$min / 5) * 5 * 60)

(where obsts is the SQL timestamp converted into a POSIXlt date-time object)

Then I tried aggregate(df, by=obsfivemin, cor), but that seemed to pass just a single column at a time to cor, not the entire data frame. It worked for mean and sum, but not cor.

In desperation, I tried looping over the different 5-minute levels and computing cor, but I'm so R-clueless I couldn't even figure out how to assign to a variable inside of that loop! Code such as

    for (f in fivemin) {
        output[f] <- cor(df[grouper == f, ])
    }

failed, as I couldn't figure out how to initialize output so that output[f] would accept the output of cor.

Any help or steering towards the proper R-way would be appreciated.

Regards,

James Marca
Hi James,

It would really help if you gave us a sample of the data you are working with. The following is not tested, because I don't have your data and am too lazy to construct a similar example dataset for you, but it might get you started.

You can try a for loop along these lines. Note that cor() called on a two-column data frame returns a 2x2 matrix, so if you want a single correlation per group it is easiest to call cor() on the two columns separately:

    output <- data.frame(obsfivemin = unique(obsfivemin),
                         fivemin.cor = NA_real_)
    for (f in unique(obsfivemin)) {
        output$fivemin.cor[output$obsfivemin == f] <-
            cor(df$v[obsfivemin == f], df$o[obsfivemin == f])
    }

Or you can try something like this with the plyr package (dlply needs the grouping variable to be a column of the data frame):

    cor.dat <- function(d) {
        cor(d[, c("v", "o")])
    }
    library(plyr)
    df$obsfivemin <- obsfivemin
    dlply(df, .(obsfivemin), cor.dat)

Good luck,
Ista

On Tue, Mar 9, 2010 at 9:36 PM, James Marca <jmarca at translab.its.uci.edu> wrote:
> [...]

--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org
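For reference, here is one self-contained base-R way to do a per-chunk correlation with split() and lapply(); the data below are invented purely so the sketch runs on its own (column names v and o follow James's description).

```r
# Invented 30-second data: one hour of observations
set.seed(1)
ts <- seq(as.POSIXct("2007-03-01 00:00:00"), by = 30, length.out = 120)
df <- data.frame(ts = ts, v = rnorm(120), o = rnorm(120))

# Grouping index: truncate each timestamp down to its 5-minute bin
fivemin <- as.POSIXct(floor(as.numeric(df$ts) / 300) * 300,
                      origin = "1970-01-01")

# split() the data frame by bin, then cor() each piece;
# the result is a list of 2x2 correlation matrices, one per bin
cors <- lapply(split(df[, c("v", "o")], fivemin), cor)

# Or extract just the single v-o correlation for each bin
corr1 <- sapply(split(df[, c("v", "o")], fivemin),
                function(d) cor(d$v, d$o))
```

This is essentially the "proper R-way" version of the SQL GROUP BY: split() builds the groups, and lapply()/sapply() applies cor() to each group.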
James, you may want to post your question to the R-SIG-Finance group with a small example. If I understand your problem correctly, it is like converting tick data of a financial time series into aggregates (1-minute, hourly, daily, ... data sets). There are packages available for this kind of task that are very fast and efficient (no looping!).

regards
Helmuth

--
View this message in context: http://n4.nabble.com/Help-with-aggregate-and-cor-tp1586878p1587069.html
Sent from the R help mailing list archive at Nabble.com.
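Helmuth does not name the packages, but one commonly suggested option for this kind of time-based aggregation is xts, whose endpoints() and period.apply() pair replaces the explicit loop. A sketch with invented data:

```r
library(xts)

# Invented 30-second data: one hour of observations
set.seed(1)
ts <- seq(as.POSIXct("2007-03-01 00:00:00"), by = 30, length.out = 120)
x <- xts(cbind(v = rnorm(120), o = rnorm(120)), order.by = ts)

# endpoints() finds the index of the last observation in each
# 5-minute period; period.apply() then applies a function to each
# period's chunk of rows -- no explicit looping
ep <- endpoints(x, on = "minutes", k = 5)
corr1 <- period.apply(x, ep,
                      function(chunk) cor(chunk[, "v"], chunk[, "o"]))
```

The result is an xts series with one v-o correlation per 5-minute period, timestamped at each period's last observation.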
The sqldf package can be used to manipulate R data frames with SQL statements. See http://sqldf.googlecode.com

On Tue, Mar 9, 2010 at 9:36 PM, James Marca <jmarca at translab.its.uci.edu> wrote:
> [...]
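One caveat not mentioned above: sqldf's default SQLite backend has no corr() aggregate, unlike PostgreSQL. A workaround is to accumulate the sums per group in SQL and finish the Pearson formula in R. The data frame and its group column below are invented for illustration:

```r
library(sqldf)

# Invented data: 4 groups of 25 paired observations
set.seed(1)
df <- data.frame(g = rep(1:4, each = 25), v = rnorm(100), o = rnorm(100))

# GROUP BY in SQL, collecting the sums the Pearson formula needs
agg <- sqldf("select g, count(*) as n, sum(v) as sv, sum(o) as so,
                     sum(v*v) as svv, sum(o*o) as soo, sum(v*o) as svo
              from df group by g")

# Pearson correlation per group:
#   r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
agg$corr1 <- with(agg, (n*svo - sv*so) /
                       sqrt((n*svv - sv^2) * (n*soo - so^2)))
```

Each agg$corr1 value matches what cor() returns on that group's v and o columns.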