Karl Ove Hufthammer
2009-Dec-07 15:57 UTC
[R] Subset of time observations where timediff > 60 secs
Dear list members I have a rather large vector (part of a data frame) giving the time (date + time, POSIXct) of observations. The times are irregular (with both small and large jumps) but increasing, and there are several millions of them. I now wish to reduce my data set, so that I only have observations which are at least (for example) 60 seconds apart. Basically, I need (all) the indices of my time variable where the difference in times are at least 60 seconds. I thought this would be a rather simple task, but perhaps I'm tired, for I couldn't figure out how to do it in a even moderately elegant way (not looping over all the values, which is quite slow). This solution seemed sensible: x=cumsum(diff(timevar) %/% 60) ind=c(1,cumsum(rle(x)$lengths)+1) # And perhaps removing the last value but doesn't work, as it only captures the 'first times' in each 60-second interval following the first time value, and thus may include times with values that are closer than 60 seconds. I also considered round.POSIXct and trunc.POSIXct, but these are not appropriate either, for obvious reasons. So, any ideas how to do this in an elegant and efficient way? -- Karl Ove Hufthammer
Gabor Grothendieck
2009-Dec-07 17:27 UTC
[R] Subset of time observations where timediff > 60 secs
There is an example at the end of the Prices and Returns section of the zoo-quickref vignette in the zoo package. library(zoo) vignette("zoo-quickref") If speed is your main concern check this recent thread that was posted on R-sig-finance: http://n4.nabble.com/SUMMARY-Reducing-an-intra-day-dataset-into-one-obs-per-second-td949612.html On Mon, Dec 7, 2009 at 10:57 AM, Karl Ove Hufthammer <karl at huftis.org> wrote:> Dear list members > > I have a rather large vector (part of a data frame) giving the time > (date + time, POSIXct) of observations. The times are irregular (with > both small and large jumps) but increasing, and there are several > millions of them. > > I now wish to reduce my data set, so that I only have observations which > are at least (for example) 60 seconds apart. Basically, I need (all) the > indices of my time variable where the difference in times are at least > 60 seconds. > > I thought this would be a rather simple task, but perhaps I'm tired, for > I couldn't figure out how to do it in a even moderately elegant way (not > looping over all the values, which is quite slow). > > This solution seemed sensible: > > x=cumsum(diff(timevar) %/% 60) > ind=c(1,cumsum(rle(x)$lengths)+1) # And perhaps removing the last value > > but doesn't work, as it only captures the 'first times' in each > 60-second interval following the first time value, and thus may include > times with values that are closer than 60 seconds. > > I also considered round.POSIXct and trunc.POSIXct, but these are not > appropriate either, for obvious reasons. > > So, any ideas how to do this in an elegant and efficient way? > > -- > Karl Ove Hufthammer > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >