Richard O'Keefe
2021-Aug-30 02:09 UTC
[R] Calculate daily means from 5-minute interval data
Why would you need a package for this?

> samples.per.day <- 12*24

That's 12 5-minute intervals per hour and 24 hours per day.
Generate some fake data.

> x <- rnorm(samples.per.day * 365)
> length(x)
[1] 105120

Reshape the fake data into a matrix where each row represents one
24-hour period.

> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)

Now we can summarise the rows any way we want.
The basic tool here is ?apply.
?rowMeans is said to be faster than using apply to calculate means,
so we'll use that. There is no *rowSds so we have to use apply
for the standard deviation. I use ?head because I don't want to
post tens of thousands of meaningless numbers.

> head(rowMeans(m))
[1] -0.03510177 0.11817337 0.06725203 -0.03578195 -0.02448077 -0.03033692
> head(apply(m, MARGIN=1, FUN=sd))
[1] 1.0017718 0.9922920 1.0100550 0.9956810 1.0077477 0.9833144

Now whether this is a *sensible* way to summarise your flow data is a question
that a hydrologist would be better placed to answer. I would have started with

> plot(density(x))

which I just did with some real river data (only a month of it, sigh).
Very long tail.
Even

> plot(density(log(r)))

shows a very long tail. Time to plot the data against time. Oh my!
All of the long tail came from a single event.
There's a period of low flow, then there's a big rainstorm and the
flow goes WAY up, then over about two days the flow subsides to a new
somewhat higher level.

None of this is reflected in means or standard deviations.
This is *time series* data, and time series data of a fairly special kind.

One thing that might be helpful with your data would simply be

> image(log(m))

For my one month sample, the spike showed up very clearly that way.
Because right now, your first task is to get an idea of what the data
look like, and means-and-standard-deviations won't really do that.

Oh heck, here's another reason to go with image(log(m)).
With image(m) I just see the one big spike.
With image(log(m)), I can see that little spikes often start in the
afternoon of one day and continue into the morning of the next.
From daily means, it looks like two unusual, but not very
unusual, days. From the image, it's clearly ONE rainfall event
that just happens to straddle a day boundary.

This is all very basic stuff, which is really the point. You want to use
elementary tools to look at the data before you reach for fancy ones.

On Mon, 30 Aug 2021 at 03:09, Rich Shepard <rshepard at appl-ecosys.com> wrote:
>
> I have a year's hydraulic data (discharge, stage height, velocity, etc.)
> from a USGS monitoring gauge recording values every 5 minutes. The data
> files contain 90K-93K lines and plotting all these data would produce a
> solid block of color.
>
> What I want are the daily means and standard deviation from these data.
>
> As an occasional R user (depending on project needs) I've no idea what
> packages could be applied to these data frames. There likely are multiple
> paths to extracting these daily values so summary statistics can be
> calculated and plotted. I'd appreciate suggestions on where to start to
> learn how I can do this.
>
> TIA,
>
> Rich
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
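
The point about one event straddling a day boundary can be checked with a small deterministic example. This is only a sketch on made-up numbers (a flat baseline of 10 and a single 12-hour pulse of 100 centred on midnight are assumptions for illustration, not real gauge data); it reuses the same reshape-by-row idea as the message above.

```r
# Hypothetical toy data: 3 days of 5-minute samples at a constant baseline of 10.
samples.per.day <- 12 * 24                  # 288 samples per day
flow <- rep(10, samples.per.day * 3)

# One made-up event: flow of 100 for the last 6 hours of day 1 and the
# first 6 hours of day 2 (72 samples on each side of midnight).
event <- (samples.per.day - 71):(samples.per.day + 72)
flow[event] <- 100

# Same reshape as above: one row per 24-hour period.
m <- matrix(flow, ncol = samples.per.day, byrow = TRUE)
daily.mean <- rowMeans(m)
daily.sd   <- apply(m, MARGIN = 1, FUN = sd)

# Daily means flag two elevated days ...
days.elevated <- sum(daily.mean > 20)

# ... but the raw series contains a single contiguous run of high flow.
runs <- rle(flow > 50)
n.events <- sum(runs$values)

print(daily.mean)
print(c(days.elevated = days.elevated, n.events = n.events))
```

On this toy series the daily summary shows two moderately elevated days, while the run-length view of the raw samples shows one event, which is the same effect the image plot makes visible.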
Jeff Newmiller
2021-Aug-30 02:47 UTC
[R] Calculate daily means from 5-minute interval data
IMO assuming periodicity is a bad practice for this. Missing timestamps happen too, and there is no reason to build a broken analysis process.

On August 29, 2021 7:09:01 PM PDT, Richard O'Keefe <raoknz at gmail.com> wrote:
>Why would you need a package for this?
>> samples.per.day <- 12*24
>
>That's 12 5-minute intervals per hour and 24 hours per day.
>Generate some fake data.
>
>> x <- rnorm(samples.per.day * 365)
>> length(x)
>[1] 105120
>
>Reshape the fake data into a matrix where each row represents one
>24-hour period.
>
>> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)

-- 
Sent from my phone. Please excuse my brevity.
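
One way to avoid assuming periodicity is to group on the calendar date taken from each timestamp instead of on row position. The following is a minimal base-R sketch, not a tested recipe for USGS files: the data frame `gauge`, its column names `datetime` and `discharge`, the UTC time zone, and the simulated gap are all assumptions made for illustration.

```r
# Hypothetical data: three days of 5-minute readings in UTC.
set.seed(1)
datetime <- seq(as.POSIXct("2021-01-01 00:00", tz = "UTC"),
                by = "5 min", length.out = 288 * 3)
gauge <- data.frame(datetime  = datetime,
                    discharge = rnorm(length(datetime), mean = 50, sd = 5))

# Mimic missing timestamps: drop 100 readings that all fall on day 2.
gauge <- gauge[-(300:399), ]

# Group by the date of each timestamp, not by position in the file.
gauge$day <- as.Date(gauge$datetime, tz = "UTC")
daily <- data.frame(
  day  = sort(unique(gauge$day)),
  mean = as.vector(tapply(gauge$discharge, gauge$day, mean)),
  sd   = as.vector(tapply(gauge$discharge, gauge$day, sd)),
  n    = as.vector(tapply(gauge$discharge, gauge$day, length))
)
print(daily)
```

Keeping the count `n` next to each mean and standard deviation makes short days visible, so they can be inspected or excluded rather than silently averaged.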
On Mon, 30 Aug 2021, Richard O'Keefe wrote:

> Why would you need a package for this?
>> samples.per.day <- 12*24
>
> That's 12 5-minute intervals per hour and 24 hours per day.
> Generate some fake data.

Richard,

The problem is that there are days with fewer than 12 recorded values for
various reasons.

When testing algorithms I use small subsets of actual data rather than fake
data.

Thanks for your detailed procedure.

Regards,

Rich
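
Before summarising, it may help to find out which days are short. The sketch below counts readings per calendar day and lists the days that fall below the full complement of 288; the timestamps, the simulated 3-hour outage, and the choice of UTC are hypothetical, and real gauge files would need their own parsing step.

```r
# Hypothetical timestamps: two days at 5-minute spacing, UTC.
datetime <- seq(as.POSIXct("2021-06-01 00:00", tz = "UTC"),
                by = "5 min", length.out = 288 * 2)

# Mimic a gauge outage: drop 3 hours (36 readings) from the first day.
datetime <- datetime[-(100:135)]

expected   <- 12 * 24                               # readings in a complete day
per.day    <- table(as.Date(datetime, tz = "UTC"))  # readings actually present
incomplete <- names(per.day)[per.day < expected]

# With gaps, the total is generally not a multiple of 288, so reshaping by
# position with matrix(..., ncol = 288) would not line rows up with days.
not.multiple <- length(datetime) %% expected != 0

print(per.day)
print(incomplete)
```

Whether a short day should be dropped, flagged, or summarised anyway is a judgment call that depends on how many readings are missing and when.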