I have a year's hydraulic data (discharge, stage height, velocity, etc.) from a USGS monitoring gauge recording values every 5 minutes. The data files contain 90K-93K lines and plotting all these data would produce a solid block of color. What I want are the daily means and standard deviation from these data. As an occasional R user (depending on project needs) I've no idea what packages could be applied to these data frames. There likely are multiple paths to extracting these daily values so summary statistics can be calculated and plotted. I'd appreciate suggestions on where to start to learn how I can do this. TIA, Rich
Hi Rich, Your request is a bit open-ended but here's a suggestion that might help get you an answer. Provide dummy data (e.g. 5-10 lines), say like the contents of a csv file, and calculate by hand what you'd like to see in the plot. (And describe what the plot would look like.) It sounds like what you want could be done in a few lines of R code which would work both on the dummy data and the real data. HTH, Eric
Jeff Newmiller
2021-Aug-29 16:23 UTC
[R] Calculate daily means from 5-minute interval data
The general idea is to create a "grouping" column with repeated values for each day, and then to use aggregate to compute your combined results. The dplyr package's group_by/summarise functions can also do this, and there are also proponents of the data.table package which is high performance but tends to depend on altering data in-place unlike most other R data handling functions. Also pay attention to missing data... if you have any then you will need to consider whether you want the strictness of na.rm=FALSE or permissiveness of na.rm=TRUE for your aggregation functions.
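A minimal sketch of the grouping-column idea in base R, for anyone following along. The data frame `gauge`, its columns `datetime` and `cfs`, and the UTC time zone are all hypothetical stand-ins, not the layout of the actual USGS files. It uses the non-formula interface of aggregate so that the na.rm choice is visible (the formula interface drops rows with missing values by default).

```r
# Hypothetical toy data: two days of 5-minute readings (288 per day)
datetime <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
                by = "5 min", length.out = 2 * 288)
set.seed(42)
gauge <- data.frame(datetime = datetime,
                    cfs = rnorm(length(datetime), mean = 100, sd = 5))
gauge$cfs[10] <- NA  # one missing reading on the first day

# The "grouping" column: the same Date value repeated for every row of a day
gauge$day <- as.Date(gauge$datetime, tz = "UTC")

# Strict: any day containing an NA gets an NA mean (na.rm = FALSE is the default)
daily_strict <- aggregate(gauge["cfs"], by = list(day = gauge$day), FUN = mean)

# Permissive: missing readings are dropped before summarising
daily_mean <- aggregate(gauge["cfs"], by = list(day = gauge$day),
                        FUN = mean, na.rm = TRUE)
daily_sd <- aggregate(gauge["cfs"], by = list(day = gauge$day),
                      FUN = sd, na.rm = TRUE)
```

Each result has one row per day, so daily_mean and daily_sd can be plotted against their day column.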
Andrew Simmons
2021-Aug-29 17:13 UTC
[R] Calculate daily means from 5-minute interval data
Hello,

I would suggest something like:

# one row per 5-minute reading for every day of 2020
date <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), 1)
time <- sprintf("%02d:%02d", rep(0:23, each = 12), seq.int(0, 55, 5))
x <- data.frame(
    date = rep(date, each = length(time)),
    time = time
)
x$cfs <- stats::rnorm(nrow(x))  # fake discharge values

cols2aggregate <- "cfs"  # add more as necessary

# split the selected columns by day, then summarise each piece
S <- split(x[cols2aggregate], x$date)
means <- do.call("rbind", lapply(S, colMeans, na.rm = TRUE))
sds <- do.call("rbind", lapply(S, function(xx) sapply(xx, sd, na.rm = TRUE)))
Richard O'Keefe
2021-Aug-30 02:09 UTC
[R] Calculate daily means from 5-minute interval data
Why would you need a package for this?

> samples.per.day <- 12*24

That's 12 5-minute intervals per hour and 24 hours per day.
Generate some fake data.

> x <- rnorm(samples.per.day * 365)
> length(x)
[1] 105120

Reshape the fake data into a matrix where each row represents one
24-hour period.

> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)

Now we can summarise the rows any way we want.
The basic tool here is ?apply.
?rowMeans is said to be faster than using apply to calculate means,
so we'll use that. There is no *rowSds so we have to use apply
for the standard deviation. I use ?head because I don't want to
post tens of thousands of meaningless numbers.

> head(rowMeans(m))
[1] -0.03510177  0.11817337  0.06725203 -0.03578195 -0.02448077 -0.03033692
> head(apply(m, MARGIN=1, FUN=sd))
[1] 1.0017718 0.9922920 1.0100550 0.9956810 1.0077477 0.9833144

Now whether this is a *sensible* way to summarise your flow data is a
question that a hydrologist would be better placed to answer. I would
have started with

> plot(density(x))

which I just did with some real river data (only a month of it, sigh).
Very long tail. Even

> plot(density(log(r)))

shows a very long tail. Time to plot the data against time. Oh my!
All of the long tail came from a single event. There's a period of low
flow, then there's a big rainstorm and the flow goes WAY up, then over
about two days the flow subsides to a new somewhat higher level.

None of this is reflected in means or standard deviations. This is
*time series* data, and time series data of a fairly special kind.

One thing that might be helpful with your data would simply be

> image(log(m))

For my one month sample, the spike showed up very clearly that way.
Because right now, your first task is to get an idea of what the data
look like, and means-and-standard-deviations won't really do that.

Oh heck, here's another reason to go with image(log(m)). With image(m)
I just see the one big spike. With image(log(m)), I can see that little
spikes often start in the afternoon of one day and continue into the
morning of the next. From daily means, it looks like two unusual, but
not very unusual, days. From the image, it's clearly ONE rainfall event
that just happens to straddle a day boundary.

This is all very basic stuff, which is really the point. You want to
use elementary tools to look at the data before you reach for fancy ones.
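One caveat on the reshaping approach, offered as a sketch and not as part of the message above: matrix(x, ncol=samples.per.day, byrow=TRUE) only lines rows up with days when every day has exactly 288 readings and the series starts at midnight. The toy check below (three hypothetical complete days, with a made-up day index) shows that the matrix route and an explicit grouping index give the same daily statistics in that case; with gaps in the record, only the grouping route stays aligned.

```r
# Hypothetical toy series: 3 complete days of 5-minute samples, no gaps
samples.per.day <- 12 * 24
set.seed(1)
x <- rnorm(samples.per.day * 3)

# Matrix route: one row per 24-hour period
m <- matrix(x, ncol = samples.per.day, byrow = TRUE)
day.mean.matrix <- rowMeans(m)
day.sd.matrix <- apply(m, MARGIN = 1, FUN = sd)

# Grouping route: an explicit day index, which would still work
# if some days had fewer than 288 readings
day <- rep(1:3, each = samples.per.day)
day.mean.group <- as.vector(tapply(x, day, mean))
day.sd.group <- as.vector(tapply(x, day, sd))

# The two routes agree when the record is complete
agree <- isTRUE(all.equal(day.mean.matrix, day.mean.group)) &&
  isTRUE(all.equal(day.sd.matrix, day.sd.group))
```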
On Mon, 30 Aug 2021, Richard O'Keefe wrote:

>> x <- rnorm(samples.per.day * 365)
>> length(x)
> [1] 105120
>
> Reshape the fake data into a matrix where each row represents one
> 24-hour period.
>
>> m <- matrix(x, ncol=samples.per.day, byrow=TRUE)

Richard,

Now I understand the need to keep the date and time as a single datetime
column; separately, dplyr's summarize() provides daily means (too many data
points to plot over 3-5 years). I reformatted the data to provide a
sampledatetime column and a values column.

If I correctly understand the output of as.POSIXlt, each date and time
element is separate, so input such as 2016-03-03 12:00 would now be
2016 03 03 12 00 (I've not read how the elements are separated). (The TZ is
not important because all data are either PST or PDT.)

> Now we can summarise the rows any way we want.
> The basic tool here is ?apply.
> ?rowMeans is said to be faster than using apply to calculate means,
> so we'll use that. There is no *rowSds so we have to use apply
> for the standard deviation. I use ?head because I don't want to
> post tens of thousands of meaningless numbers.

If I create a matrix using the above syntax, the resulting rows contain all
recorded values for a specific day. What would be the syntax to collect all
values for each month? This would result in 12 rows per year; the periods of
record for the five variables available from that gauge station vary in
length.

Regards,

Rich
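On the monthly question, one possible sketch (not a reply from the thread): months differ in length, so a fixed-width matrix does not fit, but a year-month grouping label does. The four-row data frame below is made up; the column names sampledatetime and values follow the description above, and tz = "UTC" is only a placeholder assumption. It also shows that as.POSIXlt keeps the value as a single datetime whose pieces are reached as named components.

```r
# Hypothetical two-column layout: sampledatetime and values
gauge <- data.frame(
  sampledatetime = c("2016-03-03 12:00", "2016-03-03 12:05",
                     "2016-04-01 00:00", "2016-04-01 00:05"),
  values = c(10, 14, 20, 30)
)

# Parse the text into one datetime column (time zone is an assumption here)
dt <- as.POSIXct(gauge$sampledatetime, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Grouping label: one value per calendar month, e.g. "2016-03"
gauge$month <- format(dt, "%Y-%m")

# One mean and one standard deviation per month, whatever its length
monthly_mean <- tapply(gauge$values, gauge$month, mean, na.rm = TRUE)
monthly_sd <- tapply(gauge$values, gauge$month, sd, na.rm = TRUE)

# as.POSIXlt exposes the elements as components of one object
lt <- as.POSIXlt(dt)
first_hour <- lt$hour[1]         # hour of day, 0-23
first_month <- lt$mon[1] + 1     # mon is 0-based, so add 1
first_year <- lt$year[1] + 1900  # year is stored as years since 1900
```

The same month label could be passed to aggregate or to dplyr's group_by in place of a day column.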