Richard O'Keefe
2021-Aug-31 05:11 UTC
[R] Calculate daily means from 5-minute interval data
By the time you get the data from the USGS, you are already far past
the point where what the instruments can write is important.
(Obviously an instrument can be sufficiently broken that it cannot
write anything.) The data for Rogue River that I just downloaded
include this comment:

# Data for the following 1 site(s) are contained in this file
#    USGS 04118500 ROGUE RIVER NEAR ROCKFORD, MI
# -----------------------------------------------------------------------------------
#
# Data provided for site 04118500
#            TS   parameter     Description
#         71932       00060     Discharge, cubic feet per second
#
# Data-value qualification codes included in this output:
#     A  Approved for publication -- Processing and review completed.
#     P  Provisional data subject to revision.
#     e  Value has been estimated.
#
agency_cd    site_no    datetime    tz_cd    71932_00060    71932_00060_cd
5s           15s        20d         6s       14n            10s

(I do not know what the last line signifies.) It is, I think,
sufficiently clear that the instrument does not know what the
qualification code is!

After using read.delim to read the file, I note that the timestamps
are in a single column, formatted like "2020-08-30 00:15", matching
the pattern "%Y-%m-%d %H:%M". After reading the data into R and using

    r$datetime <- as.POSIXct(r$datetime, format="%Y-%m-%d %H:%M",
                             tz=r$tz_cd[1])

(note that as.POSIXct takes a single time zone string, not one per
row, so this relies on all the rows sharing one zone) I get

       agency           site            datetime                    tz
 USGS   :33550   Min.   :4118500   Min.   :2020-08-30 00:00:00   EST:33550
                 1st Qu.:4118500   1st Qu.:2020-11-25 13:33:45
                 Median :4118500   Median :2021-03-08 03:52:30
                 Mean   :4118500   Mean   :2021-03-01 07:05:54
                 3rd Qu.:4118500   3rd Qu.:2021-06-03 12:41:15
                 Max.   :4118500   Max.   :2021-08-30 22:00:00
      flow             qual
 Min.   : 96.5   A  :18052
 1st Qu.:156.0   A:e:  757
 Median :193.0   P  :14741
 Mean   :212.5
 3rd Qu.:237.0
 Max.   :767.0

So for this data set, spanning one year, all the times are in the
same time zone, observations are 15 minutes apart, not 5, and there
are no missing data. This was obviously the wrong data set. Oh well,
picking an epoch such as

    epoch <- min(r$datetime)

and then calculating

    as.numeric(difftime(timestamp, epoch, units="min"))

will give you a minute count from which determining day number and
bucket within day is trivial arithmetic (sketched below).

I have attached a plot of the Rogue River flows which should make it
very clear what I mean by saying that means and standard deviations
are not a good way to characterise this kind of data. The flow is
dominated by a series of "bursts" with a fast onset to a peak and a
slow decay, coming in a range of sizes from quite small to rather
large, separated by gaps of 4 to 45 days. I'd be looking at
- how do I *detect* these bursts? (detecting a peak isn't too hard,
  but the peak is not the onset)
- how do I *characterise* these bursts?
  (and is the onset rate related to the peak size?)
- what's left after taking the bursts out?
- can I relate these bursts to something going on upstream?

My usual recommendation is to start with things available in R out of
the box in order to reduce learning time.
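To make that arithmetic concrete, here is a minimal sketch, assuming
the 15-minute data above with columns r$datetime and r$flow (a
5-minute series would use a bucket width of 5 and 288 buckets per
day instead):

    ## Minute count from the chosen epoch.
    epoch <- min(r$datetime)
    mins  <- as.numeric(difftime(r$datetime, epoch, units="min"))

    ## Day number (1, 2, ...) and 15-minute bucket within the day
    ## (1..96 here; use 5 and 288 for 5-minute data).
    day    <- mins %/% (24*60) + 1
    bucket <- (mins %% (24*60)) %/% 15 + 1

    ## Daily means then drop out of, e.g.,
    daily <- tapply(r$flow, day, mean)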
On Tue, 31 Aug 2021 at 11:34, Rich Shepard <rshepard at appl-ecosys.com> wrote:
>
> On Tue, 31 Aug 2021, Richard O'Keefe wrote:
>
> > I made up fake data in order to avoid showing untested code. It's not part
> > of the process I was recommending. I expect data recorded every N minutes
> > to use NA when something is missing, not to simply not be recorded. Well
> > and good, all that means is that reshaping the data is not a trivial call
> > to matrix(). It does not mean that any additional package is needed or
> > appropriate and it does not affect the rest of the process.
>
> Richard,
>
> The instruments in the gauge pipe don't know to write NA when they're not
> measuring. :-) The outage period varies greatly by location, constituent
> measured, and other unknown factors.
>
> > You will want the POSIXct class, see ?DateTimeClasses. Do you know whether
> > the time stamps are in universal time or in local time?
>
> The data values are not timestamps. There's one column for date, a second
> column for time, and a third column for time zone (P in the case of the
> west coast).
>
> > Above all, it doesn't affect the point that you probably should not
> > be doing any of this.
>
> ? (Doesn't require an explanation.)
>
> Rich

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Rogue River.pdf
Type: application/pdf
Size: 86051 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20210831/30014163/attachment.pdf>
On Tue, 31 Aug 2021, Richard O'Keefe wrote:

> By the time you get the data from the USGS, you are already far past the
> point where what the instruments can write is important.

Richard,

The data are important because they show what's happened in that period of
record. Don't physicians take a medical history from patients even though
those data are far past the point they occurred?

> agency_cd    site_no    datetime    tz_cd    71932_00060    71932_00060_cd
> 5s           15s        20d         6s       14n            10s
>
> (I do not know what the last line signifies.)

The numbers represent the width of each fixed-width field.

> After using read.delim to read the file, I note that the timestamps are in
> a single column, formatted like "2020-08-30 00:15", matching the pattern
> "%Y-%m-%d %H:%M". After reading the data into R and using
>
>     r$datetime <- as.POSIXct(r$datetime, format="%Y-%m-%d %H:%M",
>                              tz=r$tz_cd[1])

And I use emacs to replace the space between columns with commas so the
date and the time are separate.

> So for this data set, spanning one year, all the times are in the same
> time zone, observations are 15 minutes apart, not 5, and there are no
> missing data. This was obviously the wrong data set.

As I provided when I first asked for suggestions:

sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,130000
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000

The recorded values are 5 minutes apart (a sketch of reading this layout
back into R follows below). That data set is immaterial for my project but
perfect when one needs data from that gauge station on the Rogue River.

> The flow is dominated by a series of "bursts" with a fast onset to a peak
> and a slow decay, coming in a range of sizes from quite small to rather
> large, separated by gaps of 4 to 45 days.

And when discharge is controlled by flows through a hydroelectric dam there
is a lot of variability. The pattern is important to fish as well as
regulators.

> I'd be looking at
> - how do I *detect* these bursts? (detecting a peak isn't too hard,
>   but the peak is not the onset)
> - how do I *characterise* these bursts?
>   (and is the onset rate related to the peak size?)
> - what's left after taking the bursts out?
> - can I relate these bursts to something going on upstream?

Well, those questions could be appropriate depending on what questions you
need the data to answer.

Environmental data are quite different from experimental, economic,
financial, and public data (e.g., unemployment, housing costs).

There are always multiple ways to address an analytical need. Thank you for
your contributions.

Stay well,

Rich
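For reference, a minimal sketch of reading that three-column layout back
into R and computing daily means; the file name "rogue.csv" and the Pacific
time zone are placeholders, not part of the original post:

    ## Read the 5-minute sample shown above and combine the separate
    ## date and time columns into one POSIXct timestamp.
    s <- read.csv("rogue.csv", colClasses=c("character","character","numeric"))
    s$when <- as.POSIXct(paste(s$sampdate, s$samptime),
                         format="%Y-%m-%d %H:%M", tz="America/Los_Angeles")

    ## One mean discharge per calendar day.
    daily <- aggregate(cfs ~ sampdate, data=s, FUN=mean)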
Richard O'Keefe
2021-Sep-01 04:05 UTC
[R] Calculate daily means from 5-minute interval data
I wrote:

> > By the time you get the data from the USGS, you are already far past the
> > point where what the instruments can write is important.

Rich Shepard replied:

> The data are important because they show what's happened in that period of
> record. Don't physicians take a medical history from patients even though
> those data are far past the point they occurred?

You have missed the point. The issue is not the temporal distance, but the
fact that the data you have are NOT the raw instrumental data and are NOT
subject to the limitations of the recording instruments. The data you get
from the USGS are not the raw instrumental values, and there is no longer
any good reason for there to be any gaps in them. Indeed, the Rogue River
data I looked at explicitly include some flows labelled "A:e", meaning that
they are NOT the instrumental data at all, but estimated.

> And I use emacs to replace the space between columns with commas so the
> date and the time are separate.

There does not seem to be any good reason for this. As I demonstrated, it
is easy to convert these timestamps to POSIXct form, which is good for
calculating with. If you want to extract year, month, day, &c, by far the
easiest way is to convert to POSIXlt form (so keeping the timestamp as a
single field) and then use $<whatever> to extract the field.

    > n <- as.POSIXlt("2003.04.05 06:07", format="%Y.%m.%d %H:%M", tz="UTC")
    > n
    [1] "2003-04-05 06:07:00 UTC"
    > c(n$year+1900, n$mon+1, n$mday, n$hour, n$min)
    [1] 2003    4    5    6    7

> > The flow is dominated by a series of "bursts" with a fast onset to a peak
> > and a slow decay, coming in a range of sizes from quite small to rather
> > large, separated by gaps of 4 to 45 days.
>
> And when discharge is controlled by flows through a hydroelectric dam there
> is a lot of variability. The pattern is important to fish as well as
> regulators.

And what is important to fish is NOT captured by daily means and standard
deviations. For what it's worth, my understanding is that most of the dams
on the Rogue River have been removed, leaving only the Lost Creek Lake one,
and that this has been good for the fish.

Suppose you have a day when there are 16 hours with no water at all
flowing, then 8 hours with 12 cumecs because a dam upstream is discharging.
Then the daily mean is 4 cumecs, which might look good for fish, but it
wasn't. "Number of minutes below minimum safe level" might be more
interesting for the fish (a sketch of computing that follows below).

From the data we have alone, we cannot tell which bursts are due to
releases from dams and which have other causes. Dam releases are under
human control, storms are not.

Looking at the Rogue River data, plotting daily means
- lowers the peaks
- moves them right
- changes the overall shape
Not severely, mind you, but enough to avoid if you don't have to.

By the way, by far the easiest way to do day-wise summaries, if you really
feel you must, is to start with a POSIXct or POSIXlt column, let's call it
r$when, then

    d <- trunc(difftime(r$when, min(r$when), units="days")) + 1
    m <- aggregate(r$flow, by=list(d), FUN=mean)
    plot(m, type="l")

You can plug in other summary functions, not just mean. Remember:
- for all calculations involving dates and times, prefer using the
  built-in date and time classes to hacking around the problem
- aggregate() is a good way to compute oddball summaries
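To make the "minutes below minimum safe level" idea concrete, here is a
minimal sketch, assuming 15-minute observations in r$when and r$flow as
above; the 150 cfs threshold is invented purely for illustration:

    ## Minutes per day spent below a minimum safe flow.  Each
    ## 15-minute observation is taken to stand for 15 minutes of flow.
    safe  <- 150
    day   <- as.Date(r$when, tz="EST")
    below <- tapply(r$flow < safe, day, function(x) 15 * sum(x))
    head(below)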
> > I'd be looking at
> > - how do I *detect* these bursts? (detecting a peak isn't too hard,
> >   but the peak is not the onset)
> > - how do I *characterise* these bursts?
> >   (and is the onset rate related to the peak size?)
> > - what's left after taking the bursts out?
> > - can I relate these bursts to something going on upstream?
>
> Well, those questions could be appropriate depending on what questions you
> need the data to answer.
>
> Environmental data are quite different from experimental, economic,
> financial, and public data (e.g., unemployment, housing costs).
>
> There are always multiple ways to address an analytical need. Thank you for
> your contributions.
>
> Stay well,
>
> Rich