On Tue, 31 Aug 2021, Richard O'Keefe wrote:

> I made up fake data in order to avoid showing untested code. It's not part
> of the process I was recommending. I expect data recorded every N minutes
> to use NA when something is missing, not to simply not be recorded. Well
> and good, all that means is that reshaping the data is not a trivial call
> to matrix(). It does not mean that any additional package is needed or
> appropriate and it does not affect the rest of the process.

Richard,

The instruments in the gauge pipe don't know to write NA when they're not
measuring. :-) The outage period varies greatly by location, constituent
measured, and other unknown factors.

> You will want the POSIXct class, see ?DateTimeClasses. Do you know whether
> the time stamps are in universal time or in local time?

The data values are not timestamps. There's one column for the date, a
second column for the time, and a third column for the time zone (P, in the
case of the west coast).

> Above all, it doesn't affect the point that you probably should not
> be doing any of this.

? (Doesn't require an explanation.)

Rich
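For what it's worth, a minimal sketch of combining such separate date, time,
and time-zone columns into POSIXct. The column names here are invented, and
"P" is assumed to mean Pacific time (America/Los_Angeles):

    ## Hypothetical layout: one column each for date, time, and time zone.
    d <- data.frame(sample_date = c("2021-08-30", "2021-08-30"),
                    sample_time = c("00:00", "00:05"),
                    tz_cd       = c("P", "P"))

    ## Paste date and time together; as.POSIXct()'s tz argument takes a
    ## single time-zone string, so it is supplied once, not per row.
    d$datetime <- as.POSIXct(paste(d$sample_date, d$sample_time),
                             format = "%Y-%m-%d %H:%M",
                             tz = "America/Los_Angeles")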
Am I seeing an odd aspect to this discussion? There are many ways to solve
problems, and some are favored by some people more than others. All require
some examination of the data so it can be massaged into shape for the
processes that follow.

If you insist on using the matrix method to arrange that each row or column
has the data you want, then, yes, you need to guarantee all your data are
present and in the right order. If some may be missing, you may want to
write a program that generates all possible dates in order and merges them
back in (or into a copy, more likely) so all the missing items are
represented and show up as NA or whatever you want. You may also want to
check that all dates are in order with no duplicates, and anything else
that makes sense; then you are free to ask for the vector to be seen as a
matrix with N columns or rows.

For many, the cleaner solution is to use constructs that are more resistant
to imperfections, or that allow them to be handled better. I would probably
use tidyverse functionality these days but can easily understand people
preferring base R or other packages. I have done similar analyses of real
data gathered from streams: various chemicals and levels taken at various
times and depths, including times when no measurement happened and times
when there was more than one. It is much more robust to use methods like
group_by() and then apply other verbs to the already-grouped data,
especially when the next steps involve making plots with ggplot. It was
rather trivial, for example, to replace multiple measures by the average of
the measures. And many of my plots are faceted by variables, which is not
trivial to do in base R. (A sketch of both approaches follows this message.)

I suggest not falling in love with the first way you think of and trying to
bend everything to fit it. Yes, some methods may be quite a bit more
efficient, but I rarely run into problems even with quite large collections
of data, say a quarter million rows with dozens of columns, including odd
columns like the output of some analysis. And note that the current set of
data may be extended over time, or you may get other data that would not
necessarily work well with a hard-coded method but might easily adjust to a
new one.

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rich Shepard
Sent: Monday, August 30, 2021 7:34 PM
To: R Project Help <r-help at r-project.org>
Subject: Re: [R] Calculate daily means from 5-minute interval data

[quoted message trimmed; it repeats Rich Shepard's message above in full]
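For illustration, a minimal sketch of both routes described above: the base
R "complete the grid so gaps become NA" approach, and a dplyr group_by()
summary that averages duplicate readings. All column names and values below
are invented; adjust to the real layout.

    library(dplyr)

    ## Fake input: irregular 5-minute readings with a gap and one duplicate.
    obs <- data.frame(
      datetime = as.POSIXct(c("2021-08-30 00:00", "2021-08-30 00:05",
                              "2021-08-30 00:05", "2021-08-30 00:20"),
                            format = "%Y-%m-%d %H:%M", tz = "UTC"),
      flow = c(100, 102, 103, 110))

    ## Base R: generate the full 5-minute grid, then merge() so that
    ## missing time steps show up as NA rows.
    grid <- data.frame(datetime = seq(min(obs$datetime), max(obs$datetime),
                                      by = "5 min"))
    full <- merge(grid, obs, by = "datetime", all.x = TRUE)

    ## dplyr: collapse duplicate timestamps to their mean, then take
    ## daily means, carrying the per-day observation count along.
    daily <- obs %>%
      mutate(day = as.Date(datetime)) %>%
      group_by(day, datetime) %>%
      summarise(flow = mean(flow), .groups = "drop") %>%
      group_by(day) %>%
      summarise(mean_flow = mean(flow, na.rm = TRUE), n = n())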
I do not wish to express any opinion on what should be done or how. But...

1. I assume that when data are missing, they are missing -- i.e. simply not
present in the data. So there will possibly be several/many missing rows of
data in succession, corresponding to those times, right? (Apologies for
being a bit dumb about this, but I always need to check that what I think
is blindingly obvious really is.)

2. Do note that when one takes daily averages/sd's/whatever summaries of
data that, because of missingness, may be calculated from possibly quite
different numbers of data points -- are whole days sometimes missing?? --
then all the summaries (e.g. means) are not created equal: summaries
created from more data are more "trustworthy" and should receive
"appropriately" greater weight than those created from fewer. Makes sense,
right? (A small sketch of this follows below.)

So I suspect that this may not be as straightforward as you think -- you
may wish to find a local statistician with some experience in these sorts
of things to help you deal with them. Up to you, of course.

Cheers,
Bert

Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Mon, Aug 30, 2021 at 4:34 PM Rich Shepard <rshepard at appl-ecosys.com> wrote:

[quoted message trimmed; it repeats Rich Shepard's message above in full]
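For illustration only, a minimal sketch of that point about unequal per-day
sample sizes: carry the per-day count along with the mean, and weight by it
when combining days. All names and values below are invented, and weighting
by n is just one simple scheme, not a recommendation.

    ## Fake two days of readings with unequal coverage (3 obs vs 1 obs).
    full <- data.frame(
      datetime = as.POSIXct(c("2021-08-30 00:00", "2021-08-30 00:15",
                              "2021-08-30 00:30", "2021-08-31 00:00"),
                            format = "%Y-%m-%d %H:%M", tz = "UTC"),
      flow = c(100, 110, 120, 200))
    full$day <- as.Date(full$datetime)

    ## Per-day mean and observation count in one aggregate() call.
    daily <- aggregate(flow ~ day, data = full,
                       FUN = function(x) c(mean = mean(x), n = length(x)))
    daily <- do.call(data.frame, daily)   # flatten the matrix column

    ## Grand mean weighting each day by its observation count.
    with(daily, weighted.mean(flow.mean, w = flow.n))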
Richard O'Keefe
2021-Aug-31 05:11 UTC
[R] Calculate daily means from 5-minute interval data
By the time you get the data from the USGS, you are already far past the
point where what the instruments can write is important. (Obviously an
instrument can be sufficiently broken that it cannot write anything.) The
data for the Rogue River that I just downloaded include this comment:

    # Data for the following 1 site(s) are contained in this file
    #    USGS 04118500 ROGUE RIVER NEAR ROCKFORD, MI
    # -----------------------------------------------------------------
    #
    # Data provided for site 04118500
    #    TS      parameter   Description
    #    71932   00060       Discharge, cubic feet per second
    #
    # Data-value qualification codes included in this output:
    #    A  Approved for publication -- Processing and review completed.
    #    P  Provisional data subject to revision.
    #    e  Value has been estimated.
    #
    agency_cd  site_no  datetime  tz_cd  71932_00060  71932_00060_cd
    5s         15s      20d       6s     14n          10s

(I do not know what the last line signifies.) It is, I think, sufficiently
clear that the instrument does not know what the qualification code is!

After using read.delim to read the file, I note that the timestamps are in
a single column, formatted like "2020-08-30 00:15", matching the pattern
"%Y-%m-%d %H:%M". After reading the data into R and using

    ## tz= expects a single time-zone string; it works here because
    ## every row in this file is EST.
    r$datetime <- as.POSIXct(r$datetime, format="%Y-%m-%d %H:%M", tz=r$tz_cd)

I get

       agency            site            datetime                     tz
     USGS:33550   Min.   :4118500   Min.   :2020-08-30 00:00:00   EST:33550
                  1st Qu.:4118500   1st Qu.:2020-11-25 13:33:45
                  Median :4118500   Median :2021-03-08 03:52:30
                  Mean   :4118500   Mean   :2021-03-01 07:05:54
                  3rd Qu.:4118500   3rd Qu.:2021-06-03 12:41:15
                  Max.   :4118500   Max.   :2021-08-30 22:00:00

          flow            qual
     Min.   : 96.5   A  :18052
     1st Qu.:156.0   A:e:  757
     Median :193.0   P  :14741
     Mean   :212.5
     3rd Qu.:237.0
     Max.   :767.0

So for this data set, spanning one year, all the times are in the same time
zone, observations are 15 minutes apart, not 5, and there are no missing
data. This was obviously the wrong data set. Oh well. Picking an epoch such
as

    epoch <- min(r$datetime)

and then calculating

    as.numeric(difftime(timestamp, epoch, units="min"))

will give you a minute count from which determining the day number and the
bucket within the day is trivial arithmetic. (A sketch of that arithmetic
follows below.)

I have attached a plot of the Rogue River flows which should make it very
clear what I mean by saying that means and standard deviations are not a
good way to characterise this kind of data. The flow is dominated by a
series of "bursts" with a fast onset to a peak and a slow decay, coming in
a range of sizes from quite small to rather large, separated by gaps of 4
to 45 days. I'd be looking at

- how do I *detect* these bursts? (detecting a peak isn't too hard, but the
  peak is not the onset)
- how do I *characterise* these bursts? (and is the onset rate related to
  the peak size?)
- what's left after taking the bursts out?
- can I relate these bursts to something going on upstream?

My usual recommendation is to start with things available in R out of the
box, in order to reduce learning time.
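A minimal sketch of that trivial arithmetic, assuming a data frame like the
r read in above with POSIXct datetime and numeric flow columns (the
stand-in rows here are invented; the bucket width is this data set's 15
minutes, not the original 5):

    ## Stand-in for the data frame read above with read.delim().
    r <- data.frame(
      datetime = as.POSIXct(c("2020-08-30 00:00", "2020-08-30 00:15",
                              "2020-08-31 00:00", "2020-08-31 00:15"),
                            format = "%Y-%m-%d %H:%M", tz = "EST"),
      flow = c(100, 104, 150, 154))

    ## Minute count from the epoch, as in the message above.
    epoch <- min(r$datetime)
    mins  <- as.numeric(difftime(r$datetime, epoch, units = "min"))

    ## Integer day number and 15-minute bucket within the day.
    day    <- mins %/% (24 * 60)
    bucket <- (mins %%  (24 * 60)) %/% 15

    ## Daily means then drop out of a tapply() over the day index.
    daily.means <- tapply(r$flow, day, mean, na.rm = TRUE)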
On Tue, 31 Aug 2021 at 11:34, Rich Shepard <rshepard at appl-ecosys.com> wrote:

[quoted message trimmed; it repeats Rich Shepard's message above in full]

[Attachment: "Rogue River.pdf" (application/pdf, 86051 bytes), the plot of
Rogue River flows referred to above:
<https://stat.ethz.ch/pipermail/r-help/attachments/20210831/30014163/attachment.pdf>]