Jeff Newmiller
2021-Aug-29 16:23 UTC
[R] Calculate daily means from 5-minute interval data
The general idea is to create a "grouping" column with repeated values for each day, and then to use aggregate to compute your combined results. The dplyr package's group_by/summarise functions can also do this, and there are also proponents of the data.table package which is high performance but tends to depend on altering data in-place unlike most other R data handling functions. Also pay attention to missing data... if you have any then you will need to consider whether you want the strictness of na.rm=FALSE or permissiveness of na.rm=TRUE for your aggregation functions. On August 29, 2021 8:08:58 AM PDT, Rich Shepard <rshepard at appl-ecosys.com> wrote:>I have a year's hydraulic data (discharge, stage height, velocity, etc.) >from a USGS monitoring gauge recording values every 5 minutes. The data >files contain 90K-93K lines and plotting all these data would produce a >solid block of color. > >What I want are the daily means and standard deviation from these data. > >As an occasional R user (depending on project needs) I've no idea what >packages could be applied to these data frames. There likely are multiple >paths to extracting these daily values so summary statistics can be >calculated and plotted. I'd appreciate suggestions on where to start to >learn how I can do this. > >TIA, > >Rich > >______________________________________________ >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code.-- Sent from my phone. Please excuse my brevity.
On Sun, 29 Aug 2021, Jeff Newmiller wrote:> The general idea is to create a "grouping" column with repeated values for > each day, and then to use aggregate to compute your combined results. The > dplyr package's group_by/summarise functions can also do this, and there > are also proponents of the data.table package which is high performance > but tends to depend on altering data in-place unlike most other R data > handling functions. > > Also pay attention to missing data... if you have any then you will need > to consider whether you want the strictness of na.rm=FALSE or > permissiveness of na.rm=TRUE for your aggregation functions.Jeff, Thank you. Yes, there are missing data as sometimes the equipment fails, or there's some other reason why some samples are missing. Grouping on each day is just what I need. I'll re-learn dplyr and take a look at data.table. Regards, Rich
Jeff Newmiller
2021-Aug-29 16:56 UTC
[R] Calculate daily means from 5-minute interval data
You may find something useful on handling timestamp data here: https://jdnewmil.github.io/ On August 29, 2021 9:23:31 AM PDT, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:>The general idea is to create a "grouping" column with repeated values for each day, and then to use aggregate to compute your combined results. The dplyr package's group_by/summarise functions can also do this, and there are also proponents of the data.table package which is high performance but tends to depend on altering data in-place unlike most other R data handling functions. > >Also pay attention to missing data... if you have any then you will need to consider whether you want the strictness of na.rm=FALSE or permissiveness of na.rm=TRUE for your aggregation functions. > >On August 29, 2021 8:08:58 AM PDT, Rich Shepard <rshepard at appl-ecosys.com> wrote: >>I have a year's hydraulic data (discharge, stage height, velocity, etc.) >>from a USGS monitoring gauge recording values every 5 minutes. The data >>files contain 90K-93K lines and plotting all these data would produce a >>solid block of color. >> >>What I want are the daily means and standard deviation from these data. >> >>As an occasional R user (depending on project needs) I've no idea what >>packages could be applied to these data frames. There likely are multiple >>paths to extracting these daily values so summary statistics can be >>calculated and plotted. I'd appreciate suggestions on where to start to >>learn how I can do this. >> >>TIA, >> >>Rich >> >>______________________________________________ >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>https://stat.ethz.ch/mailman/listinfo/r-help >>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>and provide commented, minimal, self-contained, reproducible code. >-- Sent from my phone. Please excuse my brevity.
On Sun, 29 Aug 2021, Jeff Newmiller wrote:> The general idea is to create a "grouping" column with repeated values for > each day, and then to use aggregate to compute your combined results. The > dplyr package's group_by/summarise functions can also do this, and there > are also proponents of the data.table package which is high performance > but tends to depend on altering data in-place unlike most other R data > handling functions.Jeff, I've read a number of docs discussing dplyr's summerize and group_by functions (including that section of Hadley's 'R for Data Science' book, yet I'm missing something; I think that I need to separate the single sampdate column into colums for year, month, and day and group_by year/month summarizing within those groups. The data are of this format: sampdate,samptime,cfs 2020-08-26,09:30,136000 2020-08-26,09:35,126000 2020-08-26,09:40,130000 2020-08-26,09:45,128000 2020-08-26,09:50,126000 2020-08-26,09:55,125000 2020-08-26,10:00,121000 2020-08-26,10:05,117000 2020-08-26,10:10,120000 My curent script is: -------8<-------------- library('tidyverse') discharge <- read.table('../data/discharge.dat', header = TRUE, sep = ',', stringsAsFactors = TRUE) discharge$sampdate <- as.Date(discharge$sampdate) discharge$cfs <- as.numeric(discharge$cfs, length = 6) # use dplyr.summarize grouped by date # need to separate sampdate into %Y-%M-%D in order to group_by the month? by_month <- discharge %>% group_by(sampdate ... summarize(by_month, exp_value = mean(cfs, na.rm = TRUE), sd(cfs)) ---------------->8-------- and the results are:> str(discharge)'data.frame': 93254 obs. of 3 variables: $ sampdate: Date, format: "2020-08-26" "2020-08-26" ... $ samptime: Factor w/ 728 levels "00:00","00:05",..: 115 116 117 118 123 128 133 138 143 148 ... $ cfs : num 176 156 165 161 156 154 144 137 142 142 ...> ls()[1] "by_month" "discharge"> by_month# A tibble: 93,254 ? 3 # Groups: sampdate [322] sampdate samptime cfs <date> <fct> <dbl> 1 2020-08-26 09:30 176 2 2020-08-26 09:35 156 3 2020-08-26 09:40 165 4 2020-08-26 09:45 161 5 2020-08-26 09:50 156 6 2020-08-26 09:55 154 7 2020-08-26 10:00 144 8 2020-08-26 10:05 137 9 2020-08-26 10:10 142 10 2020-08-26 10:15 142 # ? with 93,244 more rows I don't know why the discharge values are truncated to 3 digits when they're 6 digits in the input data. Suggested readings appreciated, Rich
Jeff Newmiller
2021-Aug-31 23:51 UTC
[R] Calculate daily means from 5-minute interval data
Never use stringsAsFactors on uncleaned data. For one thing you give a factor to as.Date and it tries to make sense of the integer representation, not the character representation. library(dplyr) dta <- read.csv( text "sampdate,samptime,cfs 2020-08-26,09:30,136000 2020-08-26,09:35,126000 2020-08-26,09:40,130000 2020-08-26,09:45,128000 2020-08-26,09:50,126000 2020-08-26,09:55,125000 2020-08-26,10:00,121000 2020-08-26,10:05,117000 2020-08-26,10:10,120000 ", stringsAsFactors = FALSE) dtad <- ( dta %>% group_by( sampdate ) %>% summarise( exp_value = mean(cfs, na.rm = TRUE) , Count = n() ) ) On August 31, 2021 2:11:05 PM PDT, Rich Shepard <rshepard at appl-ecosys.com> wrote:>On Sun, 29 Aug 2021, Jeff Newmiller wrote: > >> The general idea is to create a "grouping" column with repeated values for >> each day, and then to use aggregate to compute your combined results. The >> dplyr package's group_by/summarise functions can also do this, and there >> are also proponents of the data.table package which is high performance >> but tends to depend on altering data in-place unlike most other R data >> handling functions. > >Jeff, > >I've read a number of docs discussing dplyr's summerize and group_by >functions (including that section of Hadley's 'R for Data Science' book, yet >I'm missing something; I think that I need to separate the single sampdate >column into colums for year, month, and day and group_by year/month >summarizing within those groups. > >The data are of this format: >sampdate,samptime,cfs >2020-08-26,09:30,136000 >2020-08-26,09:35,126000 >2020-08-26,09:40,130000 >2020-08-26,09:45,128000 >2020-08-26,09:50,126000 >2020-08-26,09:55,125000 >2020-08-26,10:00,121000 >2020-08-26,10:05,117000 >2020-08-26,10:10,120000 > >My curent script is: > >-------8<-------------- >library('tidyverse') > >discharge <- read.table('../data/discharge.dat', header = TRUE, sep = ',', stringsAsFactors = TRUE) >discharge$sampdate <- as.Date(discharge$sampdate) >discharge$cfs <- as.numeric(discharge$cfs, length = 6) > ># use dplyr.summarize grouped by date > ># need to separate sampdate into %Y-%M-%D in order to group_by the month? >by_month <- discharge %>% > group_by(sampdate ... >summarize(by_month, exp_value = mean(cfs, na.rm = TRUE), sd(cfs)) >---------------->8-------- > >and the results are: > >> str(discharge) >'data.frame': 93254 obs. of 3 variables: > $ sampdate: Date, format: "2020-08-26" "2020-08-26" ... > $ samptime: Factor w/ 728 levels "00:00","00:05",..: 115 116 117 118 123 128 133 138 143 148 ... > $ cfs : num 176 156 165 161 156 154 144 137 142 142 ... >> ls() >[1] "by_month" "discharge" >> by_month ># A tibble: 93,254 ? 3 ># Groups: sampdate [322] > sampdate samptime cfs > <date> <fct> <dbl> > 1 2020-08-26 09:30 176 > 2 2020-08-26 09:35 156 > 3 2020-08-26 09:40 165 > 4 2020-08-26 09:45 161 > 5 2020-08-26 09:50 156 > 6 2020-08-26 09:55 154 > 7 2020-08-26 10:00 144 > 8 2020-08-26 10:05 137 > 9 2020-08-26 10:10 142 >10 2020-08-26 10:15 142 ># ? with 93,244 more rows > >I don't know why the discharge values are truncated to 3 digits when they're >6 digits in the input data. > >Suggested readings appreciated, > >Rich > >______________________________________________ >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code.-- Sent from my phone. Please excuse my brevity.