The data file begins this way:

year,month,day,hour,min,fps
2016,03,03,12,00,1.74
2016,03,03,12,10,1.75
2016,03,03,12,20,1.76
2016,03,03,12,30,1.81
2016,03,03,12,40,1.79
2016,03,03,12,50,1.75
2016,03,03,13,00,1.78
2016,03,03,13,10,1.81

The script to process it:

library('tidyverse')
vel <- read.csv('../data/water/vel.dat', header = TRUE, sep = ',',
                stringsAsFactors = FALSE)
vel$year <- as.integer(vel$year)
vel$month <- as.integer(vel$month)
vel$day <- as.integer(vel$day)
vel$hour <- as.integer(vel$hour)
vel$min <- as.integer(vel$min)
vel$fps <- as.double(vel$fps, length = 6)

# use dplyr to filter() by year, month, day; summarize() to get monthly
# means
vel_by_month = vel %>%
    group_by(year, month) %>%
    summarize(flow = mean(fps, na.rm = TRUE))

R's display after running the script:

> source('vel.R')
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Warning messages:
1: In eval(ei, envir) : NAs introduced by coercion
2: In eval(ei, envir) : NAs introduced by coercion
3: In eval(ei, envir) : NAs introduced by coercion

The dataframe created by the read.csv() command:

> head(vel)
  year month day hour min  fps
1 2016     3   3   12   0 1.74
2 2016     3   3   12  10 1.75
3 2016     3   3   12  20 1.76
4 2016     3   3   12  30 1.81
5 2016     3   3   12  40 1.79
6 2016     3   3   12  50 1.75

and the resulting grouping:

> vel_by_month
# A tibble: 67 × 3
# Groups:   year [8]
    year month   flow
   <int> <int>  <dbl>
 1     0    NA NaN
 2  2016     3   2.40
 3  2016     4   3.00
 4  2016     5   2.86
 5  2016     6   2.51
 6  2016     7   2.18
 7  2016     8   1.89
 8  2016     9   1.38
 9  2016    10   1.73
10  2016    11   2.01
# … with 57 more rows

I cannot find why line 1 is there. Other data sets don't produce this
result.

TIA,

Rich
Before you create vel_by_month you can check vel for NAs and NaNs by

sum(is.na(vel))
sum(unlist(lapply(vel,is.nan)))

HTH,
Eric

On Tue, Sep 14, 2021 at 6:21 PM Rich Shepard <rshepard at appl-ecosys.com> wrote:

> I cannot find why line 1 is there. Other data sets don't produce this
> result.
>
> TIA,
>
> Rich
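The NA counts suggested above say how many values are missing but not which records they come from. A minimal sketch of one way to find the offending rows follows. It assumes the real file is unavailable, so the inline sample and its malformed last record are hypothetical stand-ins for whatever row triggers the coercion warnings.

```r
# Hypothetical sample: the last record is deliberately malformed to stand in
# for the unknown row in the real vel.dat that causes the coercion warnings.
txt <- "year,month,day,hour,min,fps
2016,03,03,12,00,1.74
2016,03,03,12,10,1.75
0,--,--,--,--,--"

# Read every column as character so that nothing is converted silently.
raw <- read.csv(text = txt, header = TRUE, colClasses = "character")

# Convert each column to numeric; warnings are suppressed here because
# the resulting NAs are inspected explicitly in the next step.
num <- suppressWarnings(as.data.frame(lapply(raw, as.numeric)))

# Indices of rows in which at least one field failed to convert.
bad_rows <- which(!complete.cases(num))

# Show the offending records exactly as they were read from the file.
# With a one-line header and no blank lines, file line = row index + 1.
print(raw[bad_rows, ])
```

On the real data, the same steps would use the file path in place of the text argument. Reading as character first keeps the original field contents visible, which the as.integer() calls in the script would otherwise replace with NA.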
Rich,

I reproduced your problem after re-arranging the code the mailer mangled.

I tried variations like not using pipes or changing what it is grouped by,
and they all show your results on the abbreviated data, with the message:

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

I think I fixed summarise(), but it makes me wonder if there is an
inconsistency introduced along the way, as what you used is supposed to work
and has worked for me in the past.

I note the man page for summarise() mentions that the .groups="..." argument
is experimental, and it is a tad confusing.

I changed your code to this, telling it to keep the grouping in the output
the same:

vel_by_month = vel %>%
    group_by(year, month) %>%
    summarise(flow = mean(fps, na.rm = TRUE), .groups="keep")

The change from your code is the addition at the very end of the
.groups="keep" argument.

Since I used your limited data, this is all I get:

> vel_by_month
# A tibble: 1 x 3
# Groups:   year, month [1]
   year month  flow
  <int> <int> <dbl>
1  2016     3  1.77

For now, all I did was shut summarise() up. Not having the rest of your data,
the question is where your NA and NaN are introduced. If the change I made
above does not resolve it, then, as others suggested, begin by looking at
your data more carefully, perhaps starting with the .CSV file and then the
data structures in R, along the lines of what you were shown. I find the
table() function useful for categorical data with limited choices, as it
would show the anomaly as happening once.

I see your point about needing fresh eyes. My eyes do not see what you did
wrong, but I am just following clues you may be ignoring.
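The two points above (table() exposes a one-off value, and the .groups argument only quiets the message) can be sketched as follows. This is an illustration under assumptions: the small data frame is hypothetical, with one malformed record (year 0, missing month and fps) standing in for the unknown bad row in the full file.

```r
library(dplyr)

# Hypothetical data: three good readings plus one malformed record,
# standing in for whatever row produces the "0 NA NaN" group.
vel <- data.frame(
  year  = c(2016L, 2016L, 2016L, 0L),
  month = c(3L, 3L, 4L, NA),
  fps   = c(1.74, 1.75, 1.76, NA)
)

# table() with useNA = "ifany" also counts NA values when present,
# so a value that occurs only once, such as year 0, stands out.
year_counts <- table(vel$year, useNA = "ifany")
print(year_counts)

# .groups = "drop" silences the message and returns an ungrouped tibble,
# but the group built from the malformed record is still in the result.
vel_by_month <- vel %>%
  group_by(year, month) %>%
  summarise(flow = mean(fps, na.rm = TRUE), .groups = "drop")
print(vel_by_month)
```

In this sketch the odd group survives whichever .groups value is chosen, which is consistent with the remark that the argument only changes the grouping of the output. Removing the row itself would mean fixing or filtering the bad record before group_by().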
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rich Shepard
Sent: Tuesday, September 14, 2021 11:21 AM
To: r-help at r-project.org
Subject: [R] Need fresh eyes to see what I'm missing

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.