On Tue, 14 Sep 2021, Bert Gunter wrote:

> Remove all your as.integer() and as.double() coercions. They are
> unnecessary (unless you are preparing input for C code; also, all R
> non-integers are double precision) and may be the source of your problems.

Bert,

When I remove coercions the script produces warnings like this:
1: In mean.default(fps, na.rm = TRUE) :
  argument is not numeric or logical: returning NA

and str(vel) displays this:
'data.frame':   565675 obs. of  6 variables:
 $ year : chr  "2016" "2016" "2016" "2016" ...
 $ month: int  3 3 3 3 3 3 3 3 3 3 ...
 $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ hour : chr  "12" "12" "12" "12" ...
 $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
 $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

so month, day, and min are recognized as integers but year, hour, and fps
are seen as characters. I don't understand why.

Regards,

Rich
Rich,

I have to wonder about how your data was placed in the CSV file based on
what you report. Functions like read.table() (which is called by read.csv())
ultimately make guesses about what number of columns to expect and what the
contents are likely to be. They look at the first few lines to decide how
many columns there are, and they make the most compatible choice of type for
each column based on its contents.

The fact that it shows this:

'data.frame':   565675 obs. of  6 variables:
 $ year : chr  "2016" "2016" "2016" "2016" ...
 $ month: int  3 3 3 3 3 3 3 3 3 3 ...
 $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ hour : chr  "12" "12" "12" "12" ...
 $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
 $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

is odd. It suggests that somewhere in the data an entry in the year column
was not a plain integer like 2016 but something non-numeric, such as a word
like `missing`. Something similar seems to have happened with hour and fps
but not the rest.

Nonetheless, you did convert back to what you wanted, BUT if a single
anomalous entry remains, then as.integer("missing") returns NA and
as.double("missing") also returns NA. So it is wise to check for any
unexpected values. If the source cannot be changed, then the R program can
filter out such cases from your data.frame in various ways.

Your way of reading the CSV in was this:

vel <- read.csv('../data/water/vel.dat', header = TRUE, sep = ',', stringsAsFactors = FALSE)

The header = TRUE and sep = ',' options you added are already the defaults
for read.csv(), so they are harmless. The default now (since R 4.0.0) is
also not to read strings in as factors. But what you did not include may be
something you can look at, given your data may be a bit off.

Without the underlying file, we cannot trivially diagnose what may be wrong
in it. Do you get any error messages when reading in the file?
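One way to do the check suggested above is to list every entry that is present but does not convert to a number. This is only a sketch: the inline data and the helper name find_bad() are made up for illustration, since the real vel.dat is not available here.

```r
# Hypothetical stand-in for the real file: one bad entry in each of
# year, hour, and fps; the other columns are clean.
csv_text <- "year,month,day,hour,min,fps
2016,3,3,12,0,1.74
2016,3,3,12,10,1.75
missing,3,3,12,20,1.76
2016,3,3,1a,30,1.81
2016,3,3,12,40,n/a"
vel <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# find_bad() is an illustrative helper, not part of base R.
# It returns the data-frame row and raw text of every entry that is
# not already NA but becomes NA when converted with as.numeric().
find_bad <- function(x) {
  x <- as.character(x)
  converted <- suppressWarnings(as.numeric(x))
  bad <- is.na(converted) & !is.na(x)
  data.frame(row = which(bad), value = x[bad], stringsAsFactors = FALSE)
}

# Check only the columns that were unexpectedly read as character.
bad_entries <- lapply(vel[c("year", "hour", "fps")], find_bad)
print(bad_entries)
```

The row numbers refer to rows of the data frame, so the matching line in the file is one higher because of the header line.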
You can specify additional arguments to read.csv() about what, if any,
quoting characters are used, what sequences should be recognized as an NA,
suggestions of what type each column should be assumed to be, what to do
with blank lines, what a comment looks like and so on.

One thing I sometimes have had to do is open the original CSV file in Excel
and examine it in various ways or even change it and save it again. That is
beyond the scope of this mailing list so if needed, ask me in private. You
have been working on this kind of stuff, but I assume often using other
tools outside R and dplyr.

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rich Shepard
Sent: Tuesday, September 14, 2021 11:49 AM
To: R mailing list <r-help at r-project.org>
Subject: Re: [R] Need fresh eyes to see what I'm missing

On Tue, 14 Sep 2021, Bert Gunter wrote:

> Remove all your as.integer() and as.double() coercions. They are
> unnecessary (unless you are preparing input for C code; also, all R
> non-integers are double precision) and may be the source of your problems.

Bert,

When I remove coercions the script produces warnings like this:
1: In mean.default(fps, na.rm = TRUE) :
  argument is not numeric or logical: returning NA

and str(vel) displays this:
'data.frame':   565675 obs. of  6 variables:
 $ year : chr  "2016" "2016" "2016" "2016" ...
 $ month: int  3 3 3 3 3 3 3 3 3 3 ...
 $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ hour : chr  "12" "12" "12" "12" ...
 $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
 $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

so month, day, and min are recognized as integers but year, hour, and fps
are seen as characters. I don't understand why.
Regards,

Rich

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
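The extra read.csv() arguments mentioned in the reply above (NA strings, column types, blank lines) might be combined roughly as follows. This is a sketch under assumptions: the inline data stands in for the real file, and "missing" is a made-up NA marker, since nobody on the list has seen what the bad entries actually look like.

```r
# Hypothetical stand-in for the real file, with one made-up NA marker.
csv_text <- "year,month,day,hour,min,fps
2016,3,3,12,0,1.74
2016,3,3,12,10,missing
2016,3,3,12,20,1.76"

vel <- read.csv(
  text = csv_text,
  na.strings = c("NA", "missing"),   # strings to treat as NA (assumed marker)
  colClasses = c(year = "integer", month = "integer", day = "integer",
                 hour = "integer", min = "integer", fps = "numeric"),
  strip.white = TRUE,                # drop stray blanks around fields
  blank.lines.skip = TRUE            # ignore empty lines
)

str(vel)
mean(vel$fps, na.rm = TRUE)
```

With colClasses given explicitly, an entry that is neither a valid number nor listed in na.strings makes read.csv() stop with an error that names the offending text, which is one way to locate the bad entry.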
Input problems of this sort are often caused by stray or extra characters
(commas, dashes, etc.) in the input files, which then can trigger automatic
conversion to character. Excel files are somewhat notorious for this.

A couple of comments, and then I'll quit, as others should have greater
insight (and may correct any of my errors).

1.
> as.numeric("1,")
[1] NA
Warning message:
NAs introduced by coercion

So if a stray character caused your "numeric" input to be read in as
character, and you then converted it with as.numeric() (do not use
as.integer or as.double), you get that NA and warning.

2. So I would say that you need to check those columns in your data frame
that were read in as character instead of numeric. I'd also check the others
with unique() or some such just to make sure they have the handful of right
values. One way of doing this would be to look for NA's in as.numeric, as
above. But I thought you said you did this already and found none, so I
don't get it. Other approaches would be to examine your .csv file with
?count.fields or try reading it with ?read.delim. Any discrepancies or
errors you get from these may help you to pinpoint problems like stray
characters, too many fields in a line, etc.

3. As for your "fps as factors" question, note that:
> as.numeric(factor("3"))
[1] 1

So it depends on how you read stuff in. The answer should be "no" with
read.csv(..., stringsAsFactors = FALSE), but I'm not sure what all you did
or what kind of junk in your .csv file may be causing R to misread the
numeric data as character.

As I said, others may be wiser and correct any errors in my "advice." This
is as far as I can go -- and it may already be too far.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
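The count.fields() suggestion in Bert's point 2 could look something like the sketch below. The lines written to the temporary file are invented for illustration (one with a stray comma, one with a missing value); with the real data, the path to vel.dat would be passed instead.

```r
# Hypothetical file contents; the real data would come from vel.dat.
csv_lines <- c(
  "year,month,day,hour,min,fps",
  "2016,3,3,12,0,1.74",
  "2016,3,3,12,10,1,75",   # stray comma instead of a decimal point
  "2016,3,3,12,20",        # last value missing
  "2016,3,3,12,30,1.81"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Number of comma-separated fields on each line of the file.
n_fields <- count.fields(tmp, sep = ",")

# Treat the header's field count as the expected one and report the rest.
expected <- n_fields[1]
bad_lines <- which(n_fields != expected)
print(data.frame(line = bad_lines,
                 fields = n_fields[bad_lines],
                 text = csv_lines[bad_lines]))

unlink(tmp)
```

Here the line numbers are lines of the file itself, header included, so they can be looked up directly in a text editor.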