thr3ads.net - R help - [R] Need fresh eyes to see what I'm missing [Sep 2021]


Rich Shepard

2021-Sep-14 15:48 UTC

[R] Need fresh eyes to see what I'm missing

On Tue, 14 Sep 2021, Bert Gunter wrote:
> Remove all your as.integer() and as.double() coercions. They are
> unnecessary (unless you are preparing input for C code; also, all R
> non-integers are double precision) and may be the source of your problems.
Bert,

When I remove coercions the script produces warnings like this:
1: In mean.default(fps, na.rm = TRUE) :
   argument is not numeric or logical: returning NA

and str(vel) displays this:
'data.frame':   565675 obs. of  6 variables:
  $ year : chr  "2016" "2016" "2016" "2016" ...
  $ month: int  3 3 3 3 3 3 3 3 3 3 ...
  $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
  $ hour : chr  "12" "12" "12" "12" ...
  $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
  $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

so month, day, and min are recognized as integers but year, hour, and fps
are seen as characters. I don't understand why.

Regards,

Rich

Avi Gross

2021-Sep-14 16:41 UTC


[R] Need fresh eyes to see what I'm missing

Rich,

I have to wonder about how your data was placed in the CSV file based on
what you report.

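Functions like read.table() (which is called by read.csv()) ultimately make
guesses about how many columns to expect and what the contents are likely to
be. The column count is inferred from the first few lines, while the type of
each column is the most specific one compatible with every entry in it, so a
single non-numeric entry anywhere turns the whole column into character. The
fact that it shows this: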

'data.frame':   565675 obs. of  6 variables:
  $ year : chr  "2016" "2016" "2016" "2016" ...
  $ month: int  3 3 3 3 3 3 3 3 3 3 ...
  $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
  $ hour : chr  "12" "12" "12" "12" ...
  $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
  $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

is odd. It suggests that somewhere in the data, the year (or some other
entry) did not appear as a plain integer like 2016 but as something else,
perhaps a value with stray quoting or a word like `missing`.

Something similar seems to have happened with hour and fps but not the rest.
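
A small sketch of how one stray entry can change the inferred type of a whole
column. The inline data and the "n/a" marker are made up for illustration; they
are not taken from Rich's file.

```r
# Two hypothetical inputs that differ only in the last fps entry.
good <- read.csv(text = "year,fps\n2016,1.74\n2016,1.75\n2016,1.76",
                 stringsAsFactors = FALSE)
bad  <- read.csv(text = "year,fps\n2016,1.74\n2016,1.75\n2016,n/a",
                 stringsAsFactors = FALSE)

# Every entry of fps in 'good' parses as a number, so the column is numeric.
sapply(good, class)

# A single non-numeric entry in 'bad' makes the whole fps column character,
# even though it is the last line of the input.
sapply(bad, class)
```

If the real file behaves like the second case, the columns that came in as chr
are the ones worth inspecting for stray values.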

Nonetheless, you did convert back to what you wanted, BUT if even a single
anomalous entry remains, then as.integer("missing") returns NA and so does
as.double("missing"). So it is wise to check for any unexpected NA values
after converting. If the source cannot be changed, then the R program can
filter such cases out of your data.frame in various ways.
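
One possible way to find and filter such entries, as a minimal sketch. The
vector below is a hypothetical stand-in for a column like fps that was read in
as character; the bad values in it are invented.

```r
# Hypothetical character column with two anomalous entries and one real NA.
fps <- c("1.74", "1.75", "missing", "1.81", NA, "1,")

# Convert, silencing the coercion warning that is provoked on purpose here.
fps_num <- suppressWarnings(as.numeric(fps))

# Entries that were present in the input but failed to convert.
bad <- is.na(fps_num) & !is.na(fps)

which(bad)        # positions of the offending entries
unique(fps[bad])  # the distinct offending strings

# Drop the offending entries; genuine NA values are kept.
fps_clean <- fps_num[!bad]
```

Looking at unique(fps[bad]) first is usually worthwhile, since it shows whether
the problem is a missing-value marker, a stray character, or something else.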

Your way of reading the CSV in was this:

vel <- read.csv('../data/water/vel.dat', header = TRUE, sep = ',', stringsAsFactors = FALSE)

The options you added, header = TRUE and sep = ",", are already the defaults
for read.csv(), so they are harmless. Since R 4.0.0 the default is also not to
read strings in as factors. But the arguments you did not include may be
something to look at, given that your data may be a bit off.

Without the underlying file, we cannot easily diagnose what may be wrong in
it. Do you get any error or warning messages when reading in the file? You can
pass additional arguments to read.csv() that specify which quoting characters
are used (quote), which strings should be recognized as NA (na.strings), what
type each column should be assumed to be (colClasses), what to do with blank
lines (blank.lines.skip), what a comment looks like (comment.char), and so on.
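
A hedged sketch of what such a call could look like. The inline text stands in
for the real file, and "missing" as an NA marker is purely an assumption; the
actual file may use something different or nothing at all.

```r
# Hypothetical data with an assumed NA marker and one blank line.
csv_text <- "year,month,day,hour,min,fps
2016,3,3,12,0,1.74
2016,3,3,12,10,missing

2016,3,3,12,20,1.76"

vel <- read.csv(
  text = csv_text,                               # use file = "..." for a real file
  quote = "\"",                                  # quoting character(s) in use
  na.strings = c("NA", "missing"),               # strings to treat as NA
  colClasses = c(rep("integer", 5), "numeric"),  # expected type of each column
  blank.lines.skip = TRUE,                       # ignore empty lines
  comment.char = "#"                             # lines starting with # are comments
)

str(vel)
```

With colClasses given, an entry that is neither a valid number nor listed in
na.strings should make the read stop with an error instead of quietly producing
a character column, which can help locate the problem.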

One thing I have sometimes had to do is open the original CSV file in Excel
and examine it in various ways, or even change it and save it again. That is
beyond the scope of this mailing list, so if needed, ask me in private. You
have been working on this kind of data, but I assume often with tools other
than R and dplyr.

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Bert Gunter

2021-Sep-14 16:59 UTC


[R] Need fresh eyes to see what I'm missing

Input problems of this sort are often caused by stray or extra
characters (commas, dashes, etc.) in the input files, which then can
trigger automatic conversion to character. Excel files are somewhat
notorious for this.

A couple of comments, and then I'll quit, as others should have
greater insight (and may correct any of my errors).

1.
> as.numeric("1,")
[1] NA
Warning message:
NAs introduced by coercion

So if a stray character caused your "numeric" input to be read in as
character, and you then converted it with as.numeric() (do not use
as.integer or as.double), you get that warning, with NA in place of the
offending value.

2. So I would say that you need to check those columns in your data
frame that were read in as character instead of numeric.  I'd also
check the others with unique() or some such just to make sure they
have the handful of right values.

One way of doing this would be to look for NAs in as.numeric, as above.
But I thought you said you did this already and found none, so I don't
get it. Other approaches would be to examine your .csv file with
?count.fields or try reading it with ?read.delim. Any discrepancies or
errors you get from these may help you to pinpoint problems like stray
characters, too many fields in a line, etc.
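
A minimal sketch of the count.fields() check. The temporary file and its
contents are invented for illustration; the fourth line has a stray extra comma.

```r
# Write a small hypothetical CSV to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "year,month,day,hour,min,fps",
  "2016,3,3,12,0,1.74",
  "2016,3,3,12,10,1.75",
  "2016,3,,3,12,20,1.76",   # stray comma gives this line an extra field
  "2016,3,3,12,30,1.81"
), path)

# Number of comma-separated fields on each line, header included.
n_fields <- count.fields(path, sep = ",")
table(n_fields)

# Lines whose field count differs from the header's.
odd_lines <- which(n_fields != n_fields[1])
odd_lines
readLines(path)[odd_lines]

unlink(path)
```

On a file with several hundred thousand lines, table(n_fields) gives a quick
summary, and odd_lines points at the lines to open in an editor.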

3. As for your "fps as factors" question, note that:
> as.numeric(factor("3"))
[1] 1

So it depends on how you read stuff in. The answer should be "no" with
read.csv(..., stringsAsFactors = FALSE), but I'm not sure what all you
did or what kind of junk in your .csv file may be causing R to misread
the numeric data as character.
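
For completeness, a short sketch of the factor pitfall and the usual way around
it. The values are made up; this only matters if a column really did come in as
a factor.

```r
# Hypothetical factor holding numbers stored as labels.
f <- factor(c("3", "10", "3"))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(f)

# Going through character recovers the values the labels represent.
values <- as.numeric(as.character(f))

# Equivalent alternative that converts each level only once.
values2 <- as.numeric(levels(f))[f]

codes
values
```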

As I said, others may be wiser and correct any errors in my "advice."
This is as far as I can go -- and it may already be too far.

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

