thr3ads.net - R help - [R] Failed to convert data to numeric [Mar 2025]

avi.e.gross at gmail.com

2025-Mar-03 17:19 UTC

[R] Failed to convert data to numeric

The second solution Ivan offers looks good, and a bit more general than his
first that simply removes one non-visible character.

It raises the question of why the data has that anomaly at all. Did the data come
from a text-processing environment where it was going to wrap there and was
protected?

As Ivan points out, there is a question of what format you expect numbers in and
what "as.numeric" should do when it does not see an integer or floating point
number.

If you test it, you can see that as.numeric ignores leading and/or trailing
blanks and tabs and even newlines sometimes and some other irrelevant ASCII
characters. In that spirit, the UNICODE character being mentioned should be one
that any UNICODE-aware version of as.numeric should ignore.
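
A quick check of that behaviour, as a minimal sketch: the value 42 is made up for illustration, and "\ufeff" assumes the stray character is U+FEFF (the zero-width no-break space discussed elsewhere in this thread). In the R versions I am aware of, the ASCII whitespace is skipped while the U+FEFF is not.

```r
# Leading and trailing ASCII whitespace is tolerated.
padded  <- as.numeric(" 42 ")
tabbed  <- as.numeric("\t42\n")

# A leading U+FEFF is not treated as whitespace, so the conversion fails.
# Without suppressWarnings() R also reports "NAs introduced by coercion".
x <- "\ufeff42"
n_chars <- nchar(x)                        # one more character than is visible
with_bom <- suppressWarnings(as.numeric(x))

print(c(padded, tabbed, with_bom))
```

So the invisible character has to be dealt with before the conversion, whatever a more UNICODE-aware as.numeric ought to do.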

But UNICODE supports a much wider vision of what is numeric: there are
numeric-equivalent symbols in other languages and groupings, and even something
like the symbols for numerals in light or dark circles count as numbers. Those
can likely safely be excluded in this context, but perhaps not in a more general
function.

But I note as.numeric seems to handle scientific notation as in:

as.numeric("1.23e8")
[1] 1.23e+08

So a single instance of the letter "e" or "E" must be supported if your numbers
in string form may contain one. Further, the E cannot be the first or last
letter. It cannot have adjacent whitespace. Still, if you are OK with getting an
NA in such situations, it should be OK.
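
To make those cases concrete, here is a small sketch with made-up inputs. It only covers the cases I am sure about (a leading "e", and whitespace around the "e"); I have left out a trailing "e" because I would rather you test that yourself than take my word for it.

```r
# Upper-case E is accepted just like lower-case e.
ok <- as.numeric("1.23E8")

# These are not valid numbers as far as as.numeric() is concerned:
# an exponent with no mantissa, and whitespace next to the "e".
bad <- suppressWarnings(as.numeric(c("e5", "1.2 e 5", "1.2e 5")))

print(ok)
print(bad)
```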

It gets worse. Hexadecimal is supported:
> as.numeric("0X12")
[1] 18

You now need to support the letters x and X. But only if preceded by a zero! 

It gets still worse as any characters from [0-9A-Fa-f] are supported:
> as.numeric("0xAE")
[1] 174

There may be other scenarios it handles. The filter applied might remove valid
numbers so you may want to carefully document it if your program only handles a
restricted set.
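
To illustrate how a filter can damage numbers that were fine to begin with, here is a sketch that applies the pattern from Ivan's message (quoted below) to three strings as.numeric already handles. The input vector is invented for the example.

```r
valid <- c("1.23e8", "0x12", "0xAE")

# What as.numeric() makes of the untouched strings.
before <- as.numeric(valid)

# Keep only ASCII digits, dots and hyphen-minus, then convert again.
cleaned <- gsub("[^0-9.-]+", "", valid)
after <- as.numeric(cleaned)

print(data.frame(valid, before, cleaned, after))
```

The "e" and the "x" are thrown away along with everything else, so the strings become "1.238", "012" and "0" and the values change without any warning.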

A possible idea might be to make two passes: run as.numeric() first, and only
for the elements that come back NA do a substitution like Ivan suggests to try
to fix the broken ones. But note it may fix too much, as "1.2 e 5" might become
"1.2e5" once spaces are removed.
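
A rough sketch of the two-pass idea follows. The function name to_numeric_two_pass and the sample vector are my own inventions for illustration, and the default pattern is simply the one from Ivan's message; treat it as a starting point, not a finished tool.

```r
# Hypothetical helper: convert first, then clean only the elements that failed.
to_numeric_two_pass <- function(x, pattern = "[^0-9.-]+") {
  x <- as.character(x)
  out <- suppressWarnings(as.numeric(x))   # pass 1: plain conversion
  failed <- is.na(out) & !is.na(x)         # genuine NA inputs are left alone
  # pass 2: apply the filter only where pass 1 failed
  out[failed] <- suppressWarnings(as.numeric(gsub(pattern, "", x[failed])))
  out
}

# Made-up data: one clean value, one with a leading U+FEFF,
# one hexadecimal, one in scientific notation, one malformed.
dat2 <- c("12.5", "\ufeff3.75", "0xAE", "1.23e8", "1.2 e 5")
res <- to_numeric_two_pass(dat2)
print(res)
```

The hexadecimal and scientific-notation values survive because they never reach the filter. The last element shows the over-fixing: with this particular pattern the "e" is dropped together with the spaces, so "1.2 e 5" comes back as 1.25, which is arguably worse than an NA.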

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Ivan Krylov
via R-help
Sent: Monday, March 3, 2025 3:09 AM
To: Christofer Bogaso <bogaso.christofer at gmail.com>
Cc: r-help <r-help at r-project.org>
Subject: Re: [R] Failed to convert data to numeric

On Mon, 3 Mar 2025 13:21:31 +0530
Christofer Bogaso <bogaso.christofer at gmail.com> wrote:
> Is there any way to remove all possible "Unicode character" that may
> be present in the array at once?
Define a range of characters you consider acceptable, and you'll be
able to use regular expressions to remove everything else. For example,
the following expression should remove everything except ASCII digits,
dots, and hyphen-minus:

gsub('[^0-9.-]+', '', dat2)

There is a brief introduction to regular expressions in ?regex and
various online resources such as <https://regex101.com/>.

-- 
Best regards,
Ivan

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Rolf Turner

2025-Mar-03 21:45 UTC

[R] Failed to convert data to numeric

This issue looks like grist for the R Inferno.

cheers,

Rolf



-- 
Honorary Research Fellow
Department of Statistics
University of Auckland
Stats. Dep't. (secretaries) phone:
         +64-9-373-7599 ext. 89622
Home phone: +64-9-480-4619

Richard O'Keefe

2025-Mar-03 22:33 UTC

[R] Failed to convert data to numeric

The zero-width no-break space character is used as the Byte Order
Mark.  That is, an official function for it at the beginning of a
character sequence is to indicate whether you have 2-byte or 4-byte
big-endian or little-endian encoding.  It was not intended for use in
UTF-8, where there is nothing for it to tell you, but Microsoft jumped
in with all six feet and said "hey, we'll use this to indicate that
it's Unicode in UTF-8 and not one of the hundreds of other 8-bit coded
character sets."  I've lost count of the number of programs that have
choked because they were given a BOM where they didn't expect one.

So there is no great mystery about why there is a BOM at the beginning
of this particular string.  The real mystery is why it was there and
NOT at the beginning of all the others.

I suggest that it is a good idea to remove the BOM character from the
beginning of microsofted strings, but a bad idea to remove any other
character.  If you are given bad data like "Bond-007" when you expect
a number, you want to know about it, and not mistake it for -007.
Still less do you want a phone number like "+61 3 555 1234 x77" to be
mistaken for a plain number "613555123477".
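
A minimal sketch of that suggestion, assuming the BOM is U+FEFF at the very start of the string. The helper name strip_bom and the sample vector are made up for illustration; the two bad entries are the examples above.

```r
# Hypothetical helper: drop a single leading U+FEFF and nothing else.
strip_bom <- function(x) sub("^\ufeff", "", x)

dat <- c("\ufeff12.5", "3.75", "Bond-007", "+61 3 555 1234 x77")

# Convert after removing only the BOM; anything else that is wrong stays NA.
num <- suppressWarnings(as.numeric(strip_bom(dat)))
suspect <- dat[is.na(num)]   # entries that need a human to look at them
print(suspect)

# For comparison, the blanket filter silently turns bad data into numbers.
filtered <- as.numeric(gsub("[^0-9.-]+", "", dat))
print(filtered)
```

With only the BOM removed, "Bond-007" and the phone number are reported as problems. With the blanket filter they come back as -7 and 613555123477, and nothing tells you they were never numbers.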

