Hi R people,
I recently had to work with some old code that uses data.matrix and ran
into behaviour I found unintuitive. We are converting a data frame
containing numerical values stored as character strings to a matrix using
the data.matrix function.
Unfortunately, this does not yield a numeric matrix containing the same
numbers as the original data frame; see the example below:
> df <- data.frame(a=c("1","2","7","10"), b=c("1","7","10","19"),
+                  c=c("a","b","c","a"), d=c("1","7","a","b"))
> df
   a  b c d
1  1  1 a 1
2  2  7 b 7
3  7 10 c a
4 10 19 a b
> data.matrix(df)
     a b c d
[1,] 1 1 1 1
[2,] 3 4 2 2
[3,] 4 2 3 3
[4,] 2 3 1 4
The current implementation of data.matrix iterates over each column in the
dataframe and utilizes the following code to convert a column into
integers:
    if (is.character(xi)) {
        frame[[i]] <- as.integer(as.factor(xi))
        next
    }
While I kind of understand the reasoning here, i.e. you avoid NAs when the
characters are non-numerical, this returns a (to me) unintuitive result
when the data frame contains numerical characters: the integers returned
are factor level codes, not the numbers themselves. This makes the values
of any two columns output from data.matrix very difficult to compare, and
not easily traceable to the original data. Was this really the original
intent behind the function?
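To illustrate where the codes for column a come from, here is a small
sketch using only base R. The variable names x and f are just for this
example; the level order shown is what the output above implies, since
the levels are sorted as strings rather than as numbers.

```r
x <- c("1", "2", "7", "10")
f <- factor(x)

# Levels are sorted as character strings, so "10" comes before "2".
levels(f)      # "1" "10" "2" "7"

# The integer codes are positions within levels(f), not the numbers.
as.integer(f)  # 1 3 4 2
```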
I would like to propose a change in which data.matrix first tries to
convert a character column to integers directly, without the as.factor
intermediary, and falls back to the current implementation only if that
conversion raises a warning.
    if (is.character(xi)) {
        frame[[i]] <- tryCatch({
            as.integer(xi)
        }, warning = function(war) {
            f <- as.integer(factor(xi))
            return(f)
        })
        next
    }
This change results in the following outputs from the data.matrix function
(using the earlier df):
> data.matrix_new(df)
      a  b c d
[1,]  1  1 1 1
[2,]  2  7 2 2
[3,]  7 10 3 3
[4,] 10 19 1 4
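For anyone who wants to try the idea without patching data.matrix, the
following is a minimal standalone sketch of the same per-column logic.
The helper name convert_chr_column is hypothetical and not part of base
R, and stringsAsFactors = FALSE is set explicitly so that the columns
are character vectors regardless of R version.

```r
# Hypothetical helper mirroring the proposed branch; not part of base R.
convert_chr_column <- function(xi) {
  tryCatch(
    as.integer(xi),                                # direct conversion first
    warning = function(w) as.integer(factor(xi))   # fall back to level codes
  )
}

df <- data.frame(
  a = c("1", "2", "7", "10"),
  b = c("1", "7", "10", "19"),
  c = c("a", "b", "c", "a"),
  d = c("1", "7", "a", "b"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column; vapply returns an integer matrix here.
m <- vapply(df, convert_chr_column, integer(nrow(df)))
m
```

One caveat with this sketch: as.integer("1.5") returns 1 without a
warning, so non-integer numeric strings would be truncated silently
rather than triggering the fallback.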
Thanks for considering this!
Best,
Adam Marstrand