Dear Prof. Brian Ripley,
Thank you for your quick reply. It might be that we have two problems here.
(1) BUG in as.data.frame.matrix()
I must admit I was not aware of the not-converting-to-factor
inconsistency here
(more on this at the end of this mail).
(2) BUG in as.matrix.data.frame()
My problem was that as.matrix.data.frame() REALLY CHANGES THE DATA, which
DEFINITELY IS A BUG.
It is totally legal to have a character column in a dataframe:
> x <- as.data.frame(x=I(rep('"', 3)))
as expected it is
> unclass(x)
$x
[1] "\"" "\"" "\""
attr(,"class")
[1] "AsIs"
attr(,"row.names")
[1] "1" "2" "3"
here is a first problem, though this one only affects printing
> x
x
1 \\\"
2 \\\"
3 \\\"
but now look at this
correct:
> x[1,1]
[1] "\""
wrong:
> as.matrix(x)[1,1]
[1] "\\\""
this is not only a printing problem
correct:
> cat(x[1,1], "\n")
"
wrong:
> cat(as.matrix(x)[1,1], "\n")
\"
but DEFINITELY WRONG as can be seen in
> as.matrix(x)[1,1] == x[1,1]
[1] FALSE
The change in the data is caused by as.matrix.data.frame() making use of
format(), and format() behaves as it does.
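A minimal sketch for checking this on a given installation, assuming nothing beyond base R. The names s, formatted and converted are illustrative and not taken from the session above. It compares the raw string with what format() and as.matrix() return, without presuming either result, since the behaviour may differ between versions:

```r
# A one-character string containing only a double quote.
s <- '"'
x <- data.frame(x = I(rep(s, 3)))

formatted <- format(s)                 # what format() makes of the string
converted <- as.vector(as.matrix(x))   # what as.matrix() stores

# TRUE means the value survived unchanged; FALSE means it was altered.
format_preserves    <- identical(formatted, s)
as_matrix_preserves <- all(converted == s)

print(c(format = format_preserves, as.matrix = as_matrix_preserves))
print(nchar(converted))   # one character per element if nothing was escaped
```

If either flag is FALSE, the corresponding function has altered the stored value and not merely its printed form.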
If "that is what R-like languages do", then
(either) this convention about what format() does is not merely sub-optimal
but wrong,
(or) format() MUST NEVER be used within R routines EXCEPT FOR PRINTING,
i.e. formatting something and then storing it or returning it from a
function is dangerous.
Even its use with cat() may be dangerous, since cat() may store data as a
side effect, as in write() or write.table(). This deserves a BIG warning in
the documentation of format().
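To illustrate the second alternative, here is a rough sketch of a conversion that avoids format() altogether and relies on as.character() instead. The helper as_char_matrix() is hypothetical and not part of base R; it is not meant as the actual fix for as.matrix.data.frame(), only as an indication of the idea. It assumes a dataframe with at least one column, each column being a plain vector or factor:

```r
# Hypothetical helper: build a character matrix from a dataframe
# column by column with as.character(), never calling format().
as_char_matrix <- function(df) {
  cols <- lapply(df, as.character)   # factors yield their labels
  m <- do.call(cbind, cols)          # one matrix column per dataframe column
  dimnames(m) <- list(row.names(df), names(df))
  m
}

x <- data.frame(x = I(rep('"', 3)))
m <- as_char_matrix(x)

# The stored string should still be the single quote character.
print(m[1, 1] == '"')
print(nchar(m[1, 1]))
```

Note that as.character() on numeric columns does not pad or align values, which is exactly the point: alignment is a printing concern and is left to print().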
Here is a list of functions of package:base making use of format():
> collect <- character()
> for (i in ls("package:base")){
+   if ( any(grep("format[(]", deparse(get(i, pos="package:base"))))
+     || any(grep("format.default[(]", deparse(get(i, pos="package:base"))))
+   ) collect <- c(collect, i)
+ }
> collect
 [1] "add1.default"          "add1.glm"              "add1.lm"               "all.equal.numeric"
 [5] "anovalist.lm"          "as.matrix.data.frame"  "drop1.default"         "drop1.glm"
 [9] "drop1.lm"              "format.char"           "format.default"        "format.pval"
[13] "help.search"           "hist.default"          "legend"                "ls.print"
[17] "print.aov"             "print.aovlist"         "print.coefmat"         "print.dummy.coef"
[21] "print.glm"             "print.glm.null"        "print.htest"           "print.lm"
[25] "print.mtable"          "print.summary.glm"     "print.summary.lm"      "print.summary.lm.null"
[29] "print.tables.aov"      "print.ts"              "quantile.default"      "step"
[33] "str.default"           "summary.aov"           "summary.data.frame"    "summary.default"
[37] "summary.infl"          "symnum"
To my understanding, at least as.matrix.data.frame() needs a fix.
Back to automatic conversion of characters to factors.
After fixing as.data.frame.matrix(), the following comparisons will be TRUE
or FALSE, depending on whether mat is a numeric matrix or a character matrix:
all( unclass(as.data.frame(mat)) == unclass(mat) )
all( mat == sapply(as.data.frame(mat), FUN=function(x)x) )
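The difference between the two cases comes from the column types the conversion produces. A small sketch, assuming a version of R in which the as.data.frame() method for matrices accepts a stringsAsFactors argument (an argument not used elsewhere in this mail); it is set explicitly here to request the conversion to factors. The object names are illustrative:

```r
num_mat <- matrix(1:4, 2, 2)
chr_mat <- matrix(letters[1:4], 2, 2)

# Column classes after conversion, with factor conversion requested.
cls_num <- sapply(as.data.frame(num_mat, stringsAsFactors = TRUE), class)
cls_chr <- sapply(as.data.frame(chr_mat, stringsAsFactors = TRUE), class)

print(cls_num)   # numeric columns keep their numeric type
print(cls_chr)   # character columns are turned into factors
```

Once the character columns are factors, an element-wise comparison against the original character matrix is no longer a comparison of like with like.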
Obviously, automatic conversion to factors is a design decision made long
ago, but I am not yet convinced.
The need to maintain the attribute "AsIs" just to guarantee that a basic data
type (character) remains unchanged appears to be somewhat dangerous. Both
character data and factors then need maintaining, in EACH FUNCTION that might
work on dataframes. Uff! It is easy to predict that errors will happen:
Some systematic testing ...
> char <- letters[1:2]
> fac <- factor(char)
> dd <- data.frame(char=I(char), fac=fac)
reveals that
> dd[,"char"] <- char
> dd[,"char"]
[1] a b
Levels: a b
but
> dd$char <- char
> dd$char
[1] "a" "b"
So .Primitive("$<-") is inconsistent with automatically converting chars to
factors, as it allows inserting a pure character column into a dataframe,
one which has neither the attribute "AsIs" nor the class "factor".
Handling I() is risky as well:
> mat <- matrix(letters, 2, 2)
> dimnames(mat) <- list(c(1:2), c("x","y"))
> mat
x y
1 "a" "c"
2 "b" "d"
> dd <- data.frame(I(mat))
> ddd
I.mat..x I.mat..y
1 a a
2 b b
3 c c
doesn't look too bad,
but
> dimnames(dd)
[[1]]
[1] "1" "2"
[[2]]
[1] "I.mat."
> str(ddd)
`data.frame': 3 obs. of 1 variable:
 $ I.mat.: chr [1:3, 1:2] "a" "b" "c" "a" "b" "c"
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "x" "y"
  ..- attr(*, "class")= chr "AsIs"
So this dataframe is no longer a simple list with each element representing
one column, and thus
> sapply(ddd, FUN=function(x)x)
I.mat.
[1,] "a"
[2,] "b"
[3,] "c"
[4,] "a"
[5,] "b"
[6,] "c"
is no longer a matrix.
Back to "AsIs":
> ddd[,1]
x y
[1,] "a" "a"
[2,] "b" "b"
[3,] "c" "c"
attr(,"class")
[1] "AsIs"
So here the whole matrix is "AsIs", and since matrix subscripting probably
doesn't maintain "AsIs"
> ddd[[1]][, 1]
[1] "a" "b" "c"
"AsIs" is gone.
Regards
--
Dr. Jens Oehlschlägel-Akiyoshi
MD FACTORY GmbH
Bayerstrasse 21
80335 München
Tel.: 089 545 28-27
Fax.: 089 545 28-10
http://www.mdfactory.de
Standard Disclaimers: Opinions expressed here are personal
and are not otherwise represented.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To:
r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._