Hi R gurus,
We do a lot of work with biological -omics datasets (genomics, proteomics etc).
The text file inputs to R typically contain a mixture of (mostly) character data
and numeric data. The number of columns (both character and numeric data) in
the file varies with the number of samples measured (which makes use of colClasses
awkward), so a typical approach might be
1) read in the whole file as a character matrix
#simulated result of read.table (with stringsAsFactors=FALSE)
raw <- data.frame(
  Accession = c('P04637', 'P01375', 'P00761'),
  Description = c('Cellular tumor antigen p53', 'Tumor necrosis factor', 'Trypsin'),
  Species = c('Homo sapiens', 'Homo sapiens', 'Sus scrofa'),
  Intensity.SampleA = c('919948', '1346170', '15870'),
  Intensity.SampleB = c('1625540', '710272', '83624'),
  Intensity.SampleC = c('1232780', '1481040', '62548'))
2) use grep to identify numeric columns based on column names and split the raw
matrix
QUANT_COLS <- grepl('^Intensity\\.',colnames(raw))
META_COLS <- !QUANT_COLS
quant.df.char <- raw[,QUANT_COLS]
meta.df <- raw[, META_COLS]
3) convert the quantitation data frame to a numeric matrix
Prior to R version 4, my standard method for step 3 was to use data.matrix().
After recently updating from v3.6.3, I've found that all my workflows using
this function were giving wildly incorrect results. I figured out that
data.matrix now yields a matrix of integer factor level codes rather than the
actual numeric values:
> quant.df.char
  Intensity.SampleA Intensity.SampleB Intensity.SampleC
1            919948           1625540           1232780
2           1346170            710272           1481040
3             15870             83624             62548
> data.matrix(quant.df.char)
     Intensity.SampleA Intensity.SampleB Intensity.SampleC
[1,]                 3                 1                 1
[2,]                 1                 2                 2
[3,]                 2                 3                 3
The change in behaviour of this function is documented in the R v4.0.0
changelog, so it is clearly intentional:
"data.matrix() now converts character columns to factors and from this to
integers."
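To illustrate the character -> factor -> integer path described in that changelog entry, here is a minimal sketch using the first quantitation column from the simulated data. The variable names (x, f, codes, vals) are only for illustration.

```r
# One column of the simulated data, held as character strings
x <- c('919948', '1346170', '15870')

# factor() takes the sorted unique strings as levels, so the ordering is
# textual ("1346170" < "15870" < "919948"), not numeric
f <- factor(x)

# as.integer() on a factor returns the level codes, not the original numbers
codes <- as.integer(f)
codes  # 3 1 2, matching the first column of the data.matrix() output

# Converting the strings directly recovers the measured values
vals <- as.numeric(x)
```

The codes only record each string's position among the sorted levels, which is why the intensities are lost.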
Now, I know there are other ways to achieve the same conversion, e.g.
sapply(quant.df.char, as.numeric). They aren't quite as straightforward to
read in the code as data.matrix (with sapply/lapply in particular I have to
think through whether there will be a need to transpose the result!), but the fact that
this base function has been changed (without a way to replicate the previous
behaviour) leads me to suspect that I have probably not previously been using
data.matrix in the intended manner - and I may therefore be making similar
mistakes elsewhere! I've certainly distributed/handed out R scripting
examples in the past that will now give incorrect results when run on v4+ R.
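For concreteness, here is a rough sketch of a few conversions that return the numeric values on R 4.x. The helper name char_df_to_numeric is hypothetical (it is not a base R function), and the two-column data frame is just a cut-down copy of the simulated data. This is one possible approach under those assumptions, not a recommended replacement.

```r
# Cut-down copy of the character quantitation columns from step 2
quant.df.char <- data.frame(
  Intensity.SampleA = c('919948', '1346170', '15870'),
  Intensity.SampleB = c('1625540', '710272', '83624'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: go via a character matrix, then switch the storage
# mode to double; dimensions and column names are kept
char_df_to_numeric <- function(df) {
  m <- as.matrix(df)
  storage.mode(m) <- "double"
  m
}
m1 <- char_df_to_numeric(quant.df.char)

# Alternative: convert the columns first, so data.matrix() has no character
# columns left to factorise (the result may be stored as integer, not double)
m2 <- data.matrix(type.convert(quant.df.char, as.is = TRUE))

# sapply() over the columns of a data frame returns one result column per
# input column, so no transpose is needed here
m3 <- sapply(quant.df.char, as.numeric)
```

One caveat with the sapply() form: for a single-row data frame it simplifies to a named vector rather than a matrix, whereas the as.matrix() route keeps the matrix shape.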
What is even more confusing to me (but possibly related as regards an answer) is
that R v4 broke with long-standing convention by changing
default.stringsAsFactors() to FALSE. So on one hand the update took away what
was (at least, from our perspective, with our data - I am sure some here may
disagree!) a perennial source of confusion/bugs for R learners, by not
introducing string factorisation during data import, and then on the other hand
changed a base function to explicitly introduce string factorisation. I
can't see when converting a character dataset straight to numeric factor
levels, rather than to factors, might be that useful (but of course that
doesn't mean it isn't!).
I've had a look through r-help and r-devel archives and couldn't spot
any discussion of this, so apologies if this has been asked before. I'm also
pretty sure my misunderstanding is with the intended use-case of data.matrix and
R ethos around strings/factors rather than the rationale for the change, which
is why I'm asking here.
Best wishes,
Phil
Philip Charles
Target Discovery Institute, Nuffield Department Of Medicine
University of Oxford