On 11-May-08 18:58:45, Myers, Brent wrote:
> There is a very useful and apparently fundamental feature of R
> (or of the package pls) which I don't understand.
>
> For datasets with many independent (X) variables such as chemometric
> datasets there is a convenient formula and dataframe construction
> that allows one to access the entire X matrix with a single term.
>
> Consider the gasoline dataset available in the pls package. For the
> model statement in the plsr function one can write: Octane ~ NIR
>
> NIR refers to a (wide) matrix which is a portion of a dataframe. The
> naming of the columns is of the form: 'NIR.xxxx nm'
>
> names(gasoline) returns...
>
> $names
> [1] "octane" "NIR"
>
> instead of...
>
> $names
> [1] "octane" "NIR.1000 nm" "NIR.1001 nm" ...
>
> How do I construct and manipulate such dataframes and the column
> names that go with?
>
> Does the use of these types of formulas and dataframes generalize
> to other modeling functions?
>
> Some specific clues on a help search might be enough, I've tried many.
>
> Regards,
> Brent
I don't have the 'gasoline' dataset to hand, but I can produce
something to which your description applies as follows:
C1 <- c(1.1,1.2,1.3,1.4)
C2 <- c(2.1,2.2,2.3,2.4)
M <- cbind(M1=c(11.1,11.2,11.3,11.4),
M2=c(12.1,12.2,12.3,12.4))
DF <- data.frame(C1=C1,C2=C2,M=M)
DF
# C1 C2 M.M1 M.M2
# 1 1.1 2.1 11.1 12.1
# 2 1.2 2.2 11.2 12.2
# 3 1.3 2.3 11.3 12.3
# 4 1.4 2.4 11.4 12.4
So the two columns C1 and C2 have gone in as named, and the
matrix M (with named columns M1 and M2) has gone in as two
separate columns, M.M1 and M.M2.
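Note that data.frame() has split M into two ordinary columns here, whereas in the 'gasoline' layout described in the question names() shows a single "NIR" entry. A minimal sketch of one way to keep a matrix as a single data frame column, assuming base R only: wrap it in I(), or assign it afterwards with $<-. The names DF2 and DF3 are made up for illustration.

```r
C1 <- c(1.1, 1.2, 1.3, 1.4)
C2 <- c(2.1, 2.2, 2.3, 2.4)
M  <- cbind(M1 = c(11.1, 11.2, 11.3, 11.4),
            M2 = c(12.1, 12.2, 12.3, 12.4))

# I() protects the matrix, so data.frame() stores it as ONE column
DF2 <- data.frame(C1 = C1, C2 = C2, M = I(M))
names(DF2)        # "C1" "C2" "M"
dim(DF2$M)        # 4 2
colnames(DF2$M)   # "M1" "M2"

# Alternative: build the data frame first, then assign the matrix
DF3 <- data.frame(C1 = C1, C2 = C2)
DF3$M <- M
names(DF3)        # "C1" "C2" "M"
```

With M stored this way, a formula such as C1 ~ M can find M inside the data frame passed as data=, rather than in the workspace.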
Now let's fuzz the numbers a bit, so that the lm() fit
makes sense:
C1 <- C1 + round(0.1*runif(4),2)
C1 <- C1 + round(0.1*runif(4),2)
M <- cbind(M1=c(11.1,11.2,11.3,11.4),
M2=c(12.1,12.2,12.3,12.4)) +
round(0.1*runif(8),2)
DF <- data.frame(C1=C1,C2=C2,M=M)
DF
# C1 C2 M.M1 M.M2
# 1 1.21 2.1 11.19 12.13
# 2 1.34 2.2 11.23 12.23
# 3 1.38 2.3 11.36 12.30
# 4 1.50 2.4 11.43 12.48
summary(lm(C1 ~ M, data=DF))
# Call:
# lm(formula = C1 ~ M, data = DF)
# Residuals:
# 1 2 3 4
# -0.02422 0.02448 0.01309 -0.01335
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -8.28435 2.48952 -3.328 0.186
# MM1 -0.05411 0.66909 -0.081 0.949
# MM2 0.83463 0.50687 1.647 0.347
# Residual standard error: 0.03919 on 1 degrees of freedom
# Multiple R-Squared: 0.9642, Adjusted R-squared: 0.8925
# F-statistic: 13.46 on 2 and 1 DF, p-value: 0.1893
In other words, a perfectly standard LM fit, equivalent to
summary(lm(C1 ~ M[,1]+M[,2]))
(as you can check). So all that looks straightforward.
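The check can be sketched as follows. This uses freshly simulated data rather than the numbers above; the names X, y, fit_matrix and fit_columns, the seed and the sample size are all arbitrary choices made for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 20
X <- cbind(X1 = rnorm(n), X2 = rnorm(n))
y <- 1 + 2 * X[, 1] - X[, 2] + rnorm(n, sd = 0.1)

fit_matrix  <- lm(y ~ X)                  # whole matrix as one term
fit_columns <- lm(y ~ X[, 1] + X[, 2])    # columns spelled out

# The estimates agree; only the coefficient names differ
all.equal(unname(coef(fit_matrix)), unname(coef(fit_columns)))
```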
One thing, however, is not clear to me in this scenario.
Suppose, for example, that the columns M1 and M2 of M
were factors (and that you had more rows than I've used
above, so that the fit is non-trivial).
Then, in the standard specification of an LM, you could
write
summary(lm(C1 ~ M[,1]*M[,2]))
and get the main effects and interactions. But how would
you do that in the other type of specification:
Where you used
summary(lm(C1 ~ M, data=DF))
to get the equivalent of
summary(lm(C1 ~ M[,1]+M[,2]))
what would you use to get the equivalent of
summary(lm(C1 ~ M[,1]*M[,2]))?
Would you have to "spell out" the interaction term[s]
in additional columns of M?
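For numeric columns at least, spelling out the product as an extra column does reproduce the interaction fit. A minimal sketch on simulated data, where X, XI, y and the fit names are hypothetical and chosen only for this example:

```r
set.seed(2)                      # arbitrary seed, for reproducibility only
n <- 30
X <- cbind(X1 = rnorm(n), X2 = rnorm(n))
y <- 1 + X[, 1] + X[, 2] + 0.5 * X[, 1] * X[, 2] + rnorm(n, sd = 0.1)

# Append the product of the two columns as a third column
XI <- cbind(X, X1.X2 = X[, 1] * X[, 2])

fit_wide <- lm(y ~ XI)                   # matrix term, interaction spelled out
fit_star <- lm(y ~ X[, 1] * X[, 2])      # standard main effects + interaction

# Same estimates in the same order; only the names differ
all.equal(unname(coef(fit_wide)), unname(coef(fit_star)))
```

The factor case is less direct, since a matrix holds a single atomic type and so cannot carry factor columns as such. One would presumably have to expand the factors into dummy columns first (for instance with model.matrix()) before putting them in the matrix.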
Hmmm, interesting! I hadn't been aware of this aspect of
formula and dataframe construction for modelling, until
you pointed it out!
Best wishes,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 11-May-08 Time: 21:06:49
------------------------------ XFMail ------------------------------