Hi all,
I have a question about a linear model with interactions.
I created a data frame df like this:
>df
V1 V2 V3 V4 V5
1 6.414094 c t a g
2 6.117286 t a g t
3 5.756922 a g t g
4 6.090402 g t g t
...
which holds the response in the first column and letters (a, c, g, t) in the
other columns. I want to see whether there are interactions between
neighbouring letters, so I defined the following linear model:
>lm<-lm(df[,1] ~ (df[,2]:df[,3]) + (df[,3]:df[,4]) + (df[,4]:df[,5]) )
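
For reference, here is a minimal reproducible sketch of the same kind of setup. It uses simulated data, so the sample size, the response values and the object name `fit` are assumptions and not part of the real data set. It writes the formula with column names and a `data` argument instead of `df[,i]`, which should describe the same three neighbour interactions.

```r
# Simulated stand-in for the real data: sizes and values are assumptions.
set.seed(1)
letters4 <- c("a", "c", "g", "t")
n <- 200

# Draw one random letter column with a fixed set of four levels
draw <- function() factor(sample(letters4, n, replace = TRUE), levels = letters4)

df <- data.frame(
  V1 = rnorm(n, mean = 6, sd = 0.3),  # response
  V2 = draw(),
  V3 = draw(),
  V4 = draw(),
  V5 = draw()
)

# Same three neighbour interactions, referring to columns by name
fit <- lm(V1 ~ V2:V3 + V3:V4 + V4:V5, data = df)
names(coef(fit))  # coefficient names as generated by lm for this formula
```

The output shown next is from the original data, not from this simulation.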
The result on my data then looks like this:
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8987163 0.0211457 420.828 < 2e-16 ***
df[, 2]a:df[, 3]a -0.1021543 0.0253486 -4.030 5.59e-05 ***
df[, 2]c:df[, 3]a 0.0535562 0.0255685 2.095 0.036213 *
df[, 2]g:df[, 3]a 0.0224073 0.0318965 0.703 0.482372
df[, 2]t:df[, 3]a 0.0024165 0.0259862 0.093 0.925911
df[, 2]a:df[, 3]c 0.0355502 0.0260197 1.366 0.171861
df[, 2]c:df[, 3]c 0.0433014 0.0252535 1.715 0.086415 .
df[, 2]g:df[, 3]c 0.1472222 0.0309441 4.758 1.97e-06 ***
df[, 2]t:df[, 3]c 0.0613779 0.0270601 2.268 0.023323 *
df[, 2]a:df[, 3]g 0.0646498 0.0299286 2.160 0.030770 *
df[, 2]c:df[, 3]g 0.1302731 0.0359439 3.624 0.000290 ***
df[, 2]g:df[, 3]g 0.1512754 0.0360951 4.191 2.78e-05 ***
df[, 2]t:df[, 3]g 0.1084278 0.0339142 3.197 0.001389 **
df[, 2]a:df[, 3]t -0.0249016 0.0262402 -0.949 0.342633
df[, 2]c:df[, 3]t 0.0860302 0.0253518 3.393 0.000691 ***
df[, 2]g:df[, 3]t 0.0241031 0.0358496 0.672 0.501372
df[, 2]t:df[, 3]t NA NA NA NA
df[, 3]a:df[, 4]1 -0.0970149 0.0143730 -6.750 1.50e-11 ***
df[, 3]c:df[, 4]1 -0.0153732 0.0152519 -1.008 0.313486
df[, 3]g:df[, 4]1 -0.0706682 0.0225665 -3.132 0.001740 **
df[, 3]t:df[, 4]1 -0.0581889 0.0158485 -3.672 0.000241 ***
df[, 3]a:df[, 4]2 0.0485333 0.0150167 3.232 0.001231 **
df[, 3]c:df[, 4]2 -0.0790008 0.0150513 -5.249 1.54e-07 ***
df[, 3]g:df[, 4]2 0.0604465 0.0217557 2.778 0.005465 **
df[, 3]t:df[, 4]2 0.0232283 0.0167224 1.389 0.164826
df[, 3]a:df[, 4]3 0.0740046 0.0182221 4.061 4.89e-05 ***
df[, 3]c:df[, 4]3 0.0797502 0.0234485 3.401 0.000672 ***
df[, 3]g:df[, 4]3 0.0720160 0.0253456 2.841 0.004495 **
df[, 3]t:df[, 4]3 0.0778484 0.0221196 3.519 0.000433 ***
df[, 4]a:df[, 5]1 -0.0916618 0.0143707 -6.378 1.81e-10 ***
df[, 4]c:df[, 5]1 -0.0138048 0.0152609 -0.905 0.365691
df[, 4]g:df[, 5]1 -0.0700765 0.0225639 -3.106 0.001900 **
df[, 4]t:df[, 5]1 -0.0734513 0.0158534 -4.633 3.62e-06 ***
df[, 4]a:df[, 5]2 0.0438002 0.0150128 2.918 0.003531 **
df[, 4]c:df[, 5]2 -0.1107056 0.0150634 -7.349 2.04e-13 ***
df[, 4]g:df[, 5]2 0.0652739 0.0217520 3.001 0.002694 **
df[, 4]t:df[, 5]2 0.0219305 0.0167259 1.311 0.189811
df[, 4]a:df[, 5]3 0.0804106 0.0182290 4.411 1.03e-05 ***
df[, 4]c:df[, 5]3 0.0970780 0.0234745 4.135 3.55e-05 ***
df[, 4]g:df[, 5]3 0.0704516 0.0253372 2.781 0.005430 **
df[, 4]t:df[, 5]3 0.0911914 0.0221237 4.122 3.77e-05 ***
Questions:
1.) Why does the lm function change the names of the interaction terms
(after the first undefined coefficient) from a:a, c:a, g:a, ... to
a:1, c:1, g:1, ..., and apparently omit the interaction terms of the form
a:4, c:4, g:4, t:4, which (if I assume correctly) correspond to
a:t, c:t, g:t, t:t?
2.) How do I correctly define a data frame new_df for a new sequence of
letters so that the predict function returns the predicted response? I tried
something like this:
>new_df[2:5]=as.data.frame(t('g'))
>new_df[1]=0
>predict(lm, new_df)
and also a row of the original data frame that was used to fit the model:
>predict(lm, df[1,])
Both calls output the predicted values for all rows used to fit the linear
model, and give these warning messages:
1: 'newdata' had 1 rows but variable(s) found have 7020 rows
2: In predict.lm(lm_pm, new_df) :
prediction from a rank-deficient fit may be misleading
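
A minimal sketch of the kind of call that would be expected to return a single prediction, under the assumption that the model is fitted with column names and a `data` argument, so that predict can match the columns of `newdata` by name. When the formula refers to `df[,2]` and so on, predict has no names to look up in `newdata`, which is consistent with the first warning above. The data here are simulated, and the names `train`, `fit` and `pred` are hypothetical.

```r
# Simulated training data: sizes and values are assumptions.
set.seed(1)
letters4 <- c("a", "c", "g", "t")
n <- 200
draw <- function() factor(sample(letters4, n, replace = TRUE), levels = letters4)
train <- data.frame(V1 = rnorm(n, mean = 6, sd = 0.3),
                    V2 = draw(), V3 = draw(), V4 = draw(), V5 = draw())

# Fit under a name other than `lm`, so the lm function is not masked
fit <- lm(V1 ~ V2:V3 + V3:V4 + V4:V5, data = train)

# New sequence "gggg": same column names and factor levels as in training;
# the response column is not needed for prediction
new_df <- data.frame(
  V2 = factor("g", levels = letters4),
  V3 = factor("g", levels = letters4),
  V4 = factor("g", levels = letters4),
  V5 = factor("g", levels = letters4)
)

# One row in, one prediction out; the rank-deficiency warning can still
# appear because one interaction coefficient is not estimable
pred <- predict(fit, newdata = new_df)
length(pred)
```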
Thanks for any help,
Marian