thr3ads.net - R devel - [Rd] rnorm is not truly random used in the lm function [Aug 2017]

Victor Tian

2017-Aug-03 13:49 UTC

[Rd] rnorm is not truly random used in the lm function

To whom it may concern,

I happened to run the following R code just to check the layout of the
output, but found that the code doesn't work the way I thought it should
work.

> lm(rnorm(100) ~ rnorm(100))
Call:
lm(formula = rnorm(100) ~ rnorm(100))

Coefficients:
(Intercept)
   -0.07966

Warning messages:
1: In model.matrix.default(mt, mf, contrasts) :
  the response appeared on the right-hand side and was dropped
2: In model.matrix.default(mt, mf, contrasts) :
  problem with term 1 in model.matrix: no columns are assigned

It appears that rnorm(100) produces the same array of numbers on both sides
of the ~ sign.

This can be further verified: the same warnings appear if we do
x <- rnorm(100) and then lm(x ~ x).

I would expect the two rnorm(100) calls inside lm() to return two
different arrays of numbers, but am open to hearing reasons from the other
side.

Thanks,

-- 
*Xu Tian*


Martin Maechler

2017-Aug-03 16:11 UTC


>>>>> Victor Tian <tianxu03 at gmail.com>
>>>>>     on Thu, 3 Aug 2017 09:49:57 -0400 writes:
    > To whom it may concern,
    > I happened to run the following R code just to check the layout of the
    > output, but found that the code doesn't work the way I thought it should
    > work.

yes, your expectations were wrong.

    >> lm(rnorm(100) ~ rnorm(100))

    > Call:
    > lm(formula = rnorm(100) ~ rnorm(100))

    > Coefficients:
    > (Intercept)
    > -0.07966

    > Warning messages:
    > 1: In model.matrix.default(mt, mf, contrasts) :
    > the response appeared on the right-hand side and was dropped
    > 2: In model.matrix.default(mt, mf, contrasts) :
    > problem with term 1 in model.matrix: no columns are assigned

    > It appears that rnorm(100) produces the same array of numbers on both sides
    > of the ~ sign.

Indeed.  And all this has nothing to do with lm()  but rather with
how formulas in R have been treated probably "forever".
[I assume not only in R, but rather since the time formulas
 were introduced into the S language (for "S version 3") a few
 years before R was born. But I can no longer verify or disprove
 this assumption.] 

Even more revealing may be this:
> f <- rnorm(9) ~ rnorm(9)
> str(f)
Class 'formula'  language rnorm(9) ~ rnorm(9)
  ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
> (mm <- model.matrix(f))
  (Intercept)
1           1
2           1
3           1
4           1
5           1
6           1
7           1
8           1
9           1
attr(,"assign")
[1] 0
Warning messages:
1: In model.matrix.default(f) :
  the response appeared on the right-hand side and was dropped
2: In model.matrix.default(f) :
  problem with term 1 in model.matrix: no columns are assigned
>
---------

BTW: One of the goals of formulas,  notably in R since they got an
environment attached, is a clean way to deal with non-standard
evaluation (=: NSE).
[ Some of us would claim it is the only clean way to deal with NSE in R,
  and all new functionality using NSE should use formulas,
  but recently tidyverse-scholars have claimed to be able to deal
  with it cleanly w/o the use of formulas, but via "tidy evaluation" ]

Using random expressions in a formula is therefore typically not
a good idea, because you don't really know when the terms in the
formula will be evaluated.
For lm() and all other good formula-based statistical modeling
functions, the evaluation happens via model.matrix().
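
To make the point about delayed evaluation concrete, here is a minimal
sketch. It assumes nothing beyond base R and the stats package; the names
f and mf are only illustrative. It shows that both sides of the formula
are stored as unevaluated calls, and that the numbers are only drawn once
a model frame is built (lm() calls model.frame() before model.matrix()).

```r
# A formula stores its two sides as unevaluated language objects.
f <- rnorm(9) ~ rnorm(9)
class(f[[2]])   # left-hand side:  "call"
class(f[[3]])   # right-hand side: "call"

# No random numbers have been drawn so far. Evaluation happens only
# when the model frame is built from the formula.
mf <- model.frame(f)
names(mf)       # column name(s) are the deparsed expression(s)
ncol(mf)        # expected to be one column, since both sides are the same expression
nrow(mf)        # 9
```

Because the model frame is built from the expressions rather than from
already computed values, there is no second, independent draw for the
right-hand side to use.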

As you've noticed from that warning, model.matrix() tries to
help the user by checking terms and eliminating those that
appear on both sides of the '~'.
This has been documented on the help page [ ?model.matrix ] for
(almost exactly 14) years, the "Details:" section ending with

 _> By convention, if the response variable also appears on the
 _> right-hand side of the formula it is dropped (with a warning),
 _> although interactions involving the term are retained.

I hope this explains the issue.
And yes:  Do *not* use rnorm() in formulas.
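
A minimal sketch of the alternative, assuming the goal was simply to look
at the layout of the lm() output for two independent random vectors. The
names x, y and fit and the seed value are arbitrary choices for
illustration.

```r
set.seed(2017)     # arbitrary seed, only for reproducibility

# Draw the random numbers first, as ordinary variables ...
y <- rnorm(100)
x <- rnorm(100)

# ... and refer to those variables in the formula.
fit <- lm(y ~ x)
coef(fit)          # now has an intercept and a slope for x
```

With two distinct variable names the response no longer appears on the
right-hand side, so nothing is dropped.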

Martin

--
Martin Mächler
Seminar für Statistik, ETH Zürich //  R Core Team

Victor Tian

2017-Aug-03 16:33 UTC


I did it purely based on the intuition I built from elsewhere and maybe in
R as well.

To summarise, it's basically an evaluation-ordering issue.

It looks like the model.matrix() function takes precedence over
rnorm(100), i.e., outside in rather than inside out in this specific case?
If the inner parts were evaluated first, as in most cases, the two
rnorm(100) expressions would no longer be the same.

I guess it's because they appear the same to model.matrix()? This would
raise another question: how does model.matrix() judge whether two variables
are the same on both sides of the ~ sign? By the input literal?
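
One way to probe the "by the input literal?" question is the rough sketch
below. It only compares evaluated values with unevaluated expressions in
base R; it is not a description of the internal matching code in
model.matrix(), and the names a, b and f are made up for the illustration.

```r
# Two ordinary calls are evaluated separately and give different draws.
set.seed(1)
a <- rnorm(3)
b <- rnorm(3)
identical(a, b)              # FALSE

# Inside a formula nothing has been evaluated yet, so the two sides
# are just the same expression written twice.
f <- rnorm(3) ~ rnorm(3)
identical(f[[2]], f[[3]])    # TRUE
deparse(f[[2]])              # "rnorm(3)"
```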

Please clarify.

Thanks,
Victor


-- 
*Xu Tian*
