Balise, Raymond R
2021-Dec-27 01:35 UTC
[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?
Hello R folks,

Today I noticed that using the subset argument in lm() with a polynomial gives a different result than using the polynomial when the data has already been subsetted. This was not at all intuitive to me. You can see an example here: https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i

If this is a design feature that you don't think should be fixed, can you please include it in the documentation and explain why it makes sense to compute the orthogonal polynomials on the entire dataset? This feels like a serious leak of information when evaluating train and test datasets in a statistical learning framework.

Ray

Raymond R. Balise, PhD
Assistant Professor
Department of Public Health Sciences, Biostatistics
University of Miami, Miller School of Medicine
1120 N.W. 14th Street
Don Soffer Clinical Research Center - Room 1061
Miami, Florida 33136
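The behavior described above can be reproduced with a minimal sketch on the built-in mtcars data (an illustrative choice of dataset and variables; the linked Stack Overflow post uses a different example):

```r
## Fit an orthogonal-polynomial model two ways. When subset= is passed to
## lm(), the data-dependent poly() basis is computed on the FULL data and
## rows are dropped afterwards; pre-subsetting computes it on the subset.
idx <- mtcars$cyl != 8                       # keep the non-8-cylinder cars

## subset= argument: poly() sees all 32 rows before rows are dropped
fit_subset_arg <- lm(mpg ~ poly(disp, 2), data = mtcars, subset = idx)

## pre-subsetted data: poly() sees only the retained rows
fit_pre_subset <- lm(mpg ~ poly(disp, 2), data = mtcars[idx, ])

## The two orthogonal bases differ, so the coefficients differ ...
coef(fit_subset_arg)
coef(fit_pre_subset)

## ... even though both fits describe the same underlying quadratic curve.
```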
Ben Bolker
2021-Dec-27 14:43 UTC
[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?
I agree that it seems non-intuitive (I can't think of a design reason for it to look this way), but I'd like to stress that it's *not* an information leak; the predictions of the model are independent of the parameterization, which is all this issue affects. In a worst case there might be some unfortunate effects on numerical stability if the data-dependent bases are computed on a very different set of data than the model fitting actually uses.

I've attached a suggested documentation patch (I hope it makes it through to the list, if not I can add it to the body of a message.)

On 12/26/21 8:35 PM, Balise, Raymond R wrote:
> Today I noticed that using the subset argument in lm() with a polynomial gives a different result than using the polynomial when the data has already been subsetted. [...]

--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
Graduate chair, Mathematics & Statistics

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: subset_patch.txt
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20211227/47cc2f5a/attachment.txt>
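Ben's point that only the parameterization changes, not the fitted model, can be checked directly (again using mtcars as a hypothetical stand-in for the original example):

```r
## Refit the two versions of the model from the earlier example.
idx <- mtcars$cyl != 8
fit_subset_arg <- lm(mpg ~ poly(disp, 2), data = mtcars, subset = idx)
fit_pre_subset <- lm(mpg ~ poly(disp, 2), data = mtcars[idx, ])

## Fitted values on the training rows are identical (up to rounding) ...
all.equal(unname(fitted(fit_subset_arg)), unname(fitted(fit_pre_subset)))

## ... and so are predictions on new data, because predict() reuses the
## stored basis (via makepredictcall) that each fit was estimated with.
newdata <- data.frame(disp = c(100, 200, 300))
all.equal(predict(fit_subset_arg, newdata), predict(fit_pre_subset, newdata))
```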
Martin Maechler
2022-Jan-03 15:54 UTC
[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?
>>>>> Ben Bolker
>>>>>     on Mon, 27 Dec 2021 09:43:42 -0500 writes:

> I agree that it seems non-intuitive (I can't think of a design reason for it to look this way), but I'd like to stress that it's *not* an information leak; the predictions of the model are independent of the parameterization, which is all this issue affects. [...]
> I've attached a suggested documentation patch [...]

It did make it through; thank you, Ben!
(After adding two forgotten '}')

I've committed the help file additions to the R sources (R-devel) in svn r81434.

Thanks again and "Happy New Year" to all readers,
Martin

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel