Balise, Raymond R
2021-Dec-27 01:35 UTC
[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?
Hello R folks,

Today I noticed that using the subset argument in lm() with a polynomial gives a different result than using the polynomial when the data has already been subsetted. This was not at all intuitive to me. You can see an example here: https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i

If this is a design feature that you don't think should be fixed, can you please include it in the documentation and explain why it makes sense to compute the orthogonal polynomials on the entire dataset? This feels like a serious leak of information when evaluating train and test datasets in a statistical learning framework.

Ray

Raymond R. Balise, PhD
Assistant Professor
Department of Public Health Sciences, Biostatistics
University of Miami, Miller School of Medicine
1120 N.W. 14th Street
Don Soffer Clinical Research Center - Room 1061
Miami, Florida 33136
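The behavior described above can be reproduced with a minimal sketch on the built-in mtcars data (an illustrative choice of dataset and variables; the linked Stack Overflow post uses a different example):

```r
## Fit an orthogonal-polynomial model two ways. When subset= is passed to
## lm(), the data-dependent poly() basis is computed on the FULL data and
## rows are dropped afterwards; pre-subsetting computes it on the subset.
idx <- mtcars$cyl != 8                       # keep the non-8-cylinder cars

## subset= argument: poly() sees all 32 rows before rows are dropped
fit_subset_arg <- lm(mpg ~ poly(disp, 2), data = mtcars, subset = idx)

## pre-subsetted data: poly() sees only the retained rows
fit_pre_subset <- lm(mpg ~ poly(disp, 2), data = mtcars[idx, ])

## The two orthogonal bases differ, so the coefficients differ ...
coef(fit_subset_arg)
coef(fit_pre_subset)

## ... even though both fits describe the same underlying quadratic curve.
```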
Ben Bolker
2021-Dec-27 14:43 UTC
[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?
I agree that it seems non-intuitive (I can't think of a design reason for it to look this way), but I'd like to stress that it's *not* an information leak; the predictions of the model are independent of the parameterization, which is all this issue affects. In a worst case there might be some unfortunate effects on numerical stability if the data-dependent bases are computed on a very different set of data than the model fitting actually uses.

I've attached a suggested documentation patch (I hope it makes it through to the list, if not I can add it to the body of a message.)

On 12/26/21 8:35 PM, Balise, Raymond R wrote:
> Today I noticed that using the subset argument in lm() with a polynomial gives a different result than using the polynomial when the data has already been subsetted. [...]

--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
Graduate chair, Mathematics & Statistics

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: subset_patch.txt
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20211227/47cc2f5a/attachment.txt>
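Ben's point that only the parameterization changes, not the fitted model, can be checked directly (again using mtcars as a hypothetical stand-in for the original example):

```r
## Refit the two versions of the model from the earlier example.
idx <- mtcars$cyl != 8
fit_subset_arg <- lm(mpg ~ poly(disp, 2), data = mtcars, subset = idx)
fit_pre_subset <- lm(mpg ~ poly(disp, 2), data = mtcars[idx, ])

## Fitted values on the training rows are identical (up to rounding) ...
all.equal(unname(fitted(fit_subset_arg)), unname(fitted(fit_pre_subset)))

## ... and so are predictions on new data, because predict() reuses the
## stored basis (via makepredictcall) that each fit was estimated with.
newdata <- data.frame(disp = c(100, 200, 300))
all.equal(predict(fit_subset_arg, newdata), predict(fit_pre_subset, newdata))
```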
Martin Maechler
2022-Jan-03 15:54 UTC
[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?
>>>>> Ben Bolker
>>>>>     on Mon, 27 Dec 2021 09:43:42 -0500 writes:

> I agree that it seems non-intuitive (I can't think of a design reason for it to look this way), but I'd like to stress that it's *not* an information leak; the predictions of the model are independent of the parameterization, which is all this issue affects. [...]
> I've attached a suggested documentation patch [...]

It did make it through; thank you, Ben!
(After adding two forgotten '}')

I've committed the help file additions to the R sources (R-devel) in svn r81434.

Thanks again and "Happy New Year" to all readers,
Martin

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel