Martin Maechler
2005-Jan-08 17:19 UTC
[Rd] p.adjust(<NA>s), was "Re: [BioC] limma and p-values"
>>>>> "GS" == Gordon K Smyth <smyth@wehi.edu.au> >>>>> on Sat, 8 Jan 2005 01:11:30 +1100 (EST) writes:<.............> GS> p.adjust() unfortunately gives incorrect results when GS> 'p' includes NAs. The results from topTable are GS> correct. topTable() takes care to remove NAs before GS> passing the values to p.adjust(). There's at least one bug in p.adjust(): The "hommel" method currently does not work at all with NAs (and I have an uncommitted fix ready for this bug). OTOH, the current version of p.adjust() ``works'' with NA's, apart from Hommel's method, but by using "n = length(p)" in the correction formulae, i.e. *including* the NAs for determining sample size `n' {my fix to "hommel" would do this as well}. My question is what p.adjust() should do when there are NA's more generally, or more specifically which `n' to use in the correction formula. Your proposal amounts to ``drop NA's and forget about them till the very end'' (where they are wanted in the result), i.e., your sample size `n' would be sum(!is.na(p)) instead of length(p). To me it doesn't seem obvious that this setting "n = #{non-NA observations}" is desirable for all P-value adjustment methods. One argument for keeping ``n = #{all observations}'' at least for some correction methods is the following "continuity" one: If only a few ``irrelevant'' (let's say > 0.5) P-values are replaced by NA, the adjusted relevant small P-values shouldn't change much, ideally not at all. I'm really no scholar on this topic, but e.g. for "holm" I think I would want to keep ``full n'' because of the above continuity argument. BTW, for "fdr", I don't see a straightforward way to achieve the desired continuity. 5D Of course, p.adjust() could adopt the possibility of chosing how NA's should be treated e.g. by another argument ``use.na = TRUE/FALSE'' and hence allow both versions. Feedback very welcome, particularly from ``P-value experts'' ;-) Martin Maechler, ETH Zurich
Martin Maechler
2005-Jan-08 21:29 UTC
[Rd] p.adjust(<NA>s), was "Re: [BioC] limma and p-values"
I've thought more and made experiements with R code versions and just now committed a new version of p.adjust() to R-devel --> https://svn.r-project.org/R/trunk/src/library/stats/R/p.adjust.R which does sensible NA handling by default and *additionally* has an "na.rm" argument (set to FALSE by default). The extended 'Examples' secion on the help page https://svn.r-project.org/R/trunk/src/library/stats/man/p.adjust.Rd shows how the new NA handling is typically much more sensible than using "na.rm = TRUE". Martin>>>>> "MM" == Martin Maechler <maechler@stat.math.ethz.ch> >>>>> on Sat, 8 Jan 2005 17:19:23 +0100 writes:>>>>> "GS" == Gordon K Smyth <smyth@wehi.edu.au> >>>>> on Sat, 8 Jan 2005 01:11:30 +1100 (EST) writes:MM> <.............> GS> p.adjust() unfortunately gives incorrect results when GS> 'p' includes NAs. The results from topTable are GS> correct. topTable() takes care to remove NAs before GS> passing the values to p.adjust(). MM> There's at least one bug in p.adjust(): The "hommel" MM> method currently does not work at all with NAs (and I MM> have an uncommitted fix ready for this bug). OTOH, the MM> current version of p.adjust() ``works'' with NA's, apart MM> from Hommel's method, but by using "n = length(p)" in MM> the correction formulae, i.e. *including* the NAs for MM> determining sample size `n' {my fix to "hommel" would do MM> this as well}. MM> My question is what p.adjust() should do when there are MM> NA's more generally, or more specifically which `n' to MM> use in the correction formula. Your proposal amounts to MM> ``drop NA's and forget about them till the very end'' MM> (where they are wanted in the result), i.e., your sample MM> size `n' would be sum(!is.na(p)) instead of length(p). MM> To me it doesn't seem obvious that this setting "n MM> #{non-NA observations}" is desirable for all P-value MM> adjustment methods. One argument for keeping ``n = #{all MM> observations}'' at least for some correction methods is MM> the following "continuity" one: MM> If only a few ``irrelevant'' (let's say > 0.5) P-values MM> are replaced by NA, the adjusted relevant small P-values MM> shouldn't change much, ideally not at all. I'm really MM> no scholar on this topic, but e.g. for "holm" I think I MM> would want to keep ``full n'' because of the above MM> continuity argument. BTW, for "fdr", I don't see a MM> straightforward way to achieve the desired continuity. MM> 5D Of course, p.adjust() could adopt the possibility of MM> chosing how NA's should be treated e.g. by another MM> argument ``use.na = TRUE/FALSE'' and hence allow both MM> versions. MM> Feedback very welcome, particularly from ``P-value MM> experts'' ;-) MM> Martin Maechler, ETH Zurich