Martin Maechler
2025-Aug-25 15:56 UTC
[Rd] Do rowMeans and colMeans of complex vars need adjusting following r88444?
>>>>> Dirk Eddelbuettel >>>>> on Mon, 25 Aug 2025 05:46:56 -0500 writes:> On 25 August 2025 at 11:54, Martin Maechler wrote: > | I still agree we should address this: We do have a discrepancy > | with mean() i.e., mean.default() which does "exactly" what you > | do "by hand" above, and hence using is.na() for the full > | complex vector, and not *separately* for Re() and Im() parts; > | ... and I do tend to agree that colMeans(*, na.rm=TRUE) etc > | probably should be adapted to *not* work coordinate-wise but > | drop all "complex NAs" both for Re and Im. > I agree with that. > The culprit is the per-component call (here from colMeans, same for rowMeans) > z <- if(is.complex(x)) > .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + > 1i * .Internal(colMeans(Im(x), n, prod(dn), na.rm)) > else .Internal(colMeans(x, n, prod(dn), na.rm)) > where we may need to add a na.omit() if na.rm is TRUE. unfortunately, no: na.omit() cannot be the solution easily for _both_ colMeans() and rowMeans() { colSums() & rowSums() ! which already make use of na.rm, as indeed na.omit() treats matrices as data matrices, i.e., rows treated very differently than columns, but in any case, it would drop full rows (or columns if you try to transpose first) and that may be very wrong as different rows/columns may have NAs in verydifferent locations. I really think a solution must happen in C code. Further, note that the if(is.complex(x)) .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + 1i * .Internal(colMeans(Im(x), n, prod(dn), na.rm)) with na.rm = TRUE has been *wrong* always, for all cases where is.na(Re(x)) and is.na(Im(x)) have differed ... So this problem is really a new one, only tangentially related to the original PR #18918 and also only loosely related Mikael's / my patch to fix PR #18918. As you have unearthed it, I'm fine if you open an R bugzilla issue for it; otherwise, I'd do it myself. Martin
Dirk Eddelbuettel
2025-Aug-25 22:09 UTC
[Rd] Do rowMeans and colMeans of complex vars need adjusting following r88444?
On 25 August 2025 at 17:56, Martin Maechler wrote: | >>>>> Dirk Eddelbuettel | >>>>> on Mon, 25 Aug 2025 05:46:56 -0500 writes: | I really think a solution must happen in C code. I tend to agree here. If I had seen an easy and obvious path to a fix for this, I would have proposed a patch from the outset. But as I did not, this email thread started instead to begin with a discussion. An 'obvious' solution is the only one I can think of now: we could pass both the real and the imaginary parts down to compute the mean of the imaginary part. A new one-off function just for complex data seems like overkill but maybe it is the way to go (doing one pass and computing real and imaginary parts in that pass). | Further, note that the | | if(is.complex(x)) | .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + | 1i * .Internal(colMeans(Im(x), n, prod(dn), na.rm)) | | with na.rm = TRUE has been *wrong* always, for all cases where | is.na(Re(x)) and | is.na(Im(x)) have differed ... Prior to r88444, the NA assignment for complex data assigned to both the real and imaginary part, making it unlikely both differed. As best as I can tell a user would have had to explicitly set that. | So this problem is really a new one, only tangentially related to the | original PR #18918 and also only loosely related Mikael's / my | patch to fix PR #18918. I see this differently. Prior to r88444 / #18918 we could reliably call rowMeans() and colMeans() on complex-valued matrices containing NA values. A unit test for Rcpp relied on it (to validate its own row/col means method). Following r88444 this test in Rcpp now breaks. And it breaks because the R side is now incorrect as we pass the imaginary alone, with a finite value of zero where the real part has a NA rendering the divisor for the mean values invalid. It really looks like a bug that is a consequence of r88444 to me. Note that R also did not have a test case for row or column means of complex valued data with NA or else this would have bubbled up sooner. So apart from fixing the issue we should probably add a test too. | As you have unearthed it, I'm fine if you open an R bugzilla | issue for it; otherwise, I'd do it myself. It doesn't matter to me. Maybe it is easier if you do it. r88444 was a good change. Everybody seems to agree that assigning NA only to the real part is cleaner. But we are having a small side effect here so let's see about fixing that too. Dirk -- dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org