Jorgen Harmse
2024-Apr-08 17:03 UTC
[R] duplicated() on zero-column data frames returns empty
I appreciate the compliment from Ivan and still share the puzzlement at the empty return.

What is the policy for changing something that is wrong? There is a trade-off between breaking old code that worked around a problem and breaking new code written by people who make reasonable assumptions.

Mathematically, it seems obvious to me that duplicated.matrix(A) should do something like this:

    v <- matrix(FALSE, nrow = nrow(A) -> nr, ncol=1L) # or an ordinary vector?
    if (nr > 1L) # Check because 2:0 & 2:1 do not do what we want.
    { for (i in 2:nr)
      { for (j in 1:(i-1))
          if (identical(A[i,],A[j,])) # or something more complicated to handle incomparables
          { v[i] <- TRUE; break}
      }
    }
    v

Of course my code is horribly inefficient, but the difference should be just in computing the same result faster. An empty vector of some type is identical to an empty vector of the same type, so this computes

          [,1]
    [1,] FALSE
    [2,]  TRUE
    [3,]  TRUE
    [4,]  TRUE
    [5,]  TRUE

and I argue that that is correct.

A gap in documentation makes a change to the correct behaviour easier. (If the current behaviour were documented then the first step in changing the behaviour would be to issue a warning that the change is coming in a future version.) The protection for old code could be just a warning that can be turned off with a call to options. The new documentation should be more explicit.

Regards,
Jorgen.

From: Mark Webster <markwebster204 at yahoo.co.uk>
To: Jorgen Harmse <jharmse at roku.com>, Ivan Krylov <ikrylov at disroot.org>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Subject: Re: [R] duplicated() on zero-column data frames returns empty
Message-ID: <603481690.9150754.1712522666289 at mail.yahoo.com>

duplicated.matrix is an interesting one. I think a similar change would make sense, because it would have the dimensions that people would expect when using the default MARGIN = 1.
However, it could be argued that it's not a needed change, because the Value section of its documentation only guarantees the dimensions of the output when using MARGIN = 0. In that case, duplicated.matrix does indeed return the expected 5x0 matrix for your example:

    str(duplicated(matrix(0, 5, 0), MARGIN = 0))
    # logi[1:5, 0 ]

Best Regards,
Mark Webster

From: Mark Webster <markwebster204 at yahoo.co.uk>
To: Ivan Krylov <ikrylov at disroot.org>, r-help at r-project.org
Subject: Re: [R] duplicated() on zero-column data frames returns empty vector
Message-ID: <1379736116.7985600.1712306452176 at mail.yahoo.com>

Do you mean the row names should mean all the rows should be counted as non-duplicates? Yes, I can see the argument for that, thanks. I must say I'm still puzzled at what interpretation would motivate the current behaviour of returning a logical(0), however.

Date: Sun, 7 Apr 2024 11:00:51 +0300
From: Ivan Krylov <ikrylov at disroot.org>
To: Jorgen Harmse <JHarmse at roku.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>, "markwebster204 at yahoo.co.uk" <markwebster204 at yahoo.co.uk>
Subject: Re: [R] duplicated() on zero-column data frames returns empty
Message-ID: <20240407110051.7924c03c at Tarkus>

On Fri, 5 Apr 2024 16:08:13 +0000, Jorgen Harmse <JHarmse at roku.com> wrote:

> if duplicated really treated a row name as part of the row then
> any(duplicated(data.frame(...))) would always be FALSE. My expectation
> is that if key1 is a subset of key2 then all(duplicated(df[key1]) >=
> duplicated(df[key2])) should always be TRUE.

That's a good argument, thank you!

Would you suggest similar changes to duplicated.matrix too? Currently it too returns 0-length output for 0-column inputs:

    # 0-column matrix for 0-column input
    str(duplicated(matrix(0, 5, 0)))
    # logi[1:5, 0 ]

    # 1-column matrix for 1-column input
    str(duplicated(matrix(0, 5, 1)))
    # logi [1:5, 1] FALSE TRUE TRUE TRUE TRUE

    # a dim-1 array for >1-column input
    str(duplicated(matrix(0, 5, 10)))
    # logi [1:5(1d)] FALSE TRUE TRUE TRUE TRUE

--
Best regards,
Ivan
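The row-wise definition proposed above can be wrapped in a function and tried on the 5 x 0 case. This is a minimal sketch rather than a patch to base R: duplicated_rows_ref is a hypothetical name, and the matrix m is made-up illustration data. The last lines also check the subset-of-keys expectation quoted above, using column subsets of m as the keys.

```r
# Reference (quadratic-time) version of the row-wise definition.
# `duplicated_rows_ref` is a hypothetical helper, not part of base R.
duplicated_rows_ref <- function(A) {
  stopifnot(is.matrix(A))
  nr <- nrow(A)
  v <- logical(nr)                       # starts as all FALSE
  for (i in seq_len(nr)[-1L]) {          # empty loop when nr < 2
    for (j in seq_len(i - 1L)) {
      if (identical(A[i, ], A[j, ])) {   # zero-length rows are identical
        v[i] <- TRUE
        break
      }
    }
  }
  v
}

# The zero-column case from the thread: 5 rows, 0 columns.
res_zero <- duplicated_rows_ref(matrix(0, 5, 0))
print(res_zero)
# FALSE TRUE TRUE TRUE TRUE

# Made-up data with rows (1,9), (2,8), (1,9), (1,7), (2,8).
m <- matrix(c(1, 2, 1, 1, 2,
              9, 8, 9, 7, 8), ncol = 2)

# Keys of increasing size: no columns, column 1, both columns.
by_none <- duplicated_rows_ref(m[, integer(0), drop = FALSE])
by_col1 <- duplicated_rows_ref(m[, 1, drop = FALSE])
by_both <- duplicated_rows_ref(m)

# A smaller key should flag at least the rows a larger key flags.
monotone <- all(by_none >= by_col1) && all(by_col1 >= by_both)

# For an ordinary 2-column matrix the sketch matches base R.
agrees <- identical(by_both, as.vector(duplicated(m)))
```

With the empty key, every row after the first counts as a duplicate, which is what keeps the ordering between key subsets intact.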
Ivan Krylov
2024-May-12 15:54 UTC
[R] duplicated() on zero-column data frames returns empty
(Sorry for only getting back to this more than a month later.)

On Mon, 8 Apr 2024 17:03:00 +0000, Jorgen Harmse <JHarmse at roku.com> wrote:

> What is the policy for changing something that is wrong? There is a
> trade-off between breaking old code that worked around a problem and
> breaking new code written by people who make reasonable assumptions.

First of all, quantify the breakage. Does the proposed change break `make check-devel`? Does it break CRAN and Bioconductor? (This one is hard to measure properly: someone will have to run >20000 R CMD checks times two, for "before the change" and "after the change".)

Given a persuasive case, breaking changes can still be made, but will require a deprecation period to let the packages adjust.

If you would like to try your hand at developing a patch and make a case for it at R-devel or the Bugzilla, the resources at <https://contributor.r-project.org/> can be helpful.

--
Best regards,
Ivan
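One way to picture the transition period is a wrapper that returns the proposed result for zero-column input and warns unless the caller opts out through options(), as suggested earlier in the thread. This is only a sketch under stated assumptions: the function name duplicated_rows and the option name dup.zero.col.warn are invented for illustration, and nothing here reflects an actual or planned change in R.

```r
# Hypothetical transitional wrapper. Both `duplicated_rows` and the option
# `dup.zero.col.warn` are illustrative names, not part of base R.
duplicated_rows <- function(x, ...) {
  if (length(dim(x)) == 2L && ncol(x) == 0L) {
    # Warn by default; callers can silence this via options().
    if (isTRUE(getOption("dup.zero.col.warn", TRUE))) {
      warning("zero-column input: rows after the first are treated as duplicates",
              call. = FALSE)
    }
    # Proposed result: the first row is new, all later rows repeat it.
    return(seq_len(nrow(x)) > 1L)
  }
  duplicated(x, ...)                     # everything else is unchanged
}

df0 <- data.frame(a = 1:5)[integer(0)]   # 5 rows, 0 columns

# Default: the call warns, so old code that relied on the empty
# result gets a visible signal.
warned <- tryCatch({ duplicated_rows(df0); FALSE },
                   warning = function(w) TRUE)

# Opting out of the warning, then restoring the previous setting.
old <- options(dup.zero.col.warn = FALSE)
quiet <- duplicated_rows(df0)
options(old)
```

Measuring how many packages would see the warning is still the separate before-and-after check run described above; the wrapper only shows what the opt-out could look like.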