thr3ads.net - R help - [R] duplicated() on zero-column data frames returns empty [Apr 2024]

Jorgen Harmse

2024-Apr-08 17:03 UTC

[R] duplicated() on zero-column data frames returns empty

I appreciate the compliment from Ivan and still share the puzzlement at the
empty return.

What is the policy for changing something that is wrong? There is a trade-off
between breaking old code that worked around a problem and breaking new code
written by people who make reasonable assumptions. Mathematically, it seems
obvious to me that duplicated.matrix(A) should do something like this:

nr <- nrow(A)
v <- matrix(FALSE, nrow = nr, ncol = 1L) # or an ordinary vector?
if (nr > 1L) { # Check because 2:0 & 2:1 do not do what we want.
  for (i in 2:nr) {
    for (j in 1:(i - 1)) {
      # or something more complicated to handle incomparables
      if (identical(A[i, ], A[j, ])) {
        v[i] <- TRUE
        break
      }
    }
  }
}
v

Of course my code is horribly inefficient, but an efficient implementation
should differ only in computing the same result faster. An empty vector of some
type is identical to an empty vector of the same type, so for a 5x0 matrix such
as matrix(0, 5, 0) this computes

     [,1]
[1,] FALSE
[2,]  TRUE
[3,]  TRUE
[4,]  TRUE
[5,]  TRUE

and I argue that that is correct.
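
The row-by-row definition can be wrapped in a function to check the claim directly. This is only a minimal sketch: the name duplicated_rows_ref is made up for illustration and is not part of base R, and it returns a plain logical vector rather than a one-column matrix.

```r
# Hypothetical reference implementation (not base R): row i is marked as
# duplicated if it is identical to some earlier row j < i.
duplicated_rows_ref <- function(A) {
  nr <- nrow(A)
  v <- logical(nr)                  # all FALSE to start
  for (i in seq_len(nr)[-1L]) {     # empty loop when nr is 0 or 1
    for (j in seq_len(i - 1L)) {
      if (identical(A[i, ], A[j, ])) {
        v[i] <- TRUE
        break
      }
    }
  }
  v
}

# Zero-column case: every row is an empty vector, so rows 2..5 repeat row 1.
zero_col <- duplicated_rows_ref(matrix(0, 5, 0))
print(zero_col)

# Ordinary case, for comparison with base R's duplicated().
m <- matrix(c(1, 1, 2,
              1, 1, 3), nrow = 3)
ordinary <- duplicated_rows_ref(m)
print(ordinary)
```

Using seq_len() instead of 2:nr avoids the need for a separate guard against 2:0 and 2:1.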

A gap in documentation makes a change to the correct behaviour easier. (If the
current behaviour were documented then the first step in changing the behaviour
would be to issue a warning that the change is coming in a future version.) The
protection for old code could be just a warning that can be turned off with a
call to options. The new documentation should be more explicit.
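
To make the idea of a warning controlled through options() concrete, here is a rough sketch. Both the wrapper name duplicated_rows_compat and the option name example.warn.zero.col.duplicated are invented for this illustration; nothing like them exists in base R, and an actual patch might look quite different.

```r
# Hypothetical wrapper (not base R): behaves like duplicated() except for
# zero-column inputs with at least one row, where it returns one element per
# row and warns unless the made-up option below is set to FALSE.
duplicated_rows_compat <- function(x, ...) {
  if (length(dim(x)) == 2L && ncol(x) == 0L && nrow(x) > 0L) {
    if (isTRUE(getOption("example.warn.zero.col.duplicated", TRUE))) {
      warning("zero-column input: returning one element per row")
    }
    return(seq_len(nrow(x)) > 1L)   # FALSE for row 1, TRUE for the rest
  }
  duplicated(x, ...)
}

# Code that has already adapted can silence the warning:
old <- options(example.warn.zero.col.duplicated = FALSE)
quiet <- duplicated_rows_compat(data.frame(row.names = 1:3))
options(old)
print(quiet)
```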

Regards,
Jorgen.

From: Mark Webster <markwebster204 at yahoo.co.uk>
To: Jorgen Harmse <jharmse at roku.com>, Ivan Krylov
        <ikrylov at disroot.org>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Subject: Re: [R] duplicated() on zero-column data frames returns empty
Message-ID: <603481690.9150754.1712522666289 at mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

duplicated.matrix is an interesting one. I think a similar change would make
sense, because it would have the dimensions that people would expect when using
the default MARGIN = 1. However, it could be argued that it's not a needed
change, because the Value section of its documentation only guarantees the
dimensions of the output when using MARGIN = 0. In that case, duplicated.matrix
does indeed return the expected 5x0 matrix for your example:

str(duplicated(matrix(0, 5, 0), MARGIN = 0))
# logi[1:5, 0 ]

Best Regards,
Mark Webster

From: Mark Webster <markwebster204 at yahoo.co.uk>
To: Ivan Krylov <ikrylov at disroot.org>, r-help at r-project.org
Subject: Re: [R] duplicated() on zero-column data frames returns empty vector
Message-ID: <1379736116.7985600.1712306452176 at mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Do you mean the row names should mean all the rows should be counted as
non-duplicates? Yes, I can see the argument for that, thanks. I must say I'm
still puzzled at what interpretation would motivate the current behaviour of
returning a logical(0), however.

Date: Sun, 7 Apr 2024 11:00:51 +0300
From: Ivan Krylov <ikrylov at disroot.org>
To: Jorgen Harmse <JHarmse at roku.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>,
        "markwebster204 at yahoo.co.uk" <markwebster204 at yahoo.co.uk>
Subject: Re: [R] duplicated() on zero-column data frames returns empty
Message-ID: <20240407110051.7924c03c at Tarkus>
Content-Type: text/plain; charset="utf-8"

On Fri, 5 Apr 2024 16:08:13 +0000,
Jorgen Harmse <JHarmse at roku.com> wrote:

> if duplicated really treated a row name as part of the row then
> any(duplicated(data.frame(...))) would always be FALSE. My expectation
> is that if key1 is a subset of key2 then all(duplicated(df[key1]) >=
> duplicated(df[key2])) should always be TRUE.

That's a good argument, thank you!
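
The expectation quoted above can be tried on a small data frame. The data and the names key1 and key2 below are made up for illustration; a row that repeats on the larger key also repeats on the smaller one, so the comparison holds elementwise.

```r
# Toy data, invented for this example.
df <- data.frame(a = c(1, 1, 2, 2, 1),
                 b = c("x", "y", "x", "x", "x"))
key1 <- "a"            # a subset of key2
key2 <- c("a", "b")

d1 <- duplicated(df[key1])
d2 <- duplicated(df[key2])
monotone <- all(d1 >= d2)
print(monotone)

# With an empty key the comparison only makes sense if duplicated() returns
# one element per row; this thread reports a zero-length result instead.
d0 <- duplicated(df[character(0)])
print(length(d0))
```

If the zero-column result has length 0, comparing it with d1 no longer lines up row by row, which is the inconsistency under discussion.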

Would you suggest similar changes to duplicated.matrix too? Currently
it too returns 0-length output for 0-column inputs:

# 0-column matrix for 0-column input
str(duplicated(matrix(0, 5, 0)))
# logi[1:5, 0 ]

# 1-column matrix for 1-column input
str(duplicated(matrix(0, 5, 1)))
# logi [1:5, 1] FALSE TRUE TRUE TRUE TRUE

# a dim-1 array for >1-column input
str(duplicated(matrix(0, 5, 10)))
# logi [1:5(1d)] FALSE TRUE TRUE TRUE TRUE

--
Best regards,
Ivan


Ivan Krylov

2024-May-12 15:54 UTC

[R] duplicated() on zero-column data frames returns empty

(Sorry for only getting back to this more than a month later.)

On Mon, 8 Apr 2024 17:03:00 +0000,
Jorgen Harmse <JHarmse at roku.com> wrote:

> What is the policy for changing something that is wrong? There is a
> trade-off between breaking old code that worked around a problem and
> breaking new code written by people who make reasonable assumptions.

First of all, quantify the breakage. Does the proposed change break
`make check-devel`? Does it break CRAN and Bioconductor? (This one is
hard to measure properly: someone will have to run >20000 R CMD checks
times two, for "before the change" and "after the change".) Given a
persuasive case, breaking changes can still be made, but will require a
deprecation period to let the packages adjust.

If you would like to try your hand at developing a patch and make a
case for it at R-devel or the Bugzilla, the resources at
<https://contributor.r-project.org/> can be helpful.

-- 
Best regards,
Ivan
