Jorgen Harmse
2024-Apr-05  16:08 UTC
[R] duplicated() on zero-column data frames returns empty
(I do not know how to make Outlook send plain text, so I avoid apostrophes.)
For what it is worth, I agree with Mark Webster. The discussion by Ivan Krylov
is interesting, but if duplicated really treated a row name as part of the row
then any(duplicated(data.frame(...))) would always be FALSE. My expectation is
that if key1 is a subset of key2 then all(duplicated(df[key1]) >=
duplicated(df[key2])) should always be TRUE.
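
The expectation above can be checked directly. The following is only a sketch: the data frame, its column names, and the keys are made up for illustration. Since all() on a zero-length comparison is vacuously TRUE, the sketch also compares lengths, which is where the zero-column case shows up.

```r
# Hypothetical example data; column names and keys are illustrative only.
df <- data.frame(a = c(1, 1, 2, 2), b = c("x", "x", "y", "z"))
key1 <- "a"            # smaller key
key2 <- c("a", "b")    # superset of key1

d1 <- duplicated(df[key1])
d2 <- duplicated(df[key2])

# A row that is a duplicate on key2 must also be a duplicate on key1.
monotone <- length(d1) == length(d2) && all(d1 >= d2)
print(monotone)

# The empty key is the corner case under discussion: the result is expected
# to have nrow(df) entries. What is printed depends on the R version.
d0 <- duplicated(df[character(0)])
print(length(d0) == nrow(df))
```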
Incidentally, the examples for duplicated and the documentation of unique hint
that unique(x) is the same as (but more efficient than) x[!duplicated(x)] (for a
vector) or x[!duplicated(x), , drop=FALSE] (for a data frame), and this seems to
be true even in the corner case (with what I consider incorrect output from both
functions). On the other hand, I do not see any explicit guarantee about the
order of entries in unique(x) (or setdiff(...) or intersect(...)). Code using these
functions could be more efficient with explicit guarantees, but maybe the core
team wants to preserve its own flexibility. My suggestion is to include some
options so users can at least lock in the current behaviour (with a note that
future versions may achieve it less efficiently). Other options might include
sort=TRUE in case the core team develops something more efficient than
sort(unique(...)).
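
As a small illustration of the equivalence mentioned above, the following sketch compares unique() with subsetting by !duplicated() for an ordinary vector and an ordinary data frame. The example values are made up, and the check says nothing about the zero-column corner case or about any documented ordering guarantee.

```r
# Hypothetical example values, chosen only for illustration.
x <- c(3, 1, 3, 2, 1)
same_vector <- identical(unique(x), x[!duplicated(x)])

df <- data.frame(a = c(1, 1, 2), b = c("u", "u", "v"))
same_df <- identical(unique(df), df[!duplicated(df), , drop = FALSE])

print(c(vector = same_vector, data_frame = same_df))
```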
Regards,
Jorgen.
------------------------------
Message: 2
Date: Fri, 5 Apr 2024 11:17:37 +0300
From: Ivan Krylov <ikrylov at disroot.org>
To: Mark Webster via R-help <r-help at r-project.org>
Cc: Mark Webster <markwebster204 at yahoo.co.uk>
Subject: Re: [R]  duplicated() on zero-column data frames returns
        empty vector
Message-ID: <20240405111737.2b7e4c3a at arachnoid>
Content-Type: text/plain; charset="utf-8"
Hello Mark,
On Fri, 5 Apr 2024 03:58:36 +0000 (UTC),
Mark Webster via R-help <r-help at r-project.org> wrote:
> I found what looks to me like an odd edge case for duplicated(),
> unique() etc. on data frames with zero columns, due to duplicated()
> returning a zero-length vector for them, regardless of the number of
> rows:
> df <- data.frame(a = 1:5)
> df$a <- NULL
> nrow(df)
> # 5 (row count preserved by row.names)
> duplicated(df)
> # logical(0), should be c(FALSE, TRUE, TRUE, TRUE, TRUE)
> anyDuplicated(df)
> # 0, should be 2
> This behaviour isn't mentioned in the documentation; is there a
> reason for it to work like this?
<...>
> I admit this is a case we rarely care about. However, for an example
> of this being an issue, I've been running into it when treating data
> frames as database relations, where they have one or more candidate
> keys (irreducible subsets of the columns for which every row must
> have a unique value set).
Part of the problem is that it's not obvious what a zero-column but
non-zero-row data.frame should mean.
On the one hand, your database relation use case is entirely valid. On
the other hand, if data.frames are considered to be tables of data with
row.names as their identifiers, then duplicated(d) should be returning
logical(nrow(d)) for zero-column data.frames, since row.names are
required to be unique. I'm sure that more interpretations can be
devised, requiring some other behaviour for duplicated() and friends.
Thankfully, duplicated() and anyDuplicated() are generic functions, and
you can subclass your data frames to change their behaviour:
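duplicated.database_relation <- function(x, incomparables = FALSE, ...) {
 # With at least one column, defer to the data.frame method.
 if (length(x)) return(NextMethod())
 # With no columns, every row after the first is a duplicate
 # (this also gives logical(0) for zero rows).
 seq_len(nrow(x)) > 1L
}
.S3method('duplicated', 'database_relation')
anyDuplicated.database_relation <- function(x, incomparables = FALSE, ...) {
 # Only the zero-column case needs special handling.
 if (length(x)) return(NextMethod())
 if (nrow(x) > 1L) 2L else 0L
}
.S3method('anyDuplicated', 'database_relation')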
x <- data.frame(row.names = 1:5)
class(x) <- c('database_relation', class(x))
duplicated(x)
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
anyDuplicated(x)
# [1] 2
unique(x)
# data frame with 0 columns and 1 row
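
If subclassing is not convenient, the same idea can be sketched as a plain wrapper function. The name duplicated_rows is hypothetical (it is not part of base R), and the sketch assumes the database-relation interpretation, in which all rows of a zero-column data frame are equal to each other.

```r
# Hypothetical helper, not part of base R: treats all rows of a
# zero-column data frame as equal to each other.
duplicated_rows <- function(df) {
  stopifnot(is.data.frame(df))
  if (ncol(df) > 0L) {
    duplicated(df)            # ordinary case: use the data.frame method
  } else {
    seq_len(nrow(df)) > 1L    # zero columns: all rows but the first repeat
  }
}

zero_col <- data.frame(row.names = 1:5)
print(duplicated_rows(zero_col))
print(duplicated_rows(data.frame(a = c(1, 1, 2))))
```

A wrapper like this leaves duplicated() itself untouched, so unique() and other functions that call duplicated() internally are not affected by it.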
> [[alternative HTML version deleted]]
Since this mailing list eats the HTML parts of the e-mails, we only get
the plain text version automatically prepared by your mailer. This one
didn't look so good:
https://stat.ethz.ch/pipermail/r-help/2024-April/479143.html
Composing your messages to the list in plain text will help avoid the
problem.
--
Best regards,
Ivan
Ivan Krylov
2024-Apr-07  08:00 UTC
[R] duplicated() on zero-column data frames returns empty
On Fri, 5 Apr 2024 16:08:13 +0000, Jorgen Harmse <JHarmse at roku.com> wrote:
> if duplicated really treated a row name as part of the row then
> any(duplicated(data.frame(...))) would always be FALSE. My expectation
> is that if key1 is a subset of key2 then all(duplicated(df[key1]) >=
> duplicated(df[key2])) should always be TRUE.
That's a good argument, thank you!
Would you suggest similar changes to duplicated.matrix too? Currently it
too returns 0-length output for 0-column inputs:
# 0-column matrix for 0-column input
str(duplicated(matrix(0, 5, 0)))
# logi[1:5, 0 ]
# 1-column matrix for 1-column input
str(duplicated(matrix(0, 5, 1)))
# logi [1:5, 1] FALSE TRUE TRUE TRUE TRUE
# a dim-1 array for >1-column input
str(duplicated(matrix(0, 5, 10)))
# logi [1:5(1d)] FALSE TRUE TRUE TRUE TRUE
--
Best regards,
Ivan