thr3ads.net - R help - [R] duplicated() on zero-column data frames returns empty vector [Apr 2024]

Mark Webster

2024-Apr-05 03:58 UTC

[R] duplicated() on zero-column data frames returns empty vector

Hello,
I found what looks to me like an odd edge case for duplicated(), unique() etc.
on data frames with zero columns, due to duplicated() returning a zero-length
vector for them, regardless of the number of rows:

df <- data.frame(a = 1:5)
df$a <- NULL
nrow(df)          # 5 (row count preserved by row.names)
duplicated(df)    # logical(0), should be c(FALSE, TRUE, TRUE, TRUE, TRUE)
anyDuplicated(df) # 0, should be 2
nrow(unique(df))  # 0, should be 1

This behaviour isn't mentioned in the documentation; is there a reason for
it to work like this? I'm struggling to see this as anything other than
unintended behaviour, as a consequence of the
do.call(Map, `names<-`(c(list, x), NULL)) expression in
duplicated.data.frame returning an empty list instead of a list of empty
lists.
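
The mechanism described above can be reproduced outside duplicated(). The snippet below is only a sketch that evaluates the same expression by hand; the variable names rows_one_col and rows_zero_col are made up for illustration, and it does not depend on how any particular R version implements duplicated.data.frame.

```r
df <- data.frame(a = 1:5)

# With one column the expression is Map(list, a): one single-element
# list per row.
rows_one_col <- do.call(Map, `names<-`(c(list, df), NULL))
length(rows_one_col)   # 5

# With zero columns it reduces to Map(list), which has nothing to iterate
# over and returns an empty list, whatever nrow(df) says.
df$a <- NULL
rows_zero_col <- do.call(Map, `names<-`(c(list, df), NULL))
length(rows_zero_col)      # 0
duplicated(rows_zero_col)  # logical(0)
```
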
Other data frame libraries have similar behaviour: tibble does the same;
data.table, Python's pandas and Rust's polars drop all the rows as soon
as there are zero columns, because they don't preserve the row count via the
row names.
---
I admit this is a case we rarely care about. However, for an example of this
being an issue, I've been running into it when treating data frames as
database relations, where they have one or more candidate keys (irreducible
subsets of the columns for which every row must have a unique value
set). Sometimes, a generated relation can have an empty candidate key, which
limits it to only having zero or one rows. Usually, I can check a relation
contains no duplicated key values by using anyDuplicated:

df2 <- unique(ChickWeight[, c("Chick", "Diet")])
keycols <- "Chick" # Each chick only has one diet (Chick -> Diet)
!anyDuplicated(df2[, keycols, drop = FALSE]) # TRUE, so Chick values are unique

When the key is empty, any row after the first must be a duplicate, but
anyDuplicated doesn't detect these because of the above edge case, so I have
to add special handling:

df3 <- data.frame(a = rep(1, 5)) # relations shouldn't have duplicate rows
keycols <- character(0) # a is constant, so key is empty
!anyDuplicated(df3[, keycols, drop = FALSE]) # TRUE because equivalent to
                                             # !any(logical(0)) by above,
                                             # should be FALSE
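
The kind of special handling meant here could look roughly like the following. This is a minimal sketch, not the code from the original message: the helper name key_is_unique is hypothetical, and it assumes that an empty key means "at most one row".

```r
# Hypothetical helper: TRUE when no two rows of df share the same values
# in the key columns.
key_is_unique <- function(df, keycols) {
  if (length(keycols) == 0L) {
    # Empty key: every row has the same (empty) key value, so the
    # relation may hold at most one row.
    return(nrow(df) <= 1L)
  }
  !anyDuplicated(df[, keycols, drop = FALSE])
}

key_is_unique(data.frame(a = rep(1, 5)), character(0))  # FALSE
key_is_unique(data.frame(a = 1), character(0))          # TRUE
```

The explicit branch sidesteps duplicated() entirely for the empty key, so the result does not depend on how zero-column data frames are treated.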
---
Best Regards,
Mark
	[[alternative HTML version deleted]]

Ivan Krylov

2024-Apr-05 08:17 UTC


Hello Mark,

On Fri, 5 Apr 2024 03:58:36 +0000 (UTC)
Mark Webster via R-help <r-help at r-project.org> wrote:
> I found what looks to me like an odd edge case for duplicated(),
> unique() etc. on data frames with zero columns, due to duplicated()
> returning a zero-length vector for them, regardless of the number of
> rows:
> df <- data.frame(a = 1:5)
> df$a <- NULL
> nrow(df)
> # 5 (row count preserved by row.names)
> duplicated(df)
> # logical(0), should be c(FALSE, TRUE, TRUE, TRUE, TRUE)
> anyDuplicated(df)
> # 0, should be 2
> This behaviour isn't mentioned in the documentation; is there a
> reason for it to work like this?
<...>
> I admit this is a case we rarely care about. However, for an example
> of this being an issue, I've been running into it when treating data
> frames as database relations, where they have one or more candidate
> keys (irreducible subsets of the columns for which every row must
> have a unique value set).

Part of the problem is that it's not obvious what a zero-column but
non-zero-row data.frame should mean.

On the one hand, your database relation use case is entirely valid. On
the other hand, if data.frames are considered to be tables of data with
row.names as their identifiers, then duplicated(d) should be returning
logical(nrow(d)) for zero-column data.frames, since row.names are
required to be unique. I'm sure that more interpretations can be
devised, requiring some other behaviour for duplicated() and friends.
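
As a rough illustration of the second reading (rows identified by their row.names), the following sketch shows what the result would look like; the variable d is chosen here only for the example.

```r
d <- data.frame(row.names = 1:5)  # zero columns, five rows

# Row names of a data frame are unique ...
anyDuplicated(row.names(d))       # 0

# ... so under this reading no row duplicates another:
logical(nrow(d))                  # FALSE FALSE FALSE FALSE FALSE
```
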

Thankfully, duplicated() and anyDuplicated() are generic functions, and
you can subclass your data frames to change their behaviour:

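duplicated.database_relation <- function(x, incomparables = FALSE, ...) {
 # With at least one column, defer to the data.frame method.
 if (length(x)) return(NextMethod())
 # With zero columns all rows are equal, so every row after the first is
 # a duplicate. seq_len() also copes with zero rows. fromLast is ignored.
 seq_len(nrow(x)) > 1L
}
.S3method('duplicated', 'database_relation')

anyDuplicated.database_relation <- function(
 x, incomparables = FALSE, ...
) {
 # Only special-case zero columns; otherwise use the data.frame method.
 if (length(x)) return(NextMethod())
 if (nrow(x) > 1L) 2L else 0L
}
.S3method('anyDuplicated', 'database_relation')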

x <- data.frame(row.names = 1:5)
class(x) <- c('database_relation', class(x))

duplicated(x)
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
anyDuplicated(x)
# [1] 2
unique(x)
# data frame with 0 columns and 1 row

> [[alternative HTML version deleted]]

Since this mailing list eats the HTML parts of the e-mails, we only get
the plain text version automatically prepared by your mailer. This one
didn't look so good:
https://stat.ethz.ch/pipermail/r-help/2024-April/479143.html

Composing your messages to the list in plain text will help avoid the
problem.

-- 
Best regards,
Ivan
