Is the phrase "index <- match(x, sort(unique(x)))" reliable, in the sense that it will never return NA? Context: Calculation of survival curves involves the concept of unique death times. I've had reported cases in the past where survfit failed, and it was due to the fact that two "differ by machine precision" values would sometimes match and sometimes not, depending on how I compared them. I've dealt with those piecemeal in the past, but am going to do a code review and make sure that I do things consistently throughout the survival package. The basic plan will be to change time to an integer, do all the work, then restore labels at the end. The above line is one candidate for the first step. An alternative is index <- as.numeric(factor(x)), with as.numeric(levels(factor(x))) as the final labeling step. This is a more severe rounding, is it not? But perhaps it is preferable? The KM branch of the current survfit routine does this, and I've had one user report a bug in that x <- runif(20) fit <- survfit(Surv(x) ~1) summary(fit, times=x) will produce lines with 0, 1 or 2 events when "they all should be 1". The same issue just came up in an rpart example, sent to me. For coxph models is may only be a matter of time. Suggestions and opinions are welcome. Terry Therneau
Hi Terry,

On 03/16/2016 08:03 AM, Therneau, Terry M., Ph.D. wrote:
> Is the phrase "index <- match(x, sort(unique(x)))" reliable, in the
> sense that it will never return NA?

This is assuming that match() and unique() will never disagree on equality between 2 floating point values. I believe they share some code internally (same hashing routine?), so maybe it's reliable.

Anyway, it's always preferable not to rely on this kind of assumption.

A safer thing to do is to use rank():

  r <- rank(x, ties.method="min")   # could use "max"

Think of 'r' as a unique ID assigned to each value in 'x'. This ID takes its values in the (1, length(x)) range, but we want it to take its values in the (1, length(unique(x))) range:

  ID_remapping <- cumsum(tabulate(r, nbins=length(r)) != 0L)
  index <- ID_remapping[r]

'index' will be the same as 'match(x, sort(unique(x)))' but doesn't rely on the assumption that match() and unique() agree on equality between 2 floating point values.

Unfortunately rank() is very slow, much slower than sort(). Here is a faster solution based on sort.list(x, na.last=NA, method="quick"):

  assignID <- function(x)
  {
      oo <- sort.list(x, na.last=NA, method="quick")
      sorted <- x[oo]
      is_unique <- c(TRUE, sorted[-length(sorted)] != sorted[-1L])
      sorted_ID <- cumsum(is_unique)
      ID <- integer(length(x))
      ID[oo] <- sorted_ID
      ID
  }

'assignID(x)' is also slightly faster than 'match(x, sort(unique(x)))':

  x <- runif(5000000)
  system.time(index1 <- match(x, sort(unique(x))))
  #    user  system elapsed
  #   2.170   0.552   2.725

  system.time(index2 <- assignID(x))
  #    user  system elapsed
  #   0.885   0.032   0.917

  identical(index1, index2)
  # [1] TRUE

Cheers,
H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
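A tiny worked example of the rank()/tabulate() remapping suggested above; the input values are illustrative, not from the original post.

  x <- c(3.2, 1.5, 3.2, 7.0)
  r <- rank(x, ties.method="min")                              # 2 1 2 4
  tabulate(r, nbins=length(r))                                 # 1 2 0 1 (rank 3 is unused)
  ID_remapping <- cumsum(tabulate(r, nbins=length(r)) != 0L)   # 1 2 2 3
  ID_remapping[r]                                              # 2 1 2 3
  match(x, sort(unique(x)))                                    # 2 1 2 3, the same codes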
Herve,

Thank you for the insightful response. It led to the following check:

  > set.seed(1953)    # a good year...
  > x <- runif(5e6)
  > ux <- sort(unique(x))
  > length(ux)
  [1] 4997142
  > table(is.na(match(x, ux)))
    FALSE
  5000000
  > index1 <- match(x, ux)
  > index2 <- findInterval(x, ux)
  > all.equal(index1, index2)
  [1] TRUE
  > length(levels(factor(x)))
  [1] 999885

--------------------------

1. At the terminal, the factor command was definitely the slowest; everything else was quite reasonable for 1 million values, which is larger than I expect for a survival curve.

2. The match and unique functions agreed for this data set, but you are correct that this is not guaranteed. The findInterval command will alleviate this.

3. Factor has the advantage of documented behavior, via its call to as.character. The conversion to and from character is likely what slows it down.

Terry

On 03/17/2016 03:20 PM, Hervé Pagès wrote:
> This is assuming that match() and unique() will never disagree on
> equality between 2 floating point values. I believe they share some
> code internally (same hashing routine?), so maybe it's reliable.
>
> Anyway, it's always preferable not to rely on this kind of assumption.
>
> A safer thing to do is to use rank() [...]
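A minimal sketch of the "map times to integers, do the work, restore labels at the end" plan built on findInterval(); the function and variable names are illustrative, not taken from the survival package. Because every time value is itself an element of the sorted unique vector, findInterval() returns its exact position by ordered search rather than by a second, independent equality test.

  time_to_index <- function(time) {
      utime <- sort(unique(time))
      index <- findInterval(time, utime)   # integer code for each time value
      list(index = index, utime = utime)
  }

  z <- time_to_index(c(0.5, 0.5 + 1e-16, 1.2, 0.7))
  z$index              # one distinct integer per distinct double
  z$utime[z$index]     # restores the original time values as the final step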