thr3ads.net - R devel - [Rd] factor() calls sort.list unnecessarily? [Jul 2009]

Martin Morgan

2009-Jul-03 14:57 UTC

[Rd] factor() calls sort.list unnecessarily?

R-devel,

factor(x) can take a long time on large character vectors (more than a
minute in the example below). This is because of a call to sort.list.

> str(x)
 chr [1:3436831] "chr5" "chr10" "chr16" "chr3" "chr4" "chr15" ...
> Rprof("/tmp/factor.Rprof")
> invisible(factor(x))
> Rprof()
> summaryRprof("/tmp/factor.Rprof")
$by.self
                 self.time self.pct total.time total.pct
"sort.list"          66.14     98.9      66.14      98.9
"unique.default"      0.26      0.4       0.26       0.4
"unique"              0.24      0.4       0.50       0.7
"match"               0.24      0.4       0.24       0.4
"factor"              0.02      0.0      66.90     100.0

$by.total
                 total.time total.pct self.time self.pct
"factor"              66.90     100.0      0.02      0.0
"sort.list"           66.14      98.9     66.14     98.9
"unique"               0.50       0.7      0.24      0.4
"unique.default"       0.26       0.4      0.26      0.4
"match"                0.24       0.4      0.24      0.4

$sampling.time
[1] 66.9

sort.list is always called but used only to determine the order of
levels, so unnecessary when levels are provided. In addition, order of
levels is for unique values of x only. Perhaps these issues are
addressed in the patch below? It does require unique() on the original
argument x, rather than only on as.character(x). At the least, perhaps
sort.list can be called only when levels are not provided?

Martin

Index: src/library/base/R/factor.R
===================================================================
--- src/library/base/R/factor.R (revision 48892)
+++ src/library/base/R/factor.R (working copy)
@@ -18,12 +18,13 @@
                    exclude = NA, ordered = is.ordered(x))
 {
     exclude <- as.vector(exclude, typeof(x))
-    ind <- sort.list(x) # or ?  order(x) which more (too ?) tolerant
+    if (missing(levels))
+        ind <- sort.list(unique(x))
     nx <- names(x)
     force(ordered)
     x <- as.character(x)
     if(missing(levels)) # get unique levels ordered by the original values
-       levels <- unique(x[ind])
+       levels <- unique(x)[ind]
     levels <- levels[is.na(match(levels, exclude))]
     f <- match(x, levels)
     if(!is.null(nx))

Petr Savicky

2009-Jul-05 09:07 UTC

On Fri, Jul 03, 2009 at 07:57:49AM -0700, Martin Morgan wrote:
[...]
> sort.list is always called but used only to determine the order of
> levels, so unnecessary when levels are provided.
I think this is correct. Replacing
  ind <- sort.list(x)
by
  if (missing(levels))
      ind <- sort.list(x)
makes factor() more efficient when the levels parameter is not missing.
Since the variable ind is not needed in that case, I think the
modification in the above form (without unique()) is correct.
The levels parameter is not used between the two tests of missing(levels),
so both tests give the same result, as needed.
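
As a small illustration of the point above, the following sketch shows that when levels are supplied, the level order comes from the levels argument itself and the codes are just match(x, levels), so no ordering of x is involved. The toy vector and the name lv are assumptions made for illustration; they are not data from this thread.

```r
# Toy data (hypothetical; not the 3.4-million-element vector from the thread).
x <- c("chr5", "chr10", "chr16", "chr3", "chr5")

# Levels given explicitly, in a deliberately non-sorted order.
lv <- c("chr5", "chr3", "chr16", "chr10")

f <- factor(x, levels = lv)

# The levels are exactly what was passed in.
levels(f)

# The integer codes are the positions of x in lv, i.e. match(x, lv).
codes <- match(x, lv)
identical(as.integer(f), codes)
```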
> In addition, order of
> levels is for unique values of x only. Perhaps these issues are
> addressed in the patch below? It does require unique() on the original
> argument x, rather than only on as.character(x)
Computing the order only for the unique values is sufficient. However,
if x is numeric, then the operation
  x <- as.character(x)
performs rounding, due to which different but very close numbers
are mapped to the same value. So the length of unique(x) may change.
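
The rounding effect can be seen in a minimal sketch like the one below. It assumes the documented behaviour that as.character() represents doubles to 15 significant digits; the two-element vector is a made-up example. The last lines show how an index computed on unique(x) for the original values can fail to line up with the unique character values.

```r
# Two doubles that differ by one unit in the last place (made-up example).
x <- c(1, 1 + .Machine$double.eps)

n_num <- length(unique(x))                 # distinct as numbers
n_chr <- length(unique(as.character(x)))   # both convert to the same string

# Index computed on the original unique values ...
ind <- sort.list(unique(x))
# ... is longer than the vector of unique character values,
# so indexing with it runs past the end and yields NA.
lev <- unique(as.character(x))[ind]

c(n_num = n_num, n_chr = n_chr)
lev
```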

A possible solution could be to keep unique(x) for the original values
and refer to it, when constructing the levels. For example, as follows.
  y <- unique(x)
  ind <- sort.list(y)
  y <- as.character(y)
  levels <- unique(y[ind])
  x <- as.character(x)
  f <- match(x, levels)
This is more efficient if the length of unique(x) is significantly
smaller than the length of x. On the other hand, if their lengths are
similar, then computing as.character() on both x and y incurs some
slowdown.
> At the least, perhaps
> sort.list can be called only when levels are not provided?
I support this part of the patch without unique().

Petr.
