R factors are the natural way to represent enumeration types -- and should be efficient, since they are stored as small integers. But in fact, for many (though not all) operations, R factors are considerably slower than integers, or even character strings. This appears to be because whenever a factor vector is subsetted, the entire levels vector is copied. For example:

> i1 <- sample(1e4, 1e6, replace=TRUE)
> c1 <- paste('x', i1)
> f1 <- factor(c1)
> system.time(replicate(1e4, {q1 <- i1[100:200]; 1}))
   user  system elapsed
   0.03    0.00    0.04
> system.time(replicate(1e4, {q1 <- c1[100:200]; 1}))
   user  system elapsed
   0.04    0.00    0.04
> system.time(replicate(1e4, {q1 <- f1[100:200]; 1}))
   user  system elapsed
   0.67    0.00    0.68

Putting the levels vector in an environment speeds up subsetting:

myfactor <- function(...) {
    f <- factor(...)
    g <- unclass(f)
    class(g) <- "myfactor"
    ## keep the levels in an environment, so that subsetting the codes
    ## does not copy the (possibly huge) levels vector
    attr(g, "mylevels") <- as.environment(list(levels = attr(f, "levels")))
    g
}

`[.myfactor` <- function (x, ...) {
    y <- NextMethod("[")
    attributes(y) <- attributes(x)   # the environment is shared, not copied
    y
}

> m1 <- myfactor(f1)
> system.time(replicate(1e4, {q1 <- m1[100:200]; 1}))
   user  system elapsed
   0.05    0.00    0.04

Given R's value semantics, I believe this approach can be extended to most of class factor's functionality without problems, copying the environment where necessary. Some quick tests suggest that it is no slower than ordinary factors even for very small numbers of levels.

To do this, appropriate methods for this class (print, [<-, levels<-, etc.) would have to be written. Perhaps some core R functions would also have to be changed?

Am I missing some obvious flaw in this approach? Has anyone already implemented a factors package using this or some similar approach?

Thanks,

          -s
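For concreteness, here is a minimal sketch of what a couple of the missing methods might look like; these bodies are illustrative assumptions built on the myfactor constructor above, not part of the original proposal or tested code. The levels<- method copies the environment so that modification keeps value semantics.

levels.myfactor <- function(x) attr(x, "mylevels")$levels

`levels<-.myfactor` <- function(x, value) {
    ## replacing the levels must not affect other vectors sharing the
    ## same environment, so build a fresh one (copy on modify)
    attr(x, "mylevels") <- as.environment(list(levels = value))
    x
}

as.character.myfactor <- function(x, ...) levels(x)[unclass(x)]

print.myfactor <- function(x, ...) {
    ## display like a plain factor: map the integer codes back to
    ## level strings, then report the number of levels
    print(as.character(x), quote = FALSE)
    cat(nlevels(x), "levels (stored in an environment)\n")
    invisible(x)
}

A [<- method would need the same copy-on-modify treatment as levels<- whenever the assignment introduces new levels.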
On Friday 04 November 2011 at 19:19 -0400, Stavros Macrakis wrote:
> R factors are the natural way to represent enumeration types -- and
> should be efficient, since they are stored as small integers. But in
> fact, for many (though not all) operations, R factors are considerably
> slower than integers, or even character strings. This appears to be
> because whenever a factor vector is subsetted, the entire levels
> vector is copied.

Is it so common for a factor to have so many levels? One can probably argue that, in that case, using a numeric or character vector is preferable -- factors are no longer the "natural way" of representing this kind of data.

Adding code to fix a completely theoretical problem is generally not a good idea. I think you would have to come up with a real use case to have any hope of convincing the developers that a change is needed. There are probably many more interesting areas where speedups can be gained than this one.

Regards
Milan, Jeff, Patrick,

Thank you for your comments and suggestions.

Milan,

This is far from a "completely theoretical problem". I am performing text analytics on a corpus of about 2 million documents, with tens of thousands of distinct words (lemmata). It seems to me that the natural representation of words is as an "enumeration type" -- in R terms, a "factor".

Why do I think factors are the "natural way" of representing such things? Because for most kinds of analysis, only their *identity* matters (not their spelling as words), yet the human user would like to see names, not numbers. That is pretty much the definition of an enumeration type. In terms of implementation, R is very efficient at dealing with integer identities and indexing (e.g. tabulate) and not very efficient at dealing with character identities -- indeed, 'table' first converts strings into factors. Of course I could represent the lemmata as integers and perform the translation between integers and strings myself, but that would just be duplicating the function of an enumeration type.

Jeffrey,

Extending R "via the mechanisms in place" is exactly what I have in mind. Of course, if it has already been done, I would rather reuse that work than start from scratch, which is why my message explicitly asks whether there is a "factors package using this or some similar approach". I did search CRAN and was not able to find such a thing, but I may have missed something, which is why I sent my message to the list.

Patrick,

data.table certainly has some useful mechanisms, and I have been experimenting with it as an implementation vehicle, though it is not a drop-in substitute for factors. Also, although it is efficient for set operations between a small set and a large set, it is not very efficient for operations between two large sets -- I am working with its implementors to see whether we can put in place a better algorithm based on, e.g., Demaine et al. <http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.9963> and Barbay et al. <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.9365>.

Thanks everyone, and if you do come across a relevant CRAN package, I'd be very interested in hearing about it.

          -s
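As a rough illustration of the "integer identities" point above -- the sizes here are made-up stand-ins for the corpus described (millions of tokens, tens of thousands of lemmata), and no timings are claimed:

## Toy data: ~5e4 distinct "lemmata" drawn 2e6 times (assumed sizes)
set.seed(1)
w  <- sample(sprintf("lemma%05d", 1:5e4), 2e6, replace = TRUE)
fw <- factor(w)

## Counting occurrences via the factor's integer codes...
system.time(cnt1 <- tabulate(as.integer(fw), nbins = nlevels(fw)))

## ...versus starting from the character vector with table(),
## which first builds a factor internally
system.time(cnt2 <- table(w))

The expectation (not verified here) is that the tabulate() path is faster because it skips the string matching that table() must redo via factor().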