I have a data frame with columns which draw on the same underlying
universe, so I want them to be factors with the same level set:

--8<---------------cut here---------------start------------->8---
> z <- data.frame(a=c("a","b","c"),b=c("b","c","d"),stringsAsFactors=FALSE)
> str(z)
'data.frame': 3 obs. of 2 variables:
 $ a: chr "a" "b" "c"
 $ b: chr "b" "c" "d"
> z$a <- factor(z$a,levels=union(z$a,z$b))
> z$b <- factor(z$b,levels=union(z$a,z$b))
> str(z)
'data.frame': 3 obs. of 2 variables:
 $ a: Factor w/ 4 levels "a","b","c","d": 1 2 3
 $ b: Factor w/ 4 levels "a","b","c","d": 2 3 4
--8<---------------cut here---------------end--------------->8---

Is factor(z$a,levels=union(z$a,z$b)) the right way to handle this?
Maybe there is a better way to extract levels than union()?
(Bear in mind that I have ~10M rows and ~1M levels, so performance is an
issue.)

Thanks!

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://iris.org.il http://honestreporting.com
http://camera.org http://www.memritv.org http://jihadwatch.org
When you talk to God, it's prayer; when He talks to you, it's schizophrenia.

Hello,

The obvious simplification is to call union() only once. With 10M rows
that should save time. I then wondered whether unique() would be faster
still.

# f1: the original approach, union() is called once per column
f1 <- function(x){
    x[[1]] <- factor(x[[1]], levels = union(x[[1]], x[[2]]))
    x[[2]] <- factor(x[[2]], levels = union(x[[1]], x[[2]]))
    x
}

# f2: compute the union once and reuse it for both columns
f2 <- function(x){
    levels <- union(x[[1]], x[[2]])
    x[[1]] <- factor(x[[1]], levels = levels)
    x[[2]] <- factor(x[[2]], levels = levels)
    x
}

# f3: like f2, but with unique(c(...)) in place of union()
f3 <- function(x){
    levels <- unique(c(x[[1]], x[[2]]))
    x[[1]] <- factor(x[[1]], levels = levels)
    x[[2]] <- factor(x[[2]], levels = levels)
    x
}

set.seed(5467)
n <- 1e7
z <- data.frame(a = sample(letters[1:3], n, TRUE),
                b = sample(letters[2:4], n, TRUE),
                stringsAsFactors=FALSE)

t1 <- system.time(z1 <- f1(z))
t2 <- system.time(z2 <- f2(z))
t3 <- system.time(z3 <- f3(z))

identical(z1, z2)
#[1] TRUE
identical(z1, z3)
#[1] TRUE

rbind(t1, t2, t3)
   user.self sys.self elapsed user.child sys.child
t1      2.55     0.47    3.01         NA        NA
t2      1.57     0.29    1.87         NA        NA
t3      1.51     0.26    1.78         NA        NA

Hope this helps,

Rui Barradas

On 16-09-2012 17:46, Sam Steingold wrote:
> is factor(z$a,levels=union(z$a,z$b)) the right way to handle this?
> maybe there is a better way to extract levels than union()?
> (bear in mind that I have ~10M rows and ~1M levels, so performance is an
> issue).

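The f3 idea (collect the levels once with unique(), then reuse them)
carries over to any number of columns. A minimal sketch, assuming a
hypothetical helper named share_levels() that is not part of the thread,
and assuming `cols` names the columns that should share one level set:

```r
# Hypothetical helper (not from the thread): give several columns of a
# data frame one common level set, computed a single time.
share_levels <- function(df, cols) {
    # Gather every value from the chosen columns as character,
    # then keep each distinct value once, in order of first appearance.
    lv <- unique(unlist(lapply(df[cols], as.character), use.names = FALSE))
    # Re-encode each chosen column against the shared level set.
    df[cols] <- lapply(df[cols], factor, levels = lv)
    df
}

z <- data.frame(a = c("a", "b", "c"), b = c("b", "c", "d"),
                stringsAsFactors = FALSE)
z2 <- share_levels(z, c("a", "b"))
str(z2)
```

Converting with as.character() first means the helper also accepts
columns that are already factors with differing level sets.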
If you have a million levels, is it really necessary to use a factor?
I'm not sure what advantages it will have over a string in this
circumstance (especially since you don't seem to know the levels a
priori but have to learn them from the data).

Hadley

On Sunday, September 16, 2012, Sam Steingold wrote:
> (bear in mind that I have ~10M rows and ~1M levels, so performance is an
> issue).
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
RStudio / Rice University
http://had.co.nz/
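
One rough way to weigh the factor-versus-string question is to measure
both representations on data shaped like the problem. The sketch below
is not from the thread: the sizes are assumptions scaled down for a
quick run, the "id" strings are synthetic, and the numbers will vary
with the machine and the R version.

```r
# Assumed, scaled-down sizes (the thread mentions ~10M rows, ~1M levels).
set.seed(1)
n_levels <- 1e4
n_rows   <- 1e5
ids <- paste0("id", seq_len(n_levels))      # synthetic level names
x   <- sample(ids, n_rows, replace = TRUE)  # the column as character

f <- factor(x, levels = unique(x))          # the same column as a factor

# Memory footprint of each representation
print(object.size(x), units = "Mb")
print(object.size(f), units = "Mb")

# One typical operation on each: count occurrences of every value
t_chr <- system.time(tab_chr <- table(x))
t_fac <- system.time(tab_fac <- table(f))
rbind(t_chr, t_fac)

# Both representations carry the same information
stopifnot(identical(as.character(f), x))
```

If the two come out close for the operations that matter, keeping the
columns as strings avoids the level bookkeeping altogether.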