I am working with a dataset where there are 5 possible outcomes (coded 1:5), and I would like to create 5 categorical variables (event1...event5). I am using a for loop and if statements, but with a large dataset (approx. 100,000 rows) it takes quite a bit of time. Is there a way to speed this up? Here is some sample code of what I am currently doing.

    test2 <- rep(seq(1:5), 2000)

    # test2 is a vector, so use length(); nrow(test2) would be NULL
    event1 <- rep(0, length(test2))
    event2 <- rep(0, length(test2))
    event3 <- rep(0, length(test2))
    event4 <- rep(0, length(test2))
    event5 <- rep(0, length(test2))

    for (i in 1:length(event1)) {
        if (test2[i] == 1) { event1[i] = 1 }
        if (test2[i] == 2) { event2[i] = 1 }
        if (test2[i] == 3) { event3[i] = 1 }
        if (test2[i] == 4) { event4[i] = 1 }
        if (test2[i] == 5) { event5[i] = 1 }
    }

thanks,

Spencer
Here is one way you might do it:

    X <- sample(1:5, 100, replace=TRUE)

    # Your 5 event variables in a matrix
    model.matrix(lm(rnorm(length(X)) ~ as.factor(X) - 1))

Also, along the lines of your own approach, the following using ifelse() might be better:

    event3 <- ifelse(test2 == 3, 1, 0)

I'm sure other people will post different solutions, probably more elegant than these.

--
Chuck Cleland, Ph.D.
NDRI, Inc.
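To get a feel for how much the vectorised ifelse() call saves over the element-by-element loop from the question, a rough timing sketch follows. It assumes a vector of 100,000 codes like the one described in the question; the names `n`, `loop_time`, `vec_time`, `event3_loop` and `event3_vec` are made up for illustration, and the timings will differ from machine to machine.

```r
n <- 100000
test2 <- rep(1:5, length.out = n)   # stand-in for the real outcome codes

# Element-by-element loop, as in the question (only event3 shown)
loop_time <- system.time({
  event3_loop <- rep(0, n)
  for (i in seq_along(test2)) {
    if (test2[i] == 3) event3_loop[i] <- 1
  }
})

# Vectorised version: one comparison over the whole vector
vec_time <- system.time(event3_vec <- ifelse(test2 == 3, 1, 0))

print(loop_time)
print(vec_time)

# Both approaches should give the same 0/1 vector
print(identical(event3_loop, event3_vec))
```

Only the relative difference between the two timings is meaningful here, not the absolute numbers.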
Richard M. Heiberger
2006-Dec-29 19:11 UTC
[R] coded to categorical variables in a large dataset
## The main reason for wanting such a coding is to use it in
## a linear model. Therefore, declare the variable to be a factor
## and use it directly.

    tmp <- sample(1:5, 40, replace=TRUE)
    tmpf <- factor(tmp)
    tmp.y <- rnorm(40)
    tmp.aov <- aov(tmp.y ~ tmpf)
    summary(tmp.aov)
    contrasts(tmpf)
    update(tmp.aov, x=TRUE)$x[1:6,]

## If you really want to see the redundant column 1
## of the contrasts, that can be done with the statement

    contrasts(tmpf)
    contrasts(tmpf, how.many=5) <- contr.treatment(5, contrasts=FALSE)
    contrasts(tmpf)
    tmp2.aov <- aov(tmp.y ~ tmpf)
    summary(tmp2.aov)
    update(tmp2.aov, x=TRUE)$x[1:6,]
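The expansion that R performs on a factor can also be inspected without fitting a model. The sketch below is only an illustration of that idea, not part of the post above: it calls model.matrix() directly, with and without an intercept, and the names `X_default` and `X_full` are hypothetical. The levels are set explicitly to 1:5 so that all five columns exist even if a code happens to be missing from the sample.

```r
set.seed(1)                              # only to make the sketch repeatable
tmp  <- sample(1:5, 40, replace = TRUE)
tmpf <- factor(tmp, levels = 1:5)

# Default treatment contrasts: an intercept plus indicators for levels 2..5
X_default <- model.matrix(~ tmpf)

# Dropping the intercept gives one indicator column per level
X_full <- model.matrix(~ tmpf - 1)

print(colnames(X_default))
print(colnames(X_full))
print(head(X_full))
```

In the no-intercept matrix every row contains exactly one 1, which is the coding the original question builds by hand.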
try this:

    > test2 <- rep(seq(1:5),2000)
    > # set up a data frame and index into the columns
    > result <- data.frame(event1=rep(0,length(test2)), event2=rep(0,length(test2)),
    +     event3=rep(0,length(test2)), event4=rep(0,length(test2)), event5=rep(0,length(test2)))
    > for (i in seq(ncol(result))){
    +     result[[i]] <- ifelse(test2 == i, 1, 0)
    + }
    > str(result)
    'data.frame':   10000 obs. of  5 variables:
     $ event1: num  1 0 0 0 0 1 0 0 0 0 ...
     $ event2: num  0 1 0 0 0 0 1 0 0 0 ...
     $ event3: num  0 0 1 0 0 0 0 1 0 0 ...
     $ event4: num  0 0 0 1 0 0 0 0 1 0 ...
     $ event5: num  0 0 0 0 1 0 0 0 0 1 ...

--
Jim Holtman
Cincinnati, OH
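If the codes already live in a data frame, the same column-at-a-time idea can add the indicators to that data frame directly. This is a small sketch under assumptions: the data frame `dat` and its columns `id` and `outcome` are invented for the example and do not come from the thread.

```r
# Hypothetical data frame holding the outcome codes
dat <- data.frame(id = 1:10, outcome = rep(1:5, 2))

# Add one 0/1 column per code; each pass is a single vectorised comparison
for (k in 1:5) {
  dat[[paste("event", k, sep = "")]] <- as.numeric(dat$outcome == k)
}

str(dat)
```

The loop runs five times regardless of how many rows the data frame has, so the cost per row stays in vectorised code.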
Gabor Grothendieck
2006-Dec-29 19:40 UTC
[R] coded to categorical variables in a large dataset
As Richard has already pointed out, you may only need to convert your numeric vector to a factor, but just in case, here are a few direct answers.

Using X from Chuck's post, here are two ways of creating a 100x5 matrix of indicator variables:

    model.matrix(~ X-1, list(X = factor(X)))

    outer(X, 1:5, "==")+0

    # To create the event1, ..., event5 variables,
    # here is a way of creating them
    event1 <- (X == 1) + 0
    # and similarly for 2, 3, 4, 5

    # or do it in a loop
    for(i in 1:5) assign(paste("event", i, sep = ""), (X == i) + 0)

    # or create as columns of a data frame
    f <- function(i, j) (X == j) + 0
    as.data.frame(mapply(f, paste("event", 1:5, sep = ""), 1:5))
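As a quick sanity check that the two matrix constructions agree, the following sketch builds both and compares them element by element. The seed and the names `m1` and `m2` are assumptions added for the example; the factor levels are set explicitly to 1:5 so both matrices always have five columns.

```r
set.seed(42)                              # only to make the sketch repeatable
X <- sample(1:5, 100, replace = TRUE)

# Indicator matrix via model.matrix(), one column per factor level
m1 <- model.matrix(~ X - 1, list(X = factor(X, levels = 1:5)))

# Indicator matrix via outer(): compare every element of X with 1..5,
# then add 0 to turn TRUE/FALSE into 1/0
m2 <- outer(X, 1:5, "==") + 0

print(dim(m1))
print(dim(m2))
print(all(m1 == m2))                      # should be TRUE
```

The outer() version carries no column names or model attributes, which makes it the lighter of the two when only the 0/1 values are needed.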
Charles C. Berry
2006-Dec-29 23:25 UTC
[R] coded to categorical variables in a large dataset
As Richard suggested, you may not want to do this at all, but ...

If you want these as a matrix, this is fast and direct:

    mat <- diag(5)[ test2, ]

If not as a matrix,

    event1 <- as.numeric( test2 == 1 )

is concise, and

    for (i in 1:5) assign(paste("event",i,sep=""), as.numeric( test2==i ))

is about as fast as you can get.

HTH,

Chuck

Charles C. Berry
Dept of Family/Preventive Medicine
UC San Diego
La Jolla, San Diego 92093-0717
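The diag(5) trick works because row k of the 5x5 identity matrix has its single 1 in column k, so indexing rows by the codes picks out the right indicator row for each observation. A tiny sketch with a made-up vector `codes` (an assumption for illustration) shows this:

```r
codes <- c(3, 1, 5, 2)                   # hypothetical outcome codes

# Row-index the identity matrix by the codes: row i of the result is
# row codes[i] of diag(5)
mat <- diag(5)[codes, ]
colnames(mat) <- paste("event", 1:5, sep = "")

print(mat)
```

One caveat: with a single code the result drops to a plain vector, so `diag(5)[codes, , drop = FALSE]` is the safer form if the input might have length one.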