Sean Zhang
2011-Dec-15 06:06 UTC
[R] how to draw random numbers from many categorical distributions quickly?
Dear R helpers,

I have a question about drawing random numbers from many categorical distributions.

Consider n individuals, each following a categorical distribution defined over k categories. Consider a simple case in which n=4, k=3, as below:

catDisMat <- rbind(c(0.1,0.2,0.7), c(0.2,0.2,0.6), c(0.1,0.2,0.7), c(0.1,0.2,0.7))

outVec <- rep(NA, nrow(catDisMat))
for (i in 1:nrow(catDisMat)) {
  outVec[i] <- sample(1:3, 1, prob=catDisMat[i,], replace = TRUE)
}

I can think of one way to potentially speed it up (in reality, my n is very large, so speed matters). The approach above samples only 1 value each time. I could have sampled three values at once for c(0.1,0.2,0.7), because it appears three times. So, by doing some manipulation, I think I can implement the idea "sample(1:3, 3, prob=c(0.1,0.2,0.7), replace = TRUE)" to improve speed a bit. But I wonder whether there is a better approach for speed?

Thanks in advance.

-Sean
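The grouping idea described above could be sketched roughly as follows. This is only an illustration under assumptions: rows are treated as identical when their pasted text keys match (rowKey and groups are names made up here), so it only helps when many rows repeat exactly.

```r
catDisMat <- rbind(c(0.1, 0.2, 0.7),
                   c(0.2, 0.2, 0.6),
                   c(0.1, 0.2, 0.7),
                   c(0.1, 0.2, 0.7))

# Build one text key per row so that identical distributions share a group.
# (Assumption: exact repeats; nearly equal floating-point rows stay separate.)
rowKey <- apply(catDisMat, 1, paste, collapse = "_")
groups <- split(seq_len(nrow(catDisMat)), rowKey)

outVec <- integer(nrow(catDisMat))
for (idx in groups) {
  # One sample.int() call per distinct distribution, drawing length(idx) values.
  outVec[idx] <- sample.int(ncol(catDisMat), length(idx), replace = TRUE,
                            prob = catDisMat[idx[1], ])
}
outVec
```

With the 4-row example this makes two sampling calls instead of four. If nearly every row is distinct, the cost of building the keys probably outweighs any saving.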
Nordlund, Dan (DSHS/RDA)
2011-Dec-15 08:34 UTC
[R] how to draw random numbers from many categorical distributions quickly?
Sean,

How about something like this:

outVec <- apply(catDisMat, 1, function(x) sample(1:3, 1, prob = x, replace = TRUE))

I created a catDisMat matrix with a million rows, and apply() crunched through it in approximately 8-9 seconds on my 2.67 GHz 64-bit Windows 7 box with 12 GB of RAM. Your code above was substantially slower.

Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
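As a further possibility (not something proposed in this thread), the per-row call to sample() can be avoided altogether by inverting each row's cumulative distribution with a single vector of uniform draws. The following is a minimal sketch under that assumption; drawCategorical is a hypothetical helper name, and it expects a numeric matrix with one row of non-negative weights per individual.

```r
# Hypothetical helper: one categorical draw per row of a probability matrix.
drawCategorical <- function(probMat) {
  k <- ncol(probMat)
  # Row-wise cumulative probabilities, built column by column.
  cumMat <- probMat
  for (j in seq_len(k)[-1]) {
    cumMat[, j] <- cumMat[, j - 1] + probMat[, j]
  }
  # One uniform per row, scaled by the row total so that rows
  # need not sum to exactly 1.
  u <- runif(nrow(probMat)) * cumMat[, k]
  # Category = 1 + number of cumulative bounds that u has reached.
  # u recycles down the columns, so row i is compared with u[i].
  # pmin() caps at k to guard against floating-point edge cases.
  pmin(rowSums(u >= cumMat) + 1L, k)
}

catDisMat <- rbind(c(0.1, 0.2, 0.7),
                   c(0.2, 0.2, 0.6),
                   c(0.1, 0.2, 0.7),
                   c(0.1, 0.2, 0.7))
set.seed(42)
outVec <- drawCategorical(catDisMat)
```

The loop here runs over the k columns rather than the n rows, so the work per row is done by vectorized operations. Whether this is actually faster than apply() for a given n and k would need to be checked with system.time() on the real data.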