> [R] Yet another set of codes to optimize
> Daren Tan daren76 at hotmail.com
> Fri Dec 5 03:41:23 CET 2008
>
> I have problems converting my dataset from long to wide format. Previous
> attempts using reshape package and aggregate function were unsuccessful
> as they took too long. Apparently, my simplified solution also lasted
> as long.
>
> My complete codes is given below. When sample.size = 10000, the
> execution takes about 20 seconds. But sample.size = 100000 seems to take
> eternity. My actual sample.size is 15000000 i.e. 15 million.
>
> sample.size <- 10000
>
> m <- data.frame(Name=sample(1:100000, sample.size, T),
>                 Type=sample(1:1000, sample.size, T),
>                 Predictor=sample(LETTERS[1:10], sample.size, T))
>
> res <- function(m) {
> m.12.unique <- unique(m[,1:2])
> m.12.unique <- m.12.unique[order(m.12.unique[,1], m.12.unique[,2]),]
> v1 <- paste(m.12.unique[,1], m.12.unique[,2], sep=".")
> v2 <- c(sort(unique(m[,3])))
> res <- matrix(0, nr=length(v1), nc=length(v2), dimnames=list(v1, v2))
> m.ids <- paste(m[,1], m[,2], sep=".")
> for(i in 1:nrow(m)) {
> x <- m.ids[i]
> y <- m[i,3]
> res[x, y] <- res[x, y] + 1
> }
> res <- data.frame(m.12.unique[,1], m.12.unique[,2], res, row.names=NULL)
> colnames(res) <- c("Name", "Type", v2)
> return(res)
> }
>
> res(m)
Your for loop is tabulating the items in m.ids and m[,3]
so think of using table(). E.g., replace
res <- matrix(0, nr=length(v1), nc=length(v2), dimnames=list(v1, v2))
for(i in 1:nrow(m)) {
x <- m.ids[i]
y <- m[i,3]
res[x, y] <- res[x, y] + 1
}
with
res<-table(factor(m.ids,levels=v1), factor(m[,3]))
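To see that the two forms count the same thing, here is a small sketch on toy data. The vectors ids and pred and the seed are made up for illustration; they stand in for m.ids and m[,3] and are not part of the original code.

```r
set.seed(1)  # arbitrary seed, only so the toy data is reproducible
ids  <- sample(c("1.1", "1.2", "2.1"), 20, replace = TRUE)  # stands in for m.ids
pred <- sample(LETTERS[1:3], 20, replace = TRUE)            # stands in for m[,3]
v1 <- sort(unique(ids))
v2 <- sort(unique(pred))

# Loop version: increment one cell per input row
loop_counts <- matrix(0, nrow = length(v1), ncol = length(v2),
                      dimnames = list(v1, v2))
for (i in seq_along(ids)) {
  loop_counts[ids[i], pred[i]] <- loop_counts[ids[i], pred[i]] + 1
}

# table() version: the factor levels fix the row and column order
tab_counts <- table(factor(ids, levels = v1), factor(pred, levels = v2))

# Compare cell by cell, ignoring the class attribute
same_counts <- all(as.vector(loop_counts) == as.vector(tab_counts))
```

The explicit levels= arguments are what keep the rows and columns of the table in the same order as the preallocated matrix.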
There is a bit of trickiness in putting this table into
the data.frame. Since as.data.frame(tableObject) works very
differently than as.data.frame(matrixObject), the naive
data.frame(m.12.unique[,1], m.12.unique[,2], res, row.names=NULL)
fails. You need to convert the table res into a matrix with
the same data, dimensions, and dimnames.
data.frame(m.12.unique[,1], m.12.unique[,2], as.matrix(res),
row.names=NULL)
also fails because a "table" object is a "matrix" object so
as.matrix(tableObject) returns its input, unchanged.
as(res,"matrix") seems to work, as does the wordier
but more explicit array(res,dim(res),dimnames(res)).
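A minimal sketch of the difference, using a tiny made-up table (the names tab, long, mat and wide are only for illustration). It uses the array() form, since that one depends only on base R:

```r
# A 2 x 2 table of counts built from toy data
tab <- table(factor(c("a", "b", "a"), levels = c("a", "b")),
             factor(c("X", "Y", "Y"), levels = c("X", "Y")))

# Passing the table itself: data.frame() uses the table method,
# which returns one row per cell (long format)
long <- data.frame(tab)

# Rebuilding it as a plain array drops the "table" class,
# so data.frame() keeps the rows-by-columns layout
mat  <- array(tab, dim(tab), dimnames(tab))
wide <- data.frame(mat)

dim(long)
dim(wide)
```

Comparing dim(long) with dim(wide) shows why the naive call does not give the wide layout the original function returns.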
res1 <- function(m) {
    m.12.unique <- unique(m[,1:2])
    m.12.unique <- m.12.unique[order(m.12.unique[,1], m.12.unique[,2]),]
    v1 <- paste(m.12.unique[,1], m.12.unique[,2], sep=".")
    v2 <- c(sort(unique(m[,3])))
    m.ids <- paste(m[,1], m[,2], sep=".")
    # table() does the counting that matrix(0, ...) plus the for loop did
    res <- table(factor(m.ids,levels=v1), factor(m[,3]))
    # as(res, "matrix") so that data.frame() treats the counts as a matrix
    res <- data.frame(m.12.unique[,1], m.12.unique[,2],
                      as(res, "matrix"), row.names=NULL)
    colnames(res) <- c("Name", "Type", v2)
    return(res)
}
Here is a table of times for your original function, time0,
and this modified one, time1. It looks like res1 eventually
becomes worse than linear, but for a much larger size than
your original. sort() and unique() cannot have linear time
so they may be becoming factors at size=1e6.
      size   time0  time1
1       10   0.012  0.012
2      100   0.032  0.014
3      200   0.061  0.016
4      400   0.126  0.020
5      800   0.286  0.028
6     1000   0.383  0.033
7     2000   2.337  0.054
8     4000   8.578  0.100
9     8000  39.955  0.214
10   10000  68.767  0.318
11   20000 327.973  1.057
12  100000      NA  3.021
13 1000000      NA 89.881
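One way to collect a table like this is sketched below. The helpers make_data and time_at_sizes are hypothetical names, not the code that produced the numbers above, and the function timed at the end is only a cheap stand-in; pass res or res1 instead to time those.

```r
# Hypothetical helper: build test data the same way as the original post
make_data <- function(sample.size) {
  data.frame(Name = sample(1:100000, sample.size, TRUE),
             Type = sample(1:1000, sample.size, TRUE),
             Predictor = sample(LETTERS[1:10], sample.size, TRUE))
}

# Hypothetical helper: elapsed seconds of f(m) at each size
time_at_sizes <- function(f, sizes) {
  elapsed <- sapply(sizes, function(n) {
    m <- make_data(n)
    system.time(f(m))[["elapsed"]]
  })
  data.frame(size = sizes, time = elapsed)
}

# Stand-in function; replace with res or res1 to time those
timings <- time_at_sizes(function(m) table(m$Predictor), c(10, 100, 1000))
```

Doubling the size between runs makes it easy to see from the ratio of successive times whether the growth is roughly linear or worse.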
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com