Chuck White
2010-Jan-26 20:12 UTC
[R] splitting a factor column into binary columns for each factor
Yesterday I posted the following question (my apologies for not putting a subject line):

=================question=====================
Hello -- I would like to know of a more efficient way of writing the following piece of code. Thanks.

options(stringsAsFactors=FALSE)
orig <- c(rep('11111111',100000), rep('22222222',200000),
          rep('33333333',300000), rep('44444444',400000))
orig.unique <- unique(orig)
system.time(df <- as.data.frame(sapply(orig.unique, function(x) ifelse(orig==x, 1, 0))))
===========================================

I received a response via e-mail which was **extremely** useful.

=================answer=====================
Using sapply instead of lapply here is a waste. sapply() calls lapply(), which returns a list that sapply() turns into a matrix by making each list element a column of the matrix. data.frame(matrix) then makes a list from the columns of the matrix. The one thing that sapply gives you and lapply doesn't is column names. If you attach names to orig.unique then lapply's output will have them. Also, ifelse(orig==x,1,0) is slower than the equivalent as.numeric(orig==x).

I wrote functions g0 (containing your code), g1 (using lapply), and g2 (ifelse->as.numeric).
I parameterized them by the number of '11111111' elements and they each return the data.frame created and the time it took to do it:

> g0
function(n = 1e+05)
{
    orig <- c(rep("11111111", n), rep("22222222", 2*n),
              rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    time <- system.time(df <- as.data.frame(sapply(orig.unique,
        function(x) ifelse(orig == x, 1, 0))))
    list(time = time, df = df)
}
> g1
function (n = 1e+05)
{
    orig <- c(rep("11111111", n), rep("22222222", 2*n),
              rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    names(orig.unique) <- orig.unique
    time <- system.time(df <- data.frame(check.names=FALSE,
        lapply(orig.unique, function(x) ifelse(orig == x, 1, 0))))
    list(time = time, df = df)
}
> g2
function (n = 1e+05)
{
    orig <- c(rep("11111111", n), rep("22222222", 2*n),
              rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    names(orig.unique) <- orig.unique
    time <- system.time(df <- data.frame(check.names=FALSE,
        lapply(orig.unique, function(x) as.numeric(orig == x))))
    list(time = time, df = df)
}

For n=10^5 the times were

> g0(1e5)$time
   user  system elapsed
  20.65    0.41   20.64
> g1(1e5)$time
   user  system elapsed
   2.35    0.05    2.36
> g2(1e5)$time
   user  system elapsed
   0.73    0.10    0.77

and the data.frames each produced were identical.

Another approach is to use outer() to make a matrix that gets passed to data.frame(). It seems slightly slower than g2, but small changes might make it faster.

> g3
function (n = 1e+05)
{
    orig <- c(rep("11111111", n), rep("22222222", 2 * n),
              rep("33333333", 3 * n), rep("44444444", 4 * n))
    orig.unique <- unique(orig)
    names(orig.unique) <- orig.unique
    time <- system.time(df <- data.frame(check.names=FALSE,
        outer(orig, orig.unique, function(x, y) as.numeric(x==y))))
    list(time = time, df = df)
}
> g3(1e5)$time
   user  system elapsed
   1.02    0.00    0.97

When you want to optimize code it is often handy to write functions like this to do the timing for various problem sizes.
You can quickly experiment with small versions of the problem to make sure the results are correct and the time looks reasonable and later see if the times scale up as hoped to your desired problem size.
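
The workflow described above (check correctness on a small problem first, then see how the times scale) might be sketched roughly as follows. The helper names make_orig, via_sapply and via_lapply are made up for this illustration and are not part of the original reply; the problem sizes are arbitrary and kept small so the sketch runs quickly.

```r
# Hypothetical helpers, for illustration only.

# Build the test vector, parameterized by the number of "11111111" elements.
make_orig <- function(n) {
  c(rep("11111111", n), rep("22222222", 2 * n),
    rep("33333333", 3 * n), rep("44444444", 4 * n))
}

# Original approach: sapply() + ifelse(), then convert the matrix.
via_sapply <- function(orig) {
  orig.unique <- unique(orig)
  as.data.frame(sapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))
}

# Suggested approach: named lapply() + as.numeric().
via_lapply <- function(orig) {
  orig.unique <- unique(orig)
  names(orig.unique) <- orig.unique
  data.frame(check.names = FALSE,
             lapply(orig.unique, function(x) as.numeric(orig == x)))
}

# Step 1: on a small problem, confirm both approaches give the same result.
small <- make_orig(10)
same_small <- isTRUE(all.equal(via_sapply(small), via_lapply(small)))

# Step 2: time both approaches at growing (still modest) problem sizes.
sizes <- c(1e3, 1e4)
elapsed <- sapply(sizes, function(n) {
  orig <- make_orig(n)
  c(sapply = system.time(via_sapply(orig))[["elapsed"]],
    lapply = system.time(via_lapply(orig))[["elapsed"]])
})
colnames(elapsed) <- sizes
print(elapsed)
```

Only once same_small is TRUE and the small timings look reasonable would one move on to n = 1e5 or larger. Actual timings will depend on the machine and R version, so none are shown here.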