Chuck White
2010-Jan-26 20:12 UTC
[R] splitting a factor column into binary columns for each factor
Yesterday I posted the following question (my apologies for not putting a
subject line):
=================question=====================
Hello -- I would like to know of a more efficient way of writing the following
piece of code. Thanks.
options(stringsAsFactors=FALSE)
orig <- c(rep('11111111',100000), rep('22222222',200000),
          rep('33333333',300000), rep('44444444',400000))
orig.unique <- unique(orig)
system.time(df <- as.data.frame(sapply(orig.unique, function(x)
ifelse(orig==x, 1, 0))))
===========================================
I received a response via e-mail which was **extremely** useful.
=================answer=====================
Using sapply instead of lapply here is a waste. sapply() calls lapply(), which
returns a list that sapply() turns into a matrix by making each list element a
column of the matrix. data.frame(matrix) then makes a list from the columns of
the matrix.
The one thing that sapply gives you and lapply doesn't is column names. If
you attach names to orig.unique then lapply's output will have them.
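To illustrate the point about names, here is a minimal sketch on a toy vector. The vector x and the use of nchar() are made up for illustration and are not part of the original exchange:

```r
# Toy character vector (hypothetical, for illustration only)
x <- c("a", "bb")

# sapply() uses the character values themselves as names
res_sapply <- sapply(x, function(v) nchar(v))

# lapply() on an unnamed vector returns an unnamed list
res_lapply <- lapply(x, function(v) nchar(v))

# After attaching names to the input, lapply() carries them through
names(x) <- x
res_named <- lapply(x, function(v) nchar(v))

print(names(res_sapply))
print(names(res_lapply))
print(names(res_named))
```

This is why g1 and g2 below set names(orig.unique) before calling lapply().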
Also, ifelse(orig==x,1,0) is slower than the equivalent as.numeric(orig==x). I
wrote functions g0 (containing your code), g1 (using lapply), and g2
(ifelse->as.numeric). I parameterized them by the number of
'11111111' elements, and they each return the data.frame created and the
time it took to do it:
> g0
function(n = 1e+05) {
    orig <- c(rep("11111111", n), rep("22222222", 2*n),
              rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    time <- system.time(df <- as.data.frame(sapply(orig.unique, function(x)
        ifelse(orig == x, 1, 0))))
    list(time = time, df = df)
}
> g1
function (n = 1e+05) {
    orig <- c(rep("11111111", n), rep("22222222", 2*n),
              rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    names(orig.unique) <- orig.unique
    time <- system.time(df <- data.frame(check.names=FALSE,
        lapply(orig.unique, function(x) ifelse(orig == x, 1, 0))))
    list(time = time, df = df)
}
> g2
function (n = 1e+05) {
    orig <- c(rep("11111111", n), rep("22222222", 2*n),
              rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    names(orig.unique) <- orig.unique
    time <- system.time(df <- data.frame(check.names=FALSE,
        lapply(orig.unique, function(x) as.numeric(orig == x))))
    list(time = time, df = df)
}
For n=10^5 the times were
> g0(1e5)$time
   user  system elapsed
  20.65    0.41   20.64
> g1(1e5)$time
   user  system elapsed
   2.35    0.05    2.36
> g2(1e5)$time
   user  system elapsed
   0.73    0.10    0.77
and the data.frames each produced were identical.
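A check of that kind can be sketched at a small problem size as follows. This is only an illustration, not the code used for the timings above: the shrunken vector is an assumption, and only the g1-style and g2-style constructions are compared:

```r
# Small stand-in for 'orig' (sizes shrunk so the check runs instantly)
orig <- c(rep("11111111", 2), rep("22222222", 4), rep("33333333", 6))
orig.unique <- unique(orig)
names(orig.unique) <- orig.unique

# g1-style: ifelse() inside lapply()
df1 <- data.frame(check.names = FALSE,
                  lapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))

# g2-style: as.numeric() on the logical comparison
df2 <- data.frame(check.names = FALSE,
                  lapply(orig.unique, function(x) as.numeric(orig == x)))

# TRUE when both constructions give exactly the same data.frame
same <- identical(df1, df2)
print(same)
```

Each row should contain exactly one 1, which can be confirmed with rowSums(df2).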
Another approach is to use outer() to make a matrix that gets passed to
data.frame(). It seems slightly slower than g2, but small changes might make it
faster.
> g3
function (n = 1e+05) {
    orig <- c(rep("11111111", n), rep("22222222", 2 * n),
              rep("33333333", 3 * n), rep("44444444", 4 * n))
    orig.unique <- unique(orig)
    names(orig.unique) <- orig.unique
    time <- system.time(df <- data.frame(check.names=FALSE, outer(orig,
        orig.unique, function(x, y) as.numeric(x==y))))
    list(time = time, df = df)
}
> g3(1e5)$time
   user  system elapsed
   1.02    0.00    0.97
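To see what outer() builds before it is handed to data.frame(), here is a minimal sketch on a four-element toy vector. The values "a", "b", "c" are made up for illustration:

```r
# Toy input (hypothetical values, for illustration only)
orig <- c("a", "b", "b", "c")
orig.unique <- unique(orig)
names(orig.unique) <- orig.unique

# One row per element of 'orig', one column per unique value;
# an entry is 1 where the element equals that column's value, else 0
m <- outer(orig, orig.unique, function(x, y) as.numeric(x == y))
print(m)
```

The column names come from names(orig.unique), so they survive into the data.frame when check.names=FALSE is used.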
When you want to optimize code it is often handy to write functions like this to
do the timing for various problem sizes. You can quickly experiment with small
versions of the problem to make sure the results are correct and the time looks
reasonable and later see if the times scale up as hoped to your desired problem
size.
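One way to follow that advice is sketched below. The helper name make_indicators and the sizes are hypothetical, chosen only for illustration; the helper body follows the g2 approach:

```r
# Hypothetical helper (not from the post): g2-style indicator columns
# for an arbitrary character vector
make_indicators <- function(orig) {
  orig.unique <- unique(orig)
  names(orig.unique) <- orig.unique
  data.frame(check.names = FALSE,
             lapply(orig.unique, function(x) as.numeric(orig == x)))
}

# Hypothetical problem sizes; raise them gradually toward the real size
sizes <- c(1e2, 1e3, 1e4)

# Elapsed seconds for each size
elapsed <- sapply(sizes, function(n) {
  orig <- c(rep("11111111", n), rep("22222222", 2 * n),
            rep("33333333", 3 * n), rep("44444444", 4 * n))
  system.time(make_indicators(orig))[["elapsed"]]
})

timings <- data.frame(n = sizes, elapsed = elapsed)
print(timings)
```

If elapsed time grows roughly in proportion to n, the approach should scale to the full problem; a much steeper growth is a sign to look again before running at full size.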