Hi Jorge,
You may try:
library(dplyr)
library(tidyr)
# Looks like this is faster than the other methods.
system.time({
  wide1 <- x2 %>%
    select(-rate) %>%
    mutate(variable = factor(variable, levels = unique(variable)),
           id = factor(id, levels = unique(id))) %>%
    spread(variable, outcome)
  colnames(wide1)[-1] <- paste("outcome", colnames(wide1)[-1], sep = ".")
})
#   user  system elapsed
#  0.006   0.000   0.006
system.time(wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
                            timevar = "variable", direction = "wide"))
#   user  system elapsed
#  0.169   0.000   0.169
library(parallel)
system.time({
  sel <- as.character(unique(x2$variable))
  id <- as.character(unique(x2$id))
  # pre-allocate the wide matrix: one row per id, one column per variable
  X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
  X[, 1] <- id
  colnames(X) <- c('id', sel)
  # pull out the outcome column for each variable in parallel
  r <- mclapply(seq_along(sel), function(i) {
    x2[x2$variable == sel[i], ][, 3]
  }, mc.cores = 4)
  # bind the per-variable vectors as columns so rows stay aligned with id
  # (rbind would fill the matrix transposed)
  X[, -1] <- do.call(cbind, r)
  X
})
#   user  system elapsed
#  0.125   0.011   0.074
wide2 <- wide1
wide2$id <- as.character(wide2$id)
wide$id <- as.character(wide$id)
all.equal(wide, wide2, check.attributes = FALSE)
# [1] TRUE
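A similar check could be run against the matrix result from the mclapply version; the sketch below is not part of the timings above, and wide3 is just a throwaway name:
# A minimal sketch, assuming X and sel exist as built above; 'wide3' is an
# illustrative name, not from the original code.
wide3 <- as.data.frame(X, stringsAsFactors = FALSE)
wide3[-1] <- lapply(wide3[-1], as.integer)   # X is a character matrix
colnames(wide3)[-1] <- paste("outcome", sel, sep = ".")
all.equal(wide, wide3, check.attributes = FALSE)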
A.K.
On Sunday, June 29, 2014 11:48 PM, Jorge I Velez <jorgeivanvelez at gmail.com> wrote:
Dear R-help,
I am working with several data files stored as "filename.txt.gz" in my working
directory. After reading them in with read.table(), I can see that each of them
has four columns (variable, id, outcome, and rate) and the following structure:
# sample data
x2 <- data.frame(variable = rep(paste0('x', 1:100), each = 100),
                 id = rep(paste0('p', 1:100), 100),
                 outcome = sample(0:2, 10000, TRUE),
                 rate = runif(10000, 0.5, 1))
str(x2)
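Reading one of the gzipped files looks roughly like the sketch below; read.table() decompresses .gz files transparently, and the header setting is an assumption:
# A minimal sketch; "filename.txt.gz" is the placeholder name used above and
# header = TRUE is an assumption about the files.
x2 <- read.table("filename.txt.gz", header = TRUE)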
Each variable, i.e., x1, x2, ..., x100, is repeated as many times as the
number of unique IDs (100 in this example). What I would like to do is to
transform the data above, which is in long format, into a wide format. I can
do this by using
# reshape
wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
                timevar = "variable", direction = "wide")
str(wide)
# or a "hack" with mclapply:
require(parallel)
sel <- as.character(unique(x2$variable))
id <- as.character(unique(x2$id))
X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
X[, 1] <- id
colnames(X) <- c('id', sel)
r <- mclapply(seq_along(sel), function(i){
? ? ? ? ? ? ? ? ? ? ? ? out <- x2[x2$variable == sel[i], ][, 3]
? ? ? ? ? ? ? ? ? ? ? ? }, mc.cores = 4)
X[, -1] <- do.call(rbind, r)
X
However, I was wondering if it is possible to come up with another solution,
hopefully faster than these. Unfortunately, both of them take a very long time
to process, especially when the number of variables is very large (> 250,000)
and the number of IDs is ~2000.
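If it helps for testing, a larger data set in the same layout could be built as in the sketch below; nv, ni and xbig are just illustrative names, and the sizes are still smaller than my real data:
# Illustrative benchmark data in the same layout as x2; nv, ni and xbig are
# hypothetical names, and the sizes are smaller than the real data.
nv <- 5000    # number of variables
ni <- 2000    # number of ids
xbig <- data.frame(variable = rep(paste0('x', 1:nv), each = ni),
                   id = rep(paste0('p', 1:ni), nv),
                   outcome = sample(0:2, nv * ni, TRUE),
                   rate = runif(nv * ni, 0.5, 1))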
I would very much appreciate your suggestions. At the end of this message
is my sessionInfo().
Thank you very much in advance.
Best regards,
Jorge Velez.-
R> sessionInfo()
R version 3.0.2 Patched (2013-12-11 r64449)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] graphics  grDevices utils     datasets  parallel  compiler  stats
[8] methods   base
other attached packages:
[1] knitr_1.6.3            ggplot2_1.0.0          slidifyLibraries_0.3.1
[4] slidify_0.3.52
loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10
 [5] grid_3.0.2       gtable_0.1.2     markdown_0.7.1   MASS_7.3-33
 [9] munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2
[13] reshape2_1.4     scales_0.2.4     stringr_0.6.2    tools_3.0.2
[17] whisker_0.4      yaml_2.1.13