thr3ads.net - R help - [R] reshape to wide format takes extremely long [Sep 2010]

If this information is useful, please help other people find it:
Share via:

Coen van Hasselt

2010-Sep-02 07:24 UTC

[R] reshape to wide format takes extremely long

Hello,

I have a data.frame with the following format:
> head(clin2)    Study Subject  Type      Obs Cycle Day       Date  Time
1 A001101   10108   ALB 44.00000    98   1 2004-03-11 14:26
2 A001101   10108   ALP 95.00000    98   1 2004-03-11 14:26
3 A001101   10108   ALT 61.00000    98   1 2004-03-11 14:26
5 A001101   10108   AST 33.00000    98   1 2004-03-11 14:26

I want to transform this data.frame so that I have "Obs" columns for
each "Type". The full dataset is 45000 rows long. For a short subset
of 100 rows, reshaping takes 0.2 seconds, and produces what I want.
All columns are either numeric or character format (incl. date/time).
> reshape(clin2, v.names="Obs", timevar="Type",
direction="wide",idvar=c("Study","Subject","Cycle","Day","Date","Time"),)      Study Subject Cycle Day       Date  Time Obs.ALB Obs.ALP Obs.ALT Obs.AST
1   A001101   10108    98   1 2004-03-11 14:26      44      95      61      33
11  A001101   10108     1   1 2004-03-12 14:01      41      85      39      33
21  A001101   10108     1   8 2004-03-22 10:34      40      90      70      34
30  A001101   10108     1  15 2004-03-29 09:56      45      97      66
     48 [........]

However, when using the same reshape command for the full data.frame
of 45000 rows, it still wasn't finished when run overnight (8 GB RAM +
8 GB swap in use).

The time to process this data.frame from a 100-row subset to a
1000-row subset increases from 0.2 sec to 60 sec.

I would greatly appreciate a advice why the time for reshaping is
increasing exponentially with the nr. of rows, and how I can do this
in an elegant way.

Thanks!

Coen.

Dennis Murphy

2010-Sep-02 09:34 UTC

head link

[R] reshape to wide format takes extremely long

Hi:

I did the following test using function ddply() in the plyr package on a toy
data frame with 50000 observations using five studies, 20 subjects per
study, 25 cycles per subject, five days per cycle and four observations by
type per day. No date-time variable was included.

# Test data frame
big <- data.frame(study = factor(rep(1:5, each = 10000)),
                  subject = factor(rep(1:100, each = 500)),
                  cycle = rep(rep(1:25, each = 20), 100),
                  day = rep(rep(1:5, each = 4), 500),
                  type = rep(c('ALB', 'ALP', 'ALT',
'AST'), 12500),
                  obs = rpois(50000, 70) )> dim(big)[1] 50000     6

# 64-bit R on a Windows 7 box with 8Gb RAM and a 2.93GHz Core Duo chip.
system.time(bigtab <- ddply(big, .(study, subject, cycle, day), function(x)
xtabs(obs ~ type, data = x)))
   user  system elapsed
  30.22    0.02   30.60
> dim(bigtab)
[1] 12500     8> head(bigtab)  study subject cycle day ALB ALP ALT AST
1     1       1     1   1  77  80  67  70
2     1       1     1   2  60  54  70  70
3     1       1     1   3  71  77  69  65
4     1       1     1   4  62  71  73  68
5     1       1     1   5  78  67  69  78
6     1       1     2   1  71  69  74  69> tail(bigtab)      study subject cycle day ALB ALP ALT AST
12495     5     100    24   5  75  83  72  70
12496     5     100    25   1  85  52  62  70
12497     5     100    25   2  79  64  84  68
12498     5     100    25   3  67  65  74  81
12499     5     100    25   4  62  86  66  80
12500     5     100    25   5  58  76  85  84

There may be an easier/more efficient way to do this with melt() and cast()
in the reshape package, but moved on when I couldn't figure it out within
ten minutes (probably because I was thinking 'xtabs of obs by type for
study/subject/cycle/day combinations - that's the ticket!' :) Packages
sqldf
and data.table are other viable options for this sort of task, and now that
there is a test data set to play with, it would be interesting to see what
else can be done. I'd be surprised if this couldn't be done within a few
seconds because the data frame is not that large.

Anyway, HTH,
Dennis



On Thu, Sep 2, 2010 at 12:24 AM, Coen van Hasselt
<coenvanhasselt@gmail.com>wrote:
> Hello,
>
> I have a data.frame with the following format:
>
> > head(clin2)
>    Study Subject  Type      Obs Cycle Day       Date  Time
> 1 A001101   10108   ALB 44.00000    98   1 2004-03-11 14:26
> 2 A001101   10108   ALP 95.00000    98   1 2004-03-11 14:26
> 3 A001101   10108   ALT 61.00000    98   1 2004-03-11 14:26
> 5 A001101   10108   AST 33.00000    98   1 2004-03-11 14:26
>
> I want to transform this data.frame so that I have "Obs" columns
for
> each "Type". The full dataset is 45000 rows long. For a short
subset
> of 100 rows, reshaping takes 0.2 seconds, and produces what I want.
> All columns are either numeric or character format (incl. date/time).
>
> > reshape(clin2, v.names="Obs", timevar="Type",
>
direction="wide",idvar=c("Study","Subject","Cycle","Day","Date","Time"),)
>      Study Subject Cycle Day       Date  Time Obs.ALB Obs.ALP Obs.ALT
> Obs.AST
> 1   A001101   10108    98   1 2004-03-11 14:26      44      95      61
>  33
> 11  A001101   10108     1   1 2004-03-12 14:01      41      85      39
>  33
> 21  A001101   10108     1   8 2004-03-22 10:34      40      90      70
>  34
> 30  A001101   10108     1  15 2004-03-29 09:56      45      97      66
>     48 [........]
>
> However, when using the same reshape command for the full data.frame
> of 45000 rows, it still wasn't finished when run overnight (8 GB RAM +
> 8 GB swap in use).
>
> The time to process this data.frame from a 100-row subset to a
> 1000-row subset increases from 0.2 sec to 60 sec.
>
> I would greatly appreciate a advice why the time for reshaping is
> increasing exponentially with the nr. of rows, and how I can do this
> in an elegant way.
>
> Thanks!
>
> Coen.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

carslaw

2010-Sep-02 10:36 UTC

head link

[R] reshape to wide format takes extremely long

picking up on Thierry's example, I don't think you need any function
because
you are just reshaping 
(not aggregating).  Therefore:

bigtab2 <- cast(data = big, study + subject + cycle + day  ~type, value
"obs")> head(bigtab2)  study subject cycle day ALB ALP ALT AST
1     1       1     1   1  66  71  83  76
2     1       1     1   2  66  87  78  58
3     1       1     1   3  72  84  78  61
4     1       1     1   4  72  63  68  69
5     1       1     1   5  64  68  89  89
6     1       1     2   1  78  65  65  76

system.time(bigtab2 <- cast(data = big, study + subject + cycle + day 
~type, value = "obs"))
   user  system elapsed 
  0.760   0.000   0.782 

david
-- 
View this message in context:
http://r.789695.n4.nabble.com/reshape-to-wide-format-takes-extremely-long-tp2487153p2506575.html
Sent from the R help mailing list archive at Nabble.com.

Possibly Parallel Threads

Search for more reasonably related threads

R help - Sep 2010 - reshape to wide format takes extremely long

[R] reshape to wide format takes extremely long

[R] reshape to wide format takes extremely long

[R] reshape to wide format takes extremely long

Possibly Parallel Threads