Hi:
I did the following test using function ddply() in the plyr package on a toy
data frame with 50000 observations using five studies, 20 subjects per
study, 25 cycles per subject, five days per cycle and four observations by
type per day. No date-time variable was included.
# Test data frame
big <- data.frame(study = factor(rep(1:5, each = 10000)),
subject = factor(rep(1:100, each = 500)),
cycle = rep(rep(1:25, each = 20), 100),
day = rep(rep(1:5, each = 4), 500),
type = rep(c('ALB', 'ALP', 'ALT',
'AST'), 12500),
obs = rpois(50000, 70) )> dim(big)
[1] 50000 6
# 64-bit R on a Windows 7 box with 8Gb RAM and a 2.93GHz Core Duo chip.
system.time(bigtab <- ddply(big, .(study, subject, cycle, day), function(x)
xtabs(obs ~ type, data = x)))
user system elapsed
30.22 0.02 30.60
> dim(bigtab)
[1] 12500 8> head(bigtab)
study subject cycle day ALB ALP ALT AST
1 1 1 1 1 77 80 67 70
2 1 1 1 2 60 54 70 70
3 1 1 1 3 71 77 69 65
4 1 1 1 4 62 71 73 68
5 1 1 1 5 78 67 69 78
6 1 1 2 1 71 69 74 69> tail(bigtab)
study subject cycle day ALB ALP ALT AST
12495 5 100 24 5 75 83 72 70
12496 5 100 25 1 85 52 62 70
12497 5 100 25 2 79 64 84 68
12498 5 100 25 3 67 65 74 81
12499 5 100 25 4 62 86 66 80
12500 5 100 25 5 58 76 85 84
There may be an easier/more efficient way to do this with melt() and cast()
in the reshape package, but moved on when I couldn't figure it out within
ten minutes (probably because I was thinking 'xtabs of obs by type for
study/subject/cycle/day combinations - that's the ticket!' :) Packages
sqldf
and data.table are other viable options for this sort of task, and now that
there is a test data set to play with, it would be interesting to see what
else can be done. I'd be surprised if this couldn't be done within a few
seconds because the data frame is not that large.
Anyway, HTH,
Dennis
On Thu, Sep 2, 2010 at 12:24 AM, Coen van Hasselt
<coenvanhasselt@gmail.com>wrote:
> Hello,
>
> I have a data.frame with the following format:
>
> > head(clin2)
> Study Subject Type Obs Cycle Day Date Time
> 1 A001101 10108 ALB 44.00000 98 1 2004-03-11 14:26
> 2 A001101 10108 ALP 95.00000 98 1 2004-03-11 14:26
> 3 A001101 10108 ALT 61.00000 98 1 2004-03-11 14:26
> 5 A001101 10108 AST 33.00000 98 1 2004-03-11 14:26
>
> I want to transform this data.frame so that I have "Obs" columns
for
> each "Type". The full dataset is 45000 rows long. For a short
subset
> of 100 rows, reshaping takes 0.2 seconds, and produces what I want.
> All columns are either numeric or character format (incl. date/time).
>
> > reshape(clin2, v.names="Obs", timevar="Type",
>
direction="wide",idvar=c("Study","Subject","Cycle","Day","Date","Time"),)
> Study Subject Cycle Day Date Time Obs.ALB Obs.ALP Obs.ALT
> Obs.AST
> 1 A001101 10108 98 1 2004-03-11 14:26 44 95 61
> 33
> 11 A001101 10108 1 1 2004-03-12 14:01 41 85 39
> 33
> 21 A001101 10108 1 8 2004-03-22 10:34 40 90 70
> 34
> 30 A001101 10108 1 15 2004-03-29 09:56 45 97 66
> 48 [........]
>
> However, when using the same reshape command for the full data.frame
> of 45000 rows, it still wasn't finished when run overnight (8 GB RAM +
> 8 GB swap in use).
>
> The time to process this data.frame from a 100-row subset to a
> 1000-row subset increases from 0.2 sec to 60 sec.
>
> I would greatly appreciate a advice why the time for reshaping is
> increasing exponentially with the nr. of rows, and how I can do this
> in an elegant way.
>
> Thanks!
>
> Coen.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
[[alternative HTML version deleted]]