Hi Daniele,
One possibility is to make two passes over the data. In the first pass you do
not build the matrix; you only count (in a loop) the number of rows you will
need. You then allocate the matrix once, at its final size, and fill it in the
second pass.
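
A minimal sketch of the two-pass idea, under some assumptions: count_rows() and
fetch_rows() are hypothetical stand-ins for your database query and per-ID
calculation, and db_counts just simulates how many observations each ID has.

```r
set.seed(1)
ids  <- 1:100
vars <- c("A", "B", "C", "D")
cols <- c("ID", "timestamp", "val", vars)

# Simulated database: number of observations stored for each ID (assumption).
db_counts <- sample(14:200, length(ids), replace = TRUE)

# Hypothetical helper: number of rows ID i will produce (a cheap query).
count_rows <- function(i) db_counts[i]

# Hypothetical helper: the actual rows for ID i as a numeric matrix,
# with columns in the same order as `cols`.
fetch_rows <- function(i) {
  n <- db_counts[i]
  cbind(ID = ids[i], timestamp = seq_len(n), val = rnorm(n),
        A = 0, B = 0, C = 0, D = 0)
}

# Pass 1: only count the rows, nothing is stored yet.
n_per_id <- vapply(seq_along(ids), count_rows, integer(1))

# Allocate the full matrix once, already filled with 0.
out <- matrix(0, nrow = sum(n_per_id), ncol = length(cols),
              dimnames = list(NULL, cols))

# Row range that each ID occupies in the final matrix.
end   <- cumsum(n_per_id)
start <- end - n_per_id + 1L

# Pass 2: fill each block in place; the matrix never has to grow.
for (i in seq_along(ids)) {
  out[start[i]:end[i], ] <- fetch_rows(i)
}
```

Since the matrix starts out filled with 0, there should be no -1 placeholders to
clean up afterwards.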
Regards,
Moshe.
--- On Wed, 24/2/10, Daniele Amberti <daniele.amberti at ors.it> wrote:
> From: Daniele Amberti <daniele.amberti at ors.it>
> Subject: [R] Optimise huge data.frame construction
> To: "r-help at r-project.org" <r-help at r-project.org>
> Received: Wednesday, 24 February, 2010, 8:34 PM
> I have data for different items (ID)
> in a database.
> For each ID I have to get:
>
> - Timestamp of the observation (timestamp);
>
> - numerical value (val) that will be my response variable
> in some kind of model;
>
> - a variable number of variables from a known set (if the
> value for a specific variable is not present in the DB,
> it is 0).
>
> To get to the above mentioned values I have to cycle over
> IDs, make some calculation and store results to construct a
> huge data.frame for subsequent estimations. The number of
> rows for each ID is random (typically 14 to 200).
>
> My current approach is to construct a matrix like this:
>
> out <- c('A', 'B', 'C', 'D')
> out <- matrix(-1, 5000, 3 + length(out), dimnames = list(1:5000,
> c('ID', 'timestamp', 'val', out)))
>
> I access the out matrix by numerical index to substitute
> values ( out[1:n,1] <- k ).
> When the matrix is full I add 5000 rows and go on.
> Afterwards I remove the rows with ID still set to -1 and then
> replace all other -1 values with 0.
>
> For my application an ID typically has between
> 14 and 200 observations (mean around 50), but I have 15000
> IDs ...
> After profiling I realized that accessing the out matrix
> this way is too slow.
>
> Do you have any idea how to speed up this kind of
> process?
> I think something could be done by creating a data.frame for
> each ID and binding them at the end. Is that a good idea? How can
> I implement it? A list of data.frames? And then?
>
> Below is some code that may be useful if someone would like to
> experiment ...
>
> alist <- vector('list', 3)
> alist[[1]] <- data.frame( ID = 1, timestamp = 1:14,
>                           val = rnorm(14), A = 1, B = 2, C = 3 )
> alist[[2]] <- data.frame( ID = 2, timestamp = 2:15,
>                           val = rnorm(14), B = 2, C = 3, D = 4 )
> alist[[3]] <- data.frame( ID = 3, timestamp = 3:30,
>                           val = rnorm(28), C = 1, D = 2 )
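
On the list-of-data.frames idea: here is a rough sketch, using the example data
above, of how the pieces could be padded to a common set of columns and bound
once at the end. The helper name pad() and the vectors vars and cols are my own
choices for illustration, not anything from your code.

```r
vars <- c("A", "B", "C", "D")
cols <- c("ID", "timestamp", "val", vars)

alist <- list(
  data.frame(ID = 1, timestamp = 1:14, val = rnorm(14), A = 1, B = 2, C = 3),
  data.frame(ID = 2, timestamp = 2:15, val = rnorm(14), B = 2, C = 3, D = 4),
  data.frame(ID = 3, timestamp = 3:30, val = rnorm(28), C = 1, D = 2)
)

# Add every variable missing from d as a column of zeros,
# then return the columns in one fixed order.
pad <- function(d) {
  for (v in setdiff(vars, names(d))) d[[v]] <- 0
  d[cols]
}

# Bind all pieces in a single call instead of growing the result step by step.
big <- do.call(rbind, lapply(alist, pad))
```

Whether this is fast enough for 15000 IDs is something you would have to
profile; the point is only that the binding happens once, after the loop.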
>
>
> Thanks in advance for your valuable help.
> Daniele
>