thr3ads.net - R help - [R] Optimise huge data.frame construction [Feb 2010]


Daniele Amberti

2010-Feb-24 09:34 UTC

[R] Optimise huge data.frame construction

I have data for different items (ID) in a database.
For each ID I have to get:

- the timestamp of the observation (timestamp);
- a numerical value (val) that will be my response variable in some kind
  of model;
- a variable number of variables from a known set (if the value for a
  specific variable is not present in the DB, it is 0).

To get the values above I have to loop over the IDs, perform some
calculations and store the results to construct a huge data.frame for
subsequent estimations. The number of rows for each ID varies (typically 14 to 200).

My current approach is to construct a matrix like this:

out <- c('A', 'B', 'C', 'D')
out <- matrix(-1, 5000, 3 + length(out), dimnames = list(1:5000,
c('ID', 'timestamp' , 'val', out)))

I access the out matrix by numerical index to substitute values
( out[1:n,1] <- k ).
When the matrix is full I add 5000 rows and go on.
Afterwards I remove the rows whose ID is still -1 and then replace all
other -1 values with 0.
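
For readers who want to reproduce the pattern, the grow-and-clean steps described above might look like the sketch below. The small chunk size, the helper new_chunk() and the toy fill values are assumptions made for illustration; they are not from the original code.

```r
vars  <- c("A", "B", "C", "D")
cols  <- c("ID", "timestamp", "val", vars)
chunk <- 10   # the post uses 5000; kept small here for illustration

# Hypothetical helper: a block of placeholder rows filled with -1.
new_chunk <- function(n) {
  matrix(-1, n, length(cols), dimnames = list(NULL, cols))
}

out  <- new_chunk(chunk)
used <- 0
n    <- 14    # rows produced for one hypothetical ID

# Grow by whole chunks until the new rows fit; every rbind copies the matrix.
while (used + n > nrow(out)) {
  out <- rbind(out, new_chunk(chunk))
}

rows <- used + seq_len(n)
out[rows, "ID"]        <- 1
out[rows, "timestamp"] <- seq_len(n)
out[rows, "val"]       <- 0.5
out[rows, "A"]         <- 1
used <- used + n

# Cleanup: drop unused rows, then turn the remaining placeholders into 0.
out <- out[out[, "ID"] != -1, , drop = FALSE]
out[out == -1] <- 0
```

Note that the last line would also overwrite a genuine value of exactly -1, so a placeholder such as NA may be safer if -1 can occur in the data.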

In my application an ID typically has between 14 and 200 observations
(mean around 50), but I have 15000 IDs ...
After profiling I realized that accessing the out matrix this way is too slow.

Do you have any idea how to speed up this kind of process?
I think something can be done by creating a data.frame for each ID and binding
them at the end. Is it a good idea? How can I implement that? A list of
data.frames? And then?

Below is some code that may be useful if someone would like to experiment ...

alist <- vector('list', 3)
alist[[1]] <- data.frame( ID = 1, timestamp = 1:14, val = rnorm(14),
                          A = 1, B = 2, C = 3 )
alist[[2]] <- data.frame( ID = 2, timestamp = 2:15, val = rnorm(14),
                          B = 2, C = 3, D = 4 )
alist[[3]] <- data.frame( ID = 3, timestamp = 3:30, val = rnorm(28),
                          C = 1, D = 2 )
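
One possible way to implement the "list of data.frames, bind at the end" idea is sketched below. It is not a method confirmed anywhere in this thread. The helper pad_columns() is a hypothetical name, and the sketch assumes the known variable set is exactly A to D as in the example above.

```r
# Known set of optional variables; any that are absent for an ID become 0.
vars <- c("A", "B", "C", "D")

# Hypothetical helper: add each missing variable as 0 and fix the column order.
pad_columns <- function(df, vars) {
  for (v in setdiff(vars, names(df))) {
    df[[v]] <- 0
  }
  df[, c("ID", "timestamp", "val", vars)]
}

alist <- list(
  data.frame(ID = 1, timestamp = 1:14, val = rnorm(14), A = 1, B = 2, C = 3),
  data.frame(ID = 2, timestamp = 2:15, val = rnorm(14), B = 2, C = 3, D = 4),
  data.frame(ID = 3, timestamp = 3:30, val = rnorm(28), C = 1, D = 2)
)

# Bind once at the end instead of growing the result inside the loop.
out <- do.call(rbind, lapply(alist, pad_columns, vars = vars))
```

With 15000 IDs a single rbind over many small data.frames can itself be slow; building numeric matrices per ID and binding those is a cheaper variant of the same idea.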


Thanks in advance for your valuable help.
Daniele

________________________________
ORS Srl

Via Agostino Morando 1/3 12060 Roddi (Cn) - Italy
Tel. +39 0173 620211
Fax. +39 0173 620299 / +39 0173 433111
Web Site www.ors.it


Moshe Olshansky

2010-Feb-24 10:09 UTC

[R] Optimise huge data.frame construction

Hi Daniele,

One possibility would be to make two passes. In the first pass you do not build
the matrix; you only count (in a loop) the number of rows you need. Then you
allocate a matrix of that size (only once) and fill it in the second pass.
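
A minimal sketch of that two-pass idea follows. The functions count_rows() and get_rows() are hypothetical stand-ins for the real database query and calculation, and the sketch assumes the row count per ID can be obtained cheaply.

```r
ids  <- 1:10
vars <- c("A", "B", "C", "D")

# Hypothetical cheap count of the rows an ID will produce.
count_rows <- function(id) 14 + (id %% 5)

# Hypothetical per-ID calculation: a numeric matrix with named columns.
get_rows <- function(id) {
  n <- count_rows(id)
  cbind(ID = id, timestamp = seq_len(n), val = rnorm(n), A = 1, C = 3)
}

# Pass 1: count rows only, and derive each ID's row range.
n_per_id <- vapply(ids, count_rows, numeric(1))
ends     <- cumsum(n_per_id)
starts   <- ends - n_per_id + 1

# Allocate once; filling with 0 means absent variables need no cleanup.
out <- matrix(0, sum(n_per_id), 3 + length(vars),
              dimnames = list(NULL, c("ID", "timestamp", "val", vars)))

# Pass 2: write each block into its pre-computed row range.
for (k in seq_along(ids)) {
  block <- get_rows(ids[k])
  out[starts[k]:ends[k], colnames(block)] <- block
}
```

If counting is as expensive as the full calculation, this doubles the work, so it pays off only when the first pass is cheap.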

Regards,
Moshe.

