thr3ads.net - R help - [R] adding rows without loops [May 2013]

If this information is useful, please help other people find it:
Share via:

Adeel Amin

2013-May-23 05:00 UTC

[R] adding rows without loops

I'm comparing a variety of datasets with over 4M rows.  I've solved this
problem 5 different ways using a for/while loop but the processing time is
murder (over 8 hours doing this row by row per data set).  As such I'm
trying to find whether this solution is possible without a loop or one in
which the processing time is much faster.

Each dataset is a time series as such:

DF1:

    X.DATE X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0700    45     35
6 01052007   0800    42     32
7 01052007   0900    45     32
.
.
.
n

DF2

    X.DATE X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0600    45     35
6 01052007   0700    42     32
7 01052007   0800    45     32

.
.
n+4000

In other words there are 4000 more rows in DF2 then DF1 thus the datasets
are of unequal length.

I'm trying to ensure that all dataframes have the same number of X.DATE and
X.TIME entries.  Where they are missing, I'd like to insert a new row.

In the above example, when comparing DF2 to DF1, entry 01052007 0600 entry
is missing in DF1.  The solution would add a row to DF1 at the appropriate
index.

so new dataframe would be


    X.DATE X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0600    45     27
6 01052007   0700    45     35
7 01052007   0800    42     32
8 01052007   0900    45     32

Value and Value2 would be the same as row 4.

Of course this is simple to accomplish using a row by row analysis but with
of 4M rows the processing time destroying and rebinding the datasets is
very time consuming and I believe highly un-R'ish.  What am I missing?

Thanks!

	[[alternative HTML version deleted]]

Blaser Nello

2013-May-23 07:14 UTC

head link

[R] adding rows without loops

Merge should do the trick. How to best use it will depend on what you
want to do with the data after. 
The following is an example of what you could do. This will perform
best, if the rows are missing at random and do not cluster.

DF1 <- data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:5,7:9)*100,
VALUE=c(37, 42, 45, 45, 45, 42, 45), VALE2=c(29,24,28,27,35,32,32))
DF2 <- data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:8)*100,
VALUE=c(37, 42, 45, 45, 45, 42, 45), VALE2=c(29,24,28,27,35,32,32))

DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"),
all=TRUE)

while(any(is.na(DFm))){
  if (any(is.na(DFm[1,]))) stop("Complete first row required!")
  ind <- which(is.na(DFm), arr.ind=TRUE)
  prind <- matrix(c(ind[,"row"]-1, ind[,"col"]), ncol=2)
  DFm[is.na(DFm)] <- DFm[prind]
}
DFm

Best,
Nello

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Adeel Amin
Sent: Donnerstag, 23. Mai 2013 07:01
To: r-help at r-project.org
Subject: [R] adding rows without loops

I'm comparing a variety of datasets with over 4M rows.  I've solved this
problem 5 different ways using a for/while loop but the processing time
is murder (over 8 hours doing this row by row per data set).  As such
I'm trying to find whether this solution is possible without a loop or
one in which the processing time is much faster.

Each dataset is a time series as such:

DF1:

    X.DATE X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0700    45     35
6 01052007   0800    42     32
7 01052007   0900    45     32
...
...
...
n

DF2

    X.DATE X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0600    45     35
6 01052007   0700    42     32
7 01052007   0800    45     32

...
...
n+4000

In other words there are 4000 more rows in DF2 then DF1 thus the
datasets are of unequal length.

I'm trying to ensure that all dataframes have the same number of X.DATE
and X.TIME entries.  Where they are missing, I'd like to insert a new
row.

In the above example, when comparing DF2 to DF1, entry 01052007 0600
entry is missing in DF1.  The solution would add a row to DF1 at the
appropriate index.

so new dataframe would be


    X.DATE X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0600    45     27
6 01052007   0700    45     35
7 01052007   0800    42     32
8 01052007   0900    45     32

Value and Value2 would be the same as row 4.

Of course this is simple to accomplish using a row by row analysis but
with of 4M rows the processing time destroying and rebinding the
datasets is very time consuming and I believe highly un-R'ish.  What am
I missing?

Thanks!

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

R help - May 2013 - adding rows without loops

[R] adding rows without loops

[R] adding rows without loops