Hi,
I am experiencing a long delay when using dataframes inside loops and was wondering if this is a bug or not.
Example code:

> st <- rep(1,100000)
> ed <- rep(2,100000)
> for(i in 1:length(st)) st[i] <- ed[i] # works fine
> df <- data.frame(start=st,end=ed)
> for(i in 1:dim(df)[1]) df[i,1] <- df[i,2] # takes forever

R: R 2.0.0 (2004-10-04)
OS: Linux, Fedora Core 2
kernel: 2.6.10-1.14_FC2
cpu: AMD Athlon XP 1600.
mem: 500MB.

The example above is only to illustrate the problem. I need loops to apply some functions on pairs (not necessarily successive) of rows in a dataframe.

Thankful for any advice,
Firas.
You are discovering part of the overhead of using a data frame. The way you specify the subset of the data frame to replace matters somewhat:

> st <- rep(1,1e4)
> ed <- rep(2,1e4)
> df <- data.frame(start=st, end=ed)
> system.time(for (i in 1:dim(df)[1]) df[i,1] <- df[i,2], gcFirst=TRUE)
[1] 35.96 0.10 36.37 NA NA
> df <- data.frame(start=st, end=ed)
> system.time(for (i in 1:dim(df)[1]) df[[1]][i] <- df[[2]][i], gcFirst=TRUE)
[1] 22.63 0.17 22.88 NA NA
> df <- data.frame(start=st, end=ed)
> system.time(for (i in 1:dim(df)[1]) df$start[i] <- df$end[i], gcFirst=TRUE)
[1] 19.29 0.13 19.46 NA NA

If you have all numeric data, you might as well use a matrix instead of a data frame:

> m <- cbind(start=st, end=ed)
> str(m)
 num [1:10000, 1:2] 2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "start" "end"
> system.time(for (i in 1:nrow(df)) m[i,1] <- m[i,2], gcFirst=TRUE)
[1] 0.06 0.00 0.08 NA NA

Andy

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
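As a rough illustration of the point in this reply (not part of the original message): since each assignment into a data frame is expensive, one option is to pull the columns out as plain vectors, loop over those, and assign back to the data frame once. The helper name `copy_end_to_start` is hypothetical, and the column names simply follow the thread's example.

```r
# Hypothetical helper (not from the thread): do the element-wise work on
# plain vectors and touch the data frame only once at the end.
copy_end_to_start <- function(df) {
  start <- df$start            # extract the columns as ordinary vectors
  end   <- df$end
  for (i in seq_along(start)) {
    start[i] <- end[i]         # plain vector assignment inside the loop
  }
  df$start <- start            # single data frame assignment
  df
}

df  <- data.frame(start = rep(1, 1e4), end = rep(2, 1e4))
out <- copy_end_to_start(df)
```

Wrapping either version in `system.time()` shows the difference on a given machine; no timings are claimed here.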
An addendum: If you must use a data frame (e.g., you have mixed data types), the following might help:

> df <- list(start=st, end=ed)
> system.time({for (i in 1:length(df[[1]])) df$start[i] <- df$end[i];
+ df <- as.data.frame(df)}, gcFirst=TRUE)
[1] 0.14 0.01 0.15 NA NA

I.e., keep it as a list until all manipulations are done, then coerce to data frame.

Andy
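The addendum's list-then-coerce idea can be sketched for the mixed-type case it mentions. The column names `value` and `label` and the loop body below are assumptions made up for illustration; they do not come from the thread.

```r
# Assumed example data (not from the thread): one numeric and one
# character component, kept in a plain list while the loop runs.
n <- 1e4
cols <- list(value = rep(1, n), label = rep("a", n))

for (i in seq_len(n)) {
  cols$value[i] <- i * 2                               # update list elements
  cols$label[i] <- if (i %% 2 == 0) "even" else "odd"  # mixed types are fine
}

# Coerce once, after all manipulations are done.
df <- as.data.frame(cols, stringsAsFactors = FALSE)
```

The loop only ever modifies vectors held in a list, so the data frame replacement methods are never called until the final coercion.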
On Feb 25, 2005, at 6:06 AM, Firas Swidan wrote:

> I am experiencing a long delay when using dataframes inside loops and
> was wondering if this is a bug or not.
> [...]
> The example above is only to illustrate the problem. I need loops to
> apply some functions on pairs (not necessarily successive) of rows in a
> dataframe.

I'm not an expert, but working with dataframes is typically slower than working with the equivalent matrix. If it is possible (the data are all of the same type, as they are above), working with the equivalent matrix is probably faster. So, I think the general answer to the implied question is that dataframe processing is slower than vector processing or the equivalent matrix processing. If you post more details about your specific problem, folks may be able to find creative ways of speeding things up, if speed remains a concern.

Sean
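For the use case described in the question (a function applied to pairs of rows that are not necessarily successive), a matrix version might look like the sketch below. The data, the index matrix `pairs`, and the function `gap` are all hypothetical; the actual function was never posted in the thread.

```r
# Assumed data and pair indices (hypothetical, for illustration only).
m <- cbind(start = c(1, 5, 10, 20), end = c(3, 8, 15, 30))
pairs <- cbind(i = c(1, 2, 1), j = c(3, 4, 4))   # each row names two rows of m

# Hypothetical per-pair function: distance from the end of row i
# to the start of row j.
gap <- function(row_i, row_j) row_j["start"] - row_i["end"]

# Loop version, indexing the matrix rather than a data frame.
res_loop <- numeric(nrow(pairs))
for (k in seq_len(nrow(pairs))) {
  res_loop[k] <- gap(m[pairs[k, "i"], ], m[pairs[k, "j"], ])
}

# Vectorized version: index whole columns with the pair indices at once.
res_vec <- m[pairs[, "j"], "start"] - m[pairs[, "i"], "end"]
```

When the per-pair function can be written in terms of whole columns, as in the last line, the loop can be dropped entirely; otherwise the matrix loop is the direct counterpart of the data frame loop in the question.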