Gavin Simpson
2006-Nov-30 17:00 UTC
[R] Quicker way of combining vectors into a data.frame
Hi,

In a function, I compute 10 (un-named) vectors of reasonable length (4471 in the particular example I have to hand) that I want to combine into a data frame object, that the function will return.

This is very slow, so *I'm* doing something wrong if I want it to be quick and efficient, though I'm not sure what the best way to do this would be.

I know it is the combining into data frame bit that is slow, because I've Rprof'ed it:

$by.self
                        self.time self.pct total.time total.pct
"names<-.default"           16.58     52.8      16.58      52.8
"unlist"                     7.22     23.0       7.26      23.1
"data.frame"                 1.72      5.5      29.38      93.6
"duplicated.default"         1.66      5.3       1.66       5.3
"+"                          1.20      3.8       1.20       3.8
"list"                       0.40      1.3       0.40       1.3
"as.data.frame.numeric"      0.28      0.9       3.32      10.6
"apply"                      0.26      0.8       1.70       5.4
"pmatch"                     0.22      0.7       0.22       0.7
"paste"                      0.20      0.6       0.90       2.9
"deparse"                    0.14      0.4       0.70       2.2
"eval"                       0.12      0.4      31.28      99.7
"names<-"                    0.12      0.4      16.70      53.2
"FUN"                        0.12      0.4       1.32       4.2
"names"                      0.12      0.4       0.14       0.4
"as.list.default"            0.12      0.4       0.12       0.4
"duplicated"                 0.10      0.3       1.76       5.6
"gc"                         0.10      0.3       0.10       0.3

And I stepped through it under debug() and all the calculations before are quick, and then this bit takes a little over 20 seconds to complete:

    fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
                      fNupt = fNupt,
                      rho.n = rho.n, rho.s = rho.s,
                      net.Nimm = net.Nimm,
                      net.Nden = net.Nden,
                      CLminN = CLminN,
                      CLmaxN = CLmaxN,
                      CLmaxS = CLmaxS)

I can get it down to c. 5 seconds if I do (not Rprof'ed):

    fab <- data.frame(lc.ratio, Q,
                      fNupt,
                      rho.n, rho.s,
                      net.Nimm,
                      net.Nden,
                      CLminN,
                      CLmaxN,
                      CLmaxS)

But this still seems quite a long time, so I'm thinking that there must be a quicker way of doing what I want (end up with a data.frame with the 10 vectors in it).

Can anyone enlighten me?

> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status         Patched
major          2
minor          4.0
year           2006
month          10
day            03
svn rev        39576
language       R
version.string R version 2.4.0 Patched (2006-10-03 r39576)

> sessionInfo()
R version 2.4.0 Patched (2006-10-03 r39576)
i686-pc-linux-gnu

locale:
LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "base"

Thanks in advance,

G

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                 [t] +44 (0)20 7679 0522
ECRC & ENSIS, UCL Geography,  [f] +44 (0)20 7679 0565
Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Marc Schwartz
2006-Nov-30 17:25 UTC
[R] Quicker way of combining vectors into a data.frame
On Thu, 2006-11-30 at 17:00 +0000, Gavin Simpson wrote:
> But this still seems quite a long time, so I'm thinking that there must
> be a quicker way of doing what I want (end up with a data.frame with the
> 10 vectors in it).
>
> Can anyone enlighten me?

I am inferring from the above that the 10 columns are all numeric, as there seems to be time spent in the column naming process (the lack of which speeds up your second example), as well as in as.data.frame.numeric() and related activities.

If that is correct, it is not clear why you want a data frame as opposed to a numeric matrix, but in either case:

If we have 10 vectors, named Colx, where x is 1:10 and each vector is:

> str(Col1)
 num [1:4471] 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...

Then:

> system.time(Mat <- cbind(Col1, Col2, Col3, Col4, Col5, Col6, Col7,
                           Col8, Col9, Col10))
[1] 0.002 0.000 0.001 0.000 0.000

Or:

> system.time(DF <- as.data.frame(cbind(Col1, Col2, Col3, Col4, Col5,
                                        Col6, Col7, Col8, Col9, Col10)))
[1] 0.005 0.000 0.005 0.000 0.000

You can then add colnames() subsequent to the cbind()ing:

> system.time(colnames(Mat) <- c("lc.ratio", "Q", "fNupt", "rho.n",
                                 "rho.s", "net.Nimm", "net.Nden",
                                 "CLminN", "CLmaxN", "CLmaxS"))
[1] 0.002 0.000 0.001 0.000 0.000

> system.time(colnames(DF) <- c("lc.ratio", "Q", "fNupt", "rho.n",
                                "rho.s", "net.Nimm", "net.Nden",
                                "CLminN", "CLmaxN", "CLmaxS"))
[1] 0.011 0.000 0.020 0.000 0.000

> str(Mat)
 num [1:4471, 1:10] 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:10] "lc.ratio" "Q" "fNupt" "rho.n" ...

> str(DF)
'data.frame':   4471 obs. of  10 variables:
 $ lc.ratio: num  0.1423 0.1873 -1.8129 0.0255 -1.7650 ...
 $ Q       : num  0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
 $ fNupt   : num  -0.1718 -0.0549 1.5194 -1.6127 -1.2019 ...
 $ rho.n   : num  -0.740 0.240 0.522 -1.492 1.003 ...
 $ rho.s   : num  -0.2363 -1.6248 -0.3045 0.0294 0.1240 ...
 $ net.Nimm: num  -0.774 0.947 -1.098 0.809 1.216 ...
 $ net.Nden: num  -0.198 -0.135 -0.300 -0.618 -0.784 ...
 $ CLminN  : num  0.924 -3.265 0.211 0.813 0.262 ...
 $ CLmaxN  : num  0.3212 -0.0502 -0.9978 0.9005 -1.6535 ...
 $ CLmaxS  : num  -0.520 0.278 -0.546 -0.925 1.507 ...

HTH,

Marc Schwartz
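The cbind() route described above can be tried end to end with simulated data. This is only a minimal sketch: the vectors are random stand-ins generated with rnorm(), not the values from the original function, and the list name `vecs` is an assumption made for illustration.

```r
# Simulated stand-ins for the ten numeric vectors (hypothetical data)
set.seed(1)
n <- 4471
col_names <- c("lc.ratio", "Q", "fNupt", "rho.n", "rho.s",
               "net.Nimm", "net.Nden", "CLminN", "CLmaxN", "CLmaxS")
vecs <- replicate(length(col_names), rnorm(n), simplify = FALSE)

# Bind the vectors column-wise into a numeric matrix, then name the columns
Mat <- do.call(cbind, vecs)
colnames(Mat) <- col_names

# Convert to a data frame only if a data frame is really needed
DF <- as.data.frame(Mat)

str(DF)
```

If every column is numeric and nothing downstream needs a data frame, returning `Mat` itself avoids the conversion step altogether.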
Prof Brian Ripley
2006-Nov-30 17:28 UTC
[R] Quicker way of combining vectors into a data.frame
If you are prepared to give up most of the sanity checks, see this at the bottom of read.table:

    ##  this is extremely underhanded
    ##  we should use the constructor function ...
    ##  don't try this at home kids

    class(data) <- "data.frame"
    row.names(data) <- row.names
    data

So create a (named?) list with your vectors in it, assign class "data.frame" and then

    row.names(data) <- NULL

On Thu, 30 Nov 2006, Gavin Simpson wrote:
> But this still seems quite a long time, so I'm thinking that there must
> be a quicker way of doing what I want (end up with a data.frame with the
> 10 vectors in it).
>
> Can anyone enlighten me?

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
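A minimal sketch of the list-then-class idea, under stated assumptions: the data are simulated, and instead of the row.names(data) <- NULL step from the message above it sets the "row.names" attribute directly with base R's .set_row_names(), because the replacement function may not behave the same way on a bare list in every R version. As the message warns, this skips the checks that data.frame() performs (equal lengths, valid names), so it is only reasonable when those are already guaranteed.

```r
# Simulated stand-ins for the ten numeric vectors (hypothetical data)
set.seed(1)
col_names <- c("lc.ratio", "Q", "fNupt", "rho.n", "rho.s",
               "net.Nimm", "net.Nden", "CLminN", "CLmaxN", "CLmaxS")
fab <- replicate(length(col_names), rnorm(4471), simplify = FALSE)
names(fab) <- col_names

# All components must already have the same length: nothing checks this here
n <- length(fab[[1]])

# Assumption: set compact automatic row names directly, then the class
attr(fab, "row.names") <- .set_row_names(n)
class(fab) <- "data.frame"

str(fab)
```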
Sebastian Weber
2006-Nov-30 17:29 UTC
[R] Quicker way of combining vectors into a data.frame
Hi!

I don't know for sure - and I have not tried it yet, but how about allocating a matrix which will hold all stuff, then put all vectors in it and at last assign some dimnames to it:

    data <- matrix(0, ncol=5, nrow=length(vec1))
    data[,1] <- vec1
    ...
    dimnames(data) <- list(NULL, c(1,2,3,4,5))
    as.data.frame(data)

I forgot, I of course assume all of your vectors to be numeric ...

Hope that helps!

Greetings,

Sebastian

On Thu, 2006-11-30 at 17:00 +0000, Gavin Simpson wrote:
> But this still seems quite a long time, so I'm thinking that there must
> be a quicker way of doing what I want (end up with a data.frame with the
> 10 vectors in it).
>
> Can anyone enlighten me?
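The preallocation idea can be written out in full as follows. This is a sketch under assumptions: the five vectors are simulated, and the names `vecs` and `vec1` to `vec5` are placeholders chosen for illustration. Each vector fills one column, and the column names go in the second element of dimnames.

```r
# Simulated stand-ins for the vectors (hypothetical data and names)
set.seed(1)
n <- 4471
vecs <- list(vec1 = rnorm(n), vec2 = rnorm(n), vec3 = rnorm(n),
             vec4 = rnorm(n), vec5 = rnorm(n))

# Allocate the full matrix once, one column per vector
data <- matrix(0, nrow = n, ncol = length(vecs))

# Fill column by column
for (j in seq_along(vecs)) {
  data[, j] <- vecs[[j]]
}

# dimnames is list(row names, column names); rows stay unnamed
dimnames(data) <- list(NULL, names(vecs))

fab <- as.data.frame(data)
str(fab)
```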
Marc Schwartz
2006-Nov-30 19:41 UTC
[R] Quicker way of combining vectors into a data.frame
On Thu, 2006-11-30 at 19:26 +0000, Gavin Simpson wrote:
> On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:
>
> Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
> for your comments and suggestions.
>
> I noticed that two of the vectors were named and so I removed the names
> (names(vec) <- NULL) and that pushed the execution time for the function
> from c. 40 seconds to c. 115 seconds and all the time was taken within
> the data.frame(...) call. So having names *on* some of the vectors
> seemed to help things along, which was the opposite of what I had
> expected.
>
> If I use the cbind method of Marc, then the execution time for the
> function drops to c. 1 second (most of which is in the calculation of
> one of the vectors). So I guess I can work round this now.
>
> What I find interesting is that:
>
> > test.dat <- rnorm(4471)
> > system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
> + col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
> + col8 = test.dat, col9 = test.dat, col10 = test.dat))
> [1] 0.008 0.000 0.007 0.000 0.000
>
> Whereas doing exactly the same thing with different data in the function
> gives the following timings:
>
> > system.time(fab <- data.frame(lc.ratio, Q,
> + fNupt,
> + rho.n, rho.s,
> + net.Nimm,
> + net.Nden,
> + CLminN,
> + CLmaxN,
> + CLmaxS))
> [1] 173.415   0.260 192.192   0.000   0.000
>
> Most of that was without a change in memory, but towards the end for c.
> 5 seconds memory use by R increased by 200-300 MB.
>
> and...
>
> > system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
> + fNupt = fNupt,
> + rho.n = rho.n, rho.s = rho.s,
> + net.Nimm = net.Nimm,
> + net.Nden = net.Nden,
> + CLminN = CLminN,
> + CLmaxN = CLmaxN,
> + CLmaxS = CLmaxS))
> [1]  99.966   0.140 114.091   0.000   0.000
>
> Again with a slight increase in memory usage in last 5 seconds. So now,
> having stripped the names of two of the vectors (so now all are
> un-named), the un-named version of the data.frame call is almost twice
> as slow as the named data.frame call.
>
> If I leave the names on the two vectors that had them, I get the
> following timings for those same calls
>
> > system.time(fab <- data.frame(lc.ratio, Q,
> + fNupt,
> + rho.n, rho.s,
> + net.Nimm,
> + net.Nden,
> + CLminN,
> + CLmaxN,
> + CLmaxS))
> [1]  96.234   0.244 101.706   0.000   0.000
>
> > system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
> + fNupt = fNupt,
> + rho.n = rho.n, rho.s = rho.s,
> + net.Nimm = net.Nimm,
> + net.Nden = net.Nden,
> + CLminN = CLminN,
> + CLmaxN = CLmaxN,
> + CLmaxS = CLmaxS))
> [1] 13.597  0.088 15.868  0.000  0.000
>
> So having the 2 named vectors and using the named version of the
> data.frame call is the fastest combination.
>
> This is all done within the debugger at the time when I would be
> generating fab, and if I do,
>
> > system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
> + col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
> + col8 = test.dat, col9 = test.dat, col10 = test.dat))
> [1] 0.008 0.000 0.007 0.000 0.000
>
> (as above) at this point in the debugger it is exceedingly quick.
>
> I just don't understand what is going on with data.frame.
>
> I have yet to try Prof. Ripley's suggestion of being a bit naughty with
> R - I'll see if that is any quicker.
>
> Once again, thanks to you all for your suggestions.

Gavin,

Can you post the results of:

    str(fab)

and

    str(lc.ratio)
    str(Q)
    str(fNupt)
    str(rho.n)
    str(rho.s)
    str(net.Nimm)
    str(net.Nden)
    str(CLminN)
    str(CLmaxN)
    str(CLmaxS)

This is taking way too long. There is either something about one or more of these objects that is more complex than just being simple vectors, or there is something corrupt in your R session/environment.

You might want to try running a new and clean R session using:

    R --vanilla

and then re-run your code to see if that changes anything. If so, it suggests that my latter idea may be in play.

HTH,

Marc
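As a complement to reading the str() output by eye, a small summary can flag objects that are not plain vectors. This is a hypothetical sketch, not a diagnosis of the problem in this thread: the three example objects are made up to show a plain vector, a named vector and a one-column matrix, and the list name `vecs` is an assumption.

```r
# Made-up examples: a plain vector, a named vector and a 1-column matrix
set.seed(1)
n <- 5
vecs <- list(
  plain  = rnorm(n),
  named  = setNames(rnorm(n), paste0("site", seq_len(n))),
  onecol = matrix(rnorm(n), ncol = 1)
)

# One row per object: its class, and whether it carries names or a dim attribute
report <- data.frame(
  object    = names(vecs),
  class     = vapply(vecs, function(v) class(v)[1], character(1)),
  has_names = vapply(vecs, function(v) !is.null(names(v)), logical(1)),
  has_dim   = vapply(vecs, function(v) !is.null(dim(v)), logical(1)),
  row.names = NULL
)

print(report)
```

In the real function the list would hold lc.ratio, Q and the other eight objects; anything flagged in `has_names` or `has_dim` is a candidate for closer inspection with str() or attributes().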