Dear R listers --

The program below does the following tasks:

1. It creates a file (wintemp4) that is a subset of alldata4 consisting of "winner" records in 50 industry groups (about 5400 obs);
2. It defines a function (myppr1) that runs the ppr function in modreg once to generate goodness of fit (sum of squared errors) measures by number of terms included in the model, and then reruns ppr using the number of terms with the lowest sum of squared errors;
3. It grinds through a loop, subsetting wintemp4 by group and running myppr1 for each group subset; and
4. It puts the ppr output into a separate vector element for each group (in an attempt to avoid "growing" the vector).

I am using R version 1.2.2 in Emacs/ESS on Win98 with 256mb RAM. I have two questions; I would be most grateful for any help the list can provide:

A. This program *seems* to take a long time. I have been careful to free as much memory as I can, and the gc()'s seem to help avoid using the swapfile and to keep available system resources above 90%. Is there anything else I can do to make the program more efficient?

B. I say "seems" because after running the program for an hour, I type ctl-G to quit. The *R* session seemed to be terminated, with about 40 or so groups processed, so I opened up another R session to try to see what had happened. After I quit the second session, suddenly the first session seemed to come back to life and spit out the printed output for the rest of the groups! So I wonder if there is something I need to add to my program to "force" it to finish processing? (I apologize for the inarticulate way I am posing this question!)

Thanks in advance.

David N. Beede
Economist
Office of Policy Development
Economics and Statistics Administration
U.S. Department of Commerce
Room 4858 HCHB
14th Street and Pennsylvania Avenue, N.W.
Washington, DC 20230
Voice: 202.482.1226
Fax: 202.482.0325
e-mail: david.beede at mail.doc.gov

#Here is the program
for(i in 1:4) gc()
load("alldata4.Rdata")
assign("wintemp4", subset(alldata4, 1 <= group & group <= 50 & winner==1))
rm(alldata4)
for(i in 1:4) gc()
library(modreg)
attach(wintemp4)
myppr1 <- function(x) {
  #run pprfile once to get list of sum of squared errors corresponding to
  #different numbers of terms
  pprfile.ppr <- ppr(
    award~
      ilogemp+ilogage+sdb+allsmall+
      size2+size3+size4+size5+size6+size7+size8+size9+size10+
      X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
      X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
      X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
      X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
      X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
    data=x, nterms=1, max.terms= min(nrow(x),40), optlevel=3
  )
  #pick number of terms giving best fit
  numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn>0])
  #rerun ppr, pruning back to the chosen number of terms
  pprfile.ppr <- ppr(
    award~
      ilogemp+ilogage+sdb+allsmall+
      size2+size3+size4+size5+size6+size7+size8+size9+size10+
      X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
      X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
      X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
      X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
      X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
    data=x, nterms=numterm, max.terms= min(nrow(x),40), optlevel=3
  )
  cat("group =", x$group[1],"\n")
  cat("NAIC =", x$naic4[1],"\n")
  cat("cendiv =", as.character(x$cendiv[1]),"\n")
  cat("number of obs used =", nrow(x),"\n")
  print(summary(pprfile.ppr))
}
grouparr <- levels(as.factor(wintemp4$group))
#preallocate one list element per group
pprest <- vector(mode="list",length=length(grouparr))
for(i in seq(along=grouparr)) {
  subi <- subset(wintemp4,wintemp4$group==grouparr[i])
  #skip groups with 40 or fewer obs
  if(nrow(subi) > 40)
    pprest[[i]] <- myppr1(subi)
  rm(subi)
  print(gc())
}
detach(wintemp4)

2. How can one prevent "for loop" output data frame growth? On p. 178 of "S Programming" by VR, there is a suggestion that it is more efficient to create an object at least the size of the ultimate output object, in order to avoid generating copies of the object at each iteration of a for loop. This seems easy enough for a vector, as illustrated by VR. However, it is not obvious to me how to do this for the data frame I wish to

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
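The pattern described in the program (pick the number of ppr terms from gofn, refit, and store one result per group in a preallocated list) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's code: the data are synthetic, and the names dat, fml, fit_best_ppr, pieces and results are made up for the example. It also assumes a current R, where ppr() is in the stats package (modreg was later merged into it). The formula is built once rather than typed twice, and unfitted model sizes are masked rather than dropped so that the position returned by which.min() is still the number of terms.

```r
# Sketch only: synthetic data stands in for wintemp4; all names are hypothetical.
# ppr() is in 'stats' in current R, which is attached by default.
set.seed(1)
n <- 300
dat <- data.frame(
  group = rep(1:3, each = n / 3),
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)
)
dat$y <- with(dat, sin(x1) + 0.5 * x2^2 + rnorm(n, sd = 0.2))

# Build the model formula once and reuse it for both fits.
fml <- y ~ x1 + x2 + x3

fit_best_ppr <- function(d, max.terms = 5) {
  # First pass: start from max.terms terms and prune back to 1; gofn then
  # holds the residual sum of squares for each model size.
  first <- ppr(fml, data = d, nterms = 1, max.terms = max.terms, optlevel = 3)
  gof <- first$gofn
  # Sizes that were not fitted are reported as zero; mask them instead of
  # dropping them so that positions keep meaning "number of terms".
  gof[gof <= 0] <- Inf
  best <- which.min(gof)
  # Second pass: refit, pruning back to the chosen number of terms.
  fit <- ppr(fml, data = d, nterms = best, max.terms = max.terms, optlevel = 3)
  list(nterms = best, fit = fit)
}

# One data frame per group, and a result list allocated at its final length.
pieces <- split(dat, dat$group)
results <- vector("list", length(pieces))
names(results) <- names(pieces)
for (i in seq_along(pieces)) {
  # Skip small groups, as the original loop does.
  if (nrow(pieces[[i]]) > 40) results[[i]] <- fit_best_ppr(pieces[[i]])
}
```

Each fitted group can then be inspected with, for example, summary(results[["1"]]$fit). Because the list is created at its final length and filled with results[[i]], it is not regrown on each iteration.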
>>>>> "db" == david beede <david.beede at mail.doc.gov> writes:

    db> B. I say "seems" because after running the program for an
    db> hour, I type ctl-G to quit. The *R* session seemed to be
    db> terminated, with about 40 or so groups processed, so I opened
    db> up another R session to try to see what had happened. After I
    db> quit the second session, suddenly the first session seemed to
    db> come back to life and spit out the printed output for the rest
    db> of the groups! So I wonder if there is something I need to
    db> add to my program to "force" it to finish processing? (I
    db> apologize for the inarticulate way I am posing this question!)

This delay "might" be due to problems with Emacs.

(reason for cc'ing r-help) Is there anything comparable to "top" on unix, for windows, so that you can track process status?

best,
-tony

--
A.J. Rossini                            Rsrch. Asst. Prof. of Biostatistics
UW Biostat/Center for AIDS Research     rossini at u.washington.edu
FHCRC/SCHARP/HIV Vaccine Trials Net     rossini at scharp.org
-------- (friday is unknown) --------
FHCRC: M--W : 206-667-7025 (fax=4812)|Voicemail is pretty sketchy
CFAR:  ??   : 206-731-3647 (fax=3694)|Email is far better than phone
UW:    Th   : 206-543-1044 (fax=3286)|Change last 4 digits of phone to FAX
Dear R listers --

Thank you for the suggestions from Tony and Prof. Ripley about wintop. My understanding is that wintop will monitor CPU and memory usage by process, so one can tell quickly if an R program is still running or not. This is very useful!

You should note, however, that the MS web page claims that wintop and the other PowerTools (or PowerKernel) applications should not be installed on Win 98 machines. However, a search through Yahoo found users advising people to disregard the disclaimer on the MS website, although some of the other PowerTools did cause problems in Win98.

Also, I looked through www.zdnet.com and found a highly-rated product called TaskInfo v.2.21 that is free for a thirty day evaluation and only costs $12. It seems to work well. (Note: this is neither a plug nor an endorsement of the product.)

Prof Brian D Ripley <ripley at stats.ox.ac.uk>@stats.ox.ac.uk on 03/28/2001 01:17:36 PM

Sent by: ripley at stats.ox.ac.uk
To: <rossini at u.washington.edu>
cc: <david.beede at mail.doc.gov>, <r-help at stat.math.ethz.ch>
Subject: Re: [R] efficiency and "forcing" questions

On 28 Mar 2001, A.J. Rossini wrote:

> >>>>> "db" == david beede <david.beede at mail.doc.gov> writes:
>
> db> B. I say "seems" because after running the program for an
> db> hour, I type ctl-G to quit. The *R* session seemed to be
> db> terminated, with about 40 or so groups processed, so I opened
> db> up another R session to try to see what had happened. After I
> db> quit the second session, suddenly the first session seemed to
> db> come back to life and spit out the printed output for the rest
> db> of the groups! So I wonder if there is something I need to
> db> add to my program to "force" it to finish processing? (I
> db> apologize for the inarticulate way I am posing this question!)
>
> This delay "might" be due to problems with Emacs.
>
> (reason for cc'ing r-help) Is there anything comparable to "top" on
> unix, for windows, so that you can track process status?

Task Manager on NT/2000/XP; right-click taskbar.
wintop (a `powertoy') on 95/98/Me: download from Microsoft

> best,
> -tony

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
On Wed, 28 Mar 2001 david.beede at mail.doc.gov wrote:

> Dear R listers --
> Thank you for the suggestions from Tony and Prof. Ripley about wintop. My
> understanding is that wintop will monitor CPU and memory usage by process,
> so one can tell quickly if an R program is still running or not. This is
> very useful!
>
> You should note however that the MS web page claims that wintop and the
> other PowerTools (or PowerKernel) applications should not be installed on
> Win 98 machines. However, a search through Yahoo found users advising
> people to disregard the disclaimer on the MS website, although some of the
> other PowerTools did cause problems in Win98.

Well, I have used it many times on Win98 machines, both 98 (4.10) and 98SE (4.10a), so I would just ignore the warning.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
Thank you for the clarification about wintop, Prof. Ripley.

Now that I have run my program on the full data set with 67,000 observations and 382 groups (of which 229 were skipped because they had 40 or fewer obs), I wanted to pose again my questions about the efficiency of my program. According to my task monitor software, it took 10 hours of CPU time to run on my Win 98 machine with 256 mb of RAM (R v. 1.2.2; EMACS v20.7, ESS v. 5.1.18). At least for the first two hours of operation it ran without using the swapfile, although at some point afterwards it did start using it, according to the task monitor.

Interestingly, the 10 hours of CPU time was split as follows: 6 hours for EMACS.EXE and 4 hours for RTERM.EXE. Does this necessarily mean that if I source()'d my program directly into RTERM I could save a lot of time? (I just want to note here that I have found the EMACS/ESS combination *extremely* helpful for developing my code; it would be nice if it were indeed the case that after development one could switch over to solo R to do the big jobs.)

Also -- my theory that applying gc() multiple times would free up memory did not seem to pan out. I apologize for my rash speculation.

Thanks in advance.

David N. Beede
Economist
Office of Policy Development
Economics and Statistics Administration
U.S. Department of Commerce
Room 4858 HCHB
14th Street and Pennsylvania Avenue, N.W.
Washington, DC 20230
Voice: 202.482.1226
Fax: 202.482.0325
e-mail: david.beede at mail.doc.gov

The program below does the following tasks:

1. It creates a file (wintemp4) that is a subset of alldata4 consisting of "winner" records;
2. It defines a function (myppr1) that runs the ppr function in modreg once to generate goodness of fit (sum of squared errors) measures by number of terms included in the model, and then reruns ppr using the number of terms with the lowest sum of squared errors;
3. It grinds through a loop, subsetting wintemp4 by group and running myppr1 for each group subset; and
4. It puts the ppr output into a separate vector element for each group (in an attempt to avoid "growing" the vector).

#Here is the program
for(i in 1:4) gc()
load("alldata4.Rdata")
assign("wintemp4", subset(alldata4, winner==1))
rm(alldata4)
for(i in 1:4) gc()
library(modreg)
attach(wintemp4)
myppr1 <- function(x) {
  #run pprfile once to get list of sum of squared errors corresponding to
  #different numbers of terms
  pprfile.ppr <- ppr(
    award~
      ilogemp+ilogage+sdb+allsmall+
      size2+size3+size4+size5+size6+size7+size8+size9+size10+
      X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
      X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
      X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
      X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
      X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
    data=x, nterms=1, max.terms= min(nrow(x),40), optlevel=3
  )
  #pick number of terms giving best fit
  numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn>0])
  #rerun ppr, pruning back to the chosen number of terms
  pprfile.ppr <- ppr(
    award~
      ilogemp+ilogage+sdb+allsmall+
      size2+size3+size4+size5+size6+size7+size8+size9+size10+
      X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
      X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
      X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
      X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
      X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
    data=x, nterms=numterm, max.terms= min(nrow(x),40), optlevel=3
  )
  cat("group =", x$group[1],"\n")
  cat("NAIC =", x$naic4[1],"\n")
  cat("cendiv =", as.character(x$cendiv[1]),"\n")
  cat("number of obs used =", nrow(x),"\n")
  print(summary(pprfile.ppr))
}
grouparr <- levels(as.factor(wintemp4$group))
#preallocate one list element per group
pprest <- vector(mode="list",length=length(grouparr))
for(i in seq(along=grouparr)) {
  subi <- subset(wintemp4,wintemp4$group==grouparr[i])
  #skip groups with 40 or fewer obs
  if(nrow(subi) > 40)
    pprest[[i]] <- myppr1(subi)
  rm(subi)
  print(gc())
}
detach(wintemp4)

Prof Brian D Ripley <ripley at stats.ox.ac.uk>@auk.stats> on 03/29/2001 12:04:13 AM

Sent by: <ripley at auk.stats>
To: <david.beede at mail.doc.gov>
cc: <r-help at stat.math.ethz.ch>
Subject: Re: [R] efficiency and "forcing" questions

On Wed, 28 Mar 2001 david.beede at mail.doc.gov wrote:

> Dear R listers --
> Thank you for the suggestions from Tony and Prof. Ripley about wintop. My
> understanding is that wintop will monitor CPU and memory usage by process,
> so one can tell quickly if an R program is still running or not. This is
> very useful!
>
> You should note however that the MS web page claims that wintop and the
> other PowerTools (or PowerKernel) applications should not be installed on
> Win 98 machines. However, a search through Yahoo found users advising
> people to disregard the disclaimer on the MS website, although some of the
> other PowerTools did cause problems in Win98.

Well, I have used it many times on Win98 machines, both 98 (4.10) and 98SE (4.10a), so I would just ignore the warning.
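One hedged way to explore the EMACS.EXE versus RTERM.EXE split is to time the loop while sending its printed output to a file, so that the front end does not have to display it. The sketch below only illustrates sink() and system.time() from base R. It assumes, without evidence from this thread, that handling the printed output is what costs Emacs its CPU time. slow_step() and log_file are made-up stand-ins for one group's fit and for a log location.

```r
# Sketch only: slow_step() is a hypothetical stand-in for one call to myppr1().
slow_step <- function(i) {
  x <- rnorm(1e4)
  cat("group =", i, "\n")
  print(summary(x))
  invisible(mean(x))
}

log_file <- tempfile(fileext = ".txt")  # hypothetical log location
sink(log_file)                          # divert printed output to the file
timing <- system.time(
  for (i in 1:5) slow_step(i)
)
sink()                                  # restore output to the console

print(timing)                           # user, system and elapsed seconds
n_lines <- length(readLines(log_file))  # the printed output ended up in the file
```

Comparing the timing with and without the sink() calls, or running the same script outside Emacs, would show whether the console is the bottleneck on a given machine.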