Dear R listers --
The program below does the following tasks:
1. It creates a file (wintemp4) that is a subset of alldata4 consisting of
"winner" records in 50 industry groups (about 5400 obs);
2. It defines a function (myppr1) that runs the ppr function in modreg
once to generate goodness of fit (sum of squared errors) measures by number
of terms included in model and then reruns ppr using the number of terms
with the lowest sum of squared errors.
3. It grinds through a loop, subsetting wintemp4 by group and running
myppr1 for each group subset; and
4. It puts the ppr output into a separate vector element for each group
(in an attempt to avoid "growing" the vector).
I am using R version 1.2.2 in Emacs/ESS on Win98 with 256mb RAM.
I have two questions; I would be most grateful for any help the list can
provide:
A. This program *seems* to take a long time. I have been careful to free
as much memory as I can, and the gc()'s seem to help avoid using the
swapfile and to keep available system resources above 90%. Is there
anything else I can do to make the program more efficient?
B. I say "seems" because after running the program for an hour, I typed
ctl-G to quit. The *R* session seemed to be terminated, with about 40 or
so groups processed, so I opened up another R session to try to see what
had happened. After I quit the second session, suddenly the first session
seemed to come back to life and spit out the printed output for the rest of
the groups! So I wonder if there is something I need to add to my program
to "force" it to finish processing? (I apologize for the inarticulate way
I am posing this question!)
Thanks in advance.
David N. Beede
Economist
Office of Policy Development
Economics and Statistics Administration
U.S. Department of Commerce
Room 4858 HCHB
14th Street and Pennsylvania Avenue, N.W.
Washington, DC 20230
Voice: 202.482.1226
Fax: 202.482.0325
e-mail: david.beede at mail.doc.gov
#Here is the program
for(i in 1:4) gc()
load("alldata4.Rdata")
assign("wintemp4", subset(alldata4, 1 <= group & group <= 50
& winner==1))
rm(alldata4)
for(i in 1:4) gc()
library(modreg)
attach(wintemp4)
myppr1 <- function(x)
{
#run pprfile once to get list of sum of squared errors corresponding to
#different numbers of terms
pprfile.ppr <- ppr(
award~
ilogemp+ilogage+sdb+allsmall+
size2+size3+size4+size5+size6+size7+size8+size9+size10+
X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
data=x, nterms=1, max.terms= min(nrow(x),40), optlevel=3
)
#pick number of terms giving best fit
numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn>0])
pprfile.ppr <- ppr(
award~
ilogemp+ilogage+sdb+allsmall+
size2+size3+size4+size5+size6+size7+size8+size9+size10+
X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
data=x, nterms=numterm, max.terms= min(nrow(x),40), optlevel=3
)
cat("group =", x$group[1],"\n")
cat("NAIC =", x$naic4[1],"\n")
cat("cendiv =", as.character(x$cendiv[1]),"\n")
cat("number of obs used =", nrow(x),"\n")
print(summary(pprfile.ppr))
}
grouparr <- levels(as.factor(wintemp4$group))
pprest <- vector(mode="list",length=length(grouparr))
for(i in seq(along=grouparr))
{
subi <- subset(wintemp4,wintemp4$group==grouparr[i])
if(nrow(subi) > 40) pprest[[i]] <- myppr1(subi)
rm(subi)
print(gc())
}
detach(wintemp4)
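
One indexing detail in myppr1 may be worth checking. which.min() applied to a filtered vector returns a position within the filtered vector, not within the original one, so if gofn ever held leading zeros the selected numterm would come out too small. With nterms=1 this probably does not arise. The sketch below uses a made-up gofn vector (an assumption for illustration, not output from ppr) only to show the difference:

```r
# Hypothetical goodness-of-fit values; the zeros stand for unfitted term counts.
gofn <- c(0, 0, 5.2, 3.1, 4.0)

# Position within the filtered vector (5.2, 3.1, 4.0).
pos_in_subset <- which.min(gofn[gofn > 0])

# Position within the original vector, i.e. the actual number of terms.
positive <- which(gofn > 0)
numterm <- positive[which.min(gofn[positive])]

print(c(pos_in_subset = pos_in_subset, numterm = numterm))
```

Here the two values differ (2 versus 4), so mapping back through which() is the safer habit whenever the filter can drop leading elements.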
2. How can one prevent "for loop" output data frame growth?
On p. 178 of "S Programming" by VR, there is a suggestion that it is more
efficient to create an object at least the size of the ultimate output
object, in order to avoid generating copies of the object at each iteration
of a for loop. This seems easy enough for a vector, as illustrated by VR.
However, it is not obvious to me how to do this for the data frame I wish to
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at
stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
>>>>> "db" == david beede <david.beede at mail.doc.gov> writes:

db> B. I say "seems" because after running the program for an
db> hour, I type ctl-G to quit. The *R* session seemed to be
db> terminated, with about 40 or so groups processed, so I opened
db> up another R session to try to see what had happened. After I
db> quit the second session, suddenly the first session seemed to
db> come back to life and spit out the printed output for the rest
db> of the groups! So I wonder if there is something I need to
db> add to my program to "force" it to finish processing? (I
db> apologize for the inarticulate way I am posing this question!)

This delay "might" be due to problems with Emacs.

(reason for cc'ing r-help) Is there anything comparable to "top" on
unix, for windows, so that you can track process status?

best,
-tony

--
A.J. Rossini Rsrch. Asst. Prof. of Biostatistics
UW Biostat/Center for AIDS Research rossini at u.washington.edu
FHCRC/SCHARP/HIV Vaccine Trials Net rossini at scharp.org
-------- (friday is unknown) --------
FHCRC: M--W : 206-667-7025 (fax=4812)|Voicemail is pretty sketchy
CFAR: ?? : 206-731-3647 (fax=3694)|Email is far better than phone
UW: Th : 206-543-1044 (fax=3286)|Change last 4 digits of phone to FAX
Dear R listers --
Thank you for the suggestions from Tony and Prof. Ripley about wintop. My
understanding is that wintop will monitor CPU and memory usage by process,
so one can tell quickly if an R program is still running or not. This is
very useful!

You should note however that the MS web page claims that wintop and the
other PowerTools (or PowerKernel) applications should not be installed on
Win 98 machines. However, a search through Yahoo found users advising
people to disregard the disclaimer on the MS website, although some of the
other PowerTools did cause problems in Win98.

Also, I looked through www.zdnet.com and found a highly-rated product called
TaskInfo v.2.21 that is free for a thirty day evaluation and only costs $12.
It seems to work well. (Note: this is neither a plug nor an endorsement of
the product.)

Prof Brian D Ripley <ripley at stats.ox.ac.uk>@stats.ox.ac.uk on 03/28/2001
01:17:36 PM
Sent by: ripley at stats.ox.ac.uk
To: <rossini at u.washington.edu>
cc: <david.beede at mail.doc.gov>, <r-help at stat.math.ethz.ch>
Subject: Re: [R] efficiency and "forcing" questions

On 28 Mar 2001, A.J. Rossini wrote:
> >>>>> "db" == david beede <david.beede at mail.doc.gov> writes:
>
> db> B. I say "seems" because after running the program for an
> db> hour, I type ctl-G to quit. The *R* session seemed to be
> db> terminated, with about 40 or so groups processed, so I opened
> db> up another R session to try to see what had happened. After I
> db> quit the second session, suddenly the first session seemed to
> db> come back to life and spit out the printed output for the rest
> db> of the groups! So I wonder if there is something I need to
> db> add to my program to "force" it to finish processing? (I
> db> apologize for the inarticulate way I am posing this question!)
>
> This delay "might" be due to problems with Emacs.
>
> (reason for cc'ing r-help) Is there anything comparable to "top" on
> unix, for windows, so that you can track process status?

Task Manager on NT/2000/XP; right-click taskbar.
wintop (a `powertoy') on 95/98/Me: download from Microsoft

> best,
> -tony

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
On Wed, 28 Mar 2001 david.beede at mail.doc.gov wrote:
>
> Dear R listers --
> Thank you for the suggestions from Tony and Prof. Ripley about wintop. My
> understanding is that wintop will monitor CPU and memory usage by process,
> so one can tell quickly if an R program is still running or not. This is
> very useful!
>
> You should note however that the MS web page claims that wintop and the
> other PowerTools (or PowerKernel) applications should not be installed on
> Win 98 machines. However, a search through Yahoo found users advising
> people to disregard the disclaimer on the MS website, although some of the
> other PowerTools did cause problems in Win98.

Well, I have used it many times on Win98 machines, both 98 (4.10) and 98SE
(4.10a) so I would just ignore the warning.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
Thank you for the clarification about wintop, Prof. Ripley.
Now that I have run my program on the full data set with 67,000
observations and 382 groups (of which 229 were skipped because they had 40
or fewer obs), I wanted to pose again my questions about the efficiency of
my program. According to my task monitor software, it took 10 hours of CPU
time to run on my Win 98 machine with 256 mb of RAM (R v. 1.2.2; EMACS
v20.7, ESS v. 5.1.18). At least for the first two hours of operation it
ran without using the swapfile, although at some point afterwards it did
start using it, according to the task monitor.
Interestingly, the 10 hours of CPU time was split as follows: 6 hours for
EMACS.EXE and 4 hours for RTERM.EXE. Does this necessarily mean that if I
source()'d my program directly into RTERM I could save a lot of time?
(I just want to note here that I have found the EMACS/ESS combination
*extremely* helpful for developing my code; it would be nice if it were
indeed the case that after development one could switch over to solo R to
do the big jobs.)
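
One way to see how much time R itself spends computing, separately from the Emacs process, might be to wrap the work in system.time(). The sketch below is only illustrative: slow_step is a hypothetical stand-in for one call to myppr1, and the figures it reports will differ from machine to machine.

```r
# Hypothetical stand-in for one expensive per-group fit.
slow_step <- function(n) {
  x <- rnorm(n)
  sum(sort(x))
}

# system.time() reports user, system and elapsed seconds for the expression.
timing <- system.time(
  results <- lapply(rep(20000, 5), slow_step)
)
print(timing)

# A large gap between elapsed time and user + system time suggests the
# process was waiting (for example on output or swapping), not computing.
waiting <- timing[["elapsed"]] - (timing[["user.self"]] + timing[["sys.self"]])
```

Running the same timed block once under ESS and once in a plain R session would give a like-for-like comparison of the R side alone.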
Also -- my theory that applying gc() multiple times would free up memory
did not seem to pan out. I apologize about my rash speculation.
Thanks in advance.
David N. Beede
Economist
Office of Policy Development
Economics and Statistics Administration
U.S. Department of Commerce
Room 4858 HCHB
14th Street and Pennsylvania Avenue, N.W.
Washington, DC 20230
Voice: 202.482.1226
Fax: 202.482.0325
e-mail: david.beede at mail.doc.gov
The program below does the following tasks:
1. It creates a file (wintemp4) that is a subset of alldata4 consisting of
"winner" records;
2. It defines a function (myppr1) that runs the ppr function in modreg
once to generate goodness of fit (sum of squared errors) measures by number
of terms included in model and then reruns ppr using the number of terms
with the lowest sum of squared errors.
3. It grinds through a loop, subsetting wintemp4 by group and running
myppr1 for each group subset; and
4. It puts the ppr output into a separate vector element for each group
(in an attempt to avoid "growing" the vector).
#Here is the program
for(i in 1:4) gc()
load("alldata4.Rdata")
assign("wintemp4", subset(alldata4, winner==1))
rm(alldata4)
for(i in 1:4) gc()
library(modreg)
attach(wintemp4)
myppr1 <- function(x)
{
#run pprfile once to get list of sum of squared errors corresponding to
#different numbers of terms
pprfile.ppr <- ppr(
award~
ilogemp+ilogage+sdb+allsmall+
size2+size3+size4+size5+size6+size7+size8+size9+size10+
X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
data=x, nterms=1, max.terms= min(nrow(x),40), optlevel=3
)
#pick number of terms giving best fit
numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn>0])
pprfile.ppr <- ppr(
award~
ilogemp+ilogage+sdb+allsmall+
size2+size3+size4+size5+size6+size7+size8+size9+size10+
X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
data=x, nterms=numterm, max.terms= min(nrow(x),40), optlevel=3
)
cat("group =", x$group[1],"\n")
cat("NAIC =", x$naic4[1],"\n")
cat("cendiv =", as.character(x$cendiv[1]),"\n")
cat("number of obs used =", nrow(x),"\n")
print(summary(pprfile.ppr))
}
grouparr <- levels(as.factor(wintemp4$group))
pprest <- vector(mode="list",length=length(grouparr))
for(i in seq(along=grouparr))
{
subi <- subset(wintemp4,wintemp4$group==grouparr[i])
if(nrow(subi) > 40) pprest[[i]] <- myppr1(subi)
rm(subi)
print(gc())
}
detach(wintemp4)
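
As a possible restructuring of the loop above, the data can be split by group once and the fitting function applied to each piece, with the formula written a single time. This is a sketch under stated assumptions, not a tested replacement: the data frame dat, its columns x1 and x2, the helper fit_group, and the small max_terms value are all made up for illustration, and ppr is called from the stats package, where it lives in current versions of R.

```r
# Synthetic stand-in for wintemp4; column names and sizes are hypothetical.
set.seed(1)
n <- 260
dat <- data.frame(
  group = rep(c(1, 2, 3), times = c(120, 120, 20)),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
dat$award <- sin(dat$x1) + 0.5 * dat$x2 + rnorm(n, sd = 0.1)

# Write the formula once instead of repeating it in both ppr calls.
fml <- award ~ x1 + x2

fit_group <- function(x, max_terms = 4) {
  # Same size cutoff as the loop above: skip groups with 40 or fewer rows.
  if (nrow(x) <= 40) return(NULL)
  first <- ppr(fml, data = x, nterms = 1, max.terms = max_terms)
  # Map the minimum back to a term count in the unfiltered gofn vector.
  positive <- which(first$gofn > 0)
  best <- positive[which.min(first$gofn[positive])]
  ppr(fml, data = x, nterms = best, max.terms = max_terms)
}

# split() subsets the data once; lapply() returns a list of the right length,
# so nothing is grown inside a loop.
fits <- lapply(split(dat, dat$group), fit_group)
```

The result is a list named by group, with NULL entries for groups that were too small, so attach() and the explicit rm()/gc() calls are not needed in this form.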
Prof Brian D Ripley <ripley at stats.ox.ac.uk>@auk.stats> on 03/29/2001 12:04:13 AM
Sent by: <ripley at auk.stats>
To: <david.beede at mail.doc.gov>
cc: <r-help at stat.math.ethz.ch>
Subject: Re: [R] efficiency and "forcing" questions
On Wed, 28 Mar 2001 david.beede at mail.doc.gov wrote:
>
> Dear R listers --
> Thank you for the suggestions from Tony and Prof. Ripley about wintop. My
> understanding is that wintop will monitor CPU and memory usage by process,
> so one can tell quickly if an R program is still running or not. This is
> very useful!
>
> You should note however that the MS web page claims that wintop and the
> other PowerTools (or PowerKernel) applications should not be installed on
> Win 98 machines. However, a search through Yahoo found users advising
> people to disregard the disclaimer on the MS website, although some of the
> other PowerTools did cause problems in Win98.
Well, I have used it many times on Win98 machines, both 98 (4.10) and 98SE
(4.10a) so I would just ignore the warning.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595