thr3ads.net - R help - [R] repeating an analysis [Oct 2010]

Andrew Halford

2010-Oct-12 23:50 UTC

[R] repeating an analysis

Hi All,

I have to say upfront that I am a complete neophyte when it comes to
programming. Nevertheless I enjoy the challenge of using R because of its
incredible statistical resources.

My problem is this: I am running a regression tree analysis using
"rpart" and I need to run the calculation repeatedly (say n=50 times) to
obtain a distribution of results from which I will pick the median one to
represent the most parsimonious tree size. Unfortunately rpart does not
contain this ability so it will have to be coded for.

Could anyone help me with this? I have provided the code (and relevant
output) for the analysis I am running. I need to run it n=50 times and from
each output pick the appropriate tree size and post it to a datafile where I
can then look at the frequency distribution of tree sizes.

Here is the code and output from a single run

> fit1 <- rpart(CHAB~.,data=chabun, method="anova",
    control=rpart.control(minsplit=10, cp=0.01, xval=10))
> printcp(fit1)

Regression tree:
rpart(formula = CHAB ~ ., data = chabun, method = "anova",
    control = rpart.control(minsplit = 10, cp = 0.01, xval = 10))
Variables actually used in tree construction:
[1] EXP LAT POC RUG
Root node error: 35904/33 = 1088
n= 33
        CP nsplit rel error xerror    xstd
1 0.539806      0   1.00000 1.0337 0.41238
2 0.050516      1   0.46019 1.2149 0.38787
3 0.016788      2   0.40968 1.2719 0.41280
4 0.010221      3   0.39289 1.1852 0.38300
5 0.010000      4   0.38267 1.1740 0.38333

Each time I re-run the model I will get a slightly different output. I want
to extract the nsplit number corresponding to the lowest xerror for each run
of the model (in this case it is for nsplit = 0) over 50 runs and then look
at the distribution of nsplits after 50 runs.

Any help appreciated.


Andy


-- 
Andrew Halford
Associate Researcher
Marine Laboratory
University of Guam
Ph: +1 671 734 2948


Phil Spector

2010-Oct-13 00:30 UTC


Andrew -
    I think

answer = replicate(50,{fit1 <- rpart(CHAB~.,data=chabun, method="anova",
                                     control=rpart.control(minsplit=10,
                                             cp=0.01, xval=10));
                       x = printcp(fit1);
                       x[which.min(x[,'xerror']),'nsplit']})

will put the numbers you want into answer, but there was no reproducible
example to test it on.  Unfortunately, I don't know of any way to
suppress the printing from printcp().
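
One possible way around the printing, offered as a sketch rather than
something tested on the original data: the table that printcp() displays
is also stored on the fitted object as its cptable component (see
?rpart.object), so it can be read without printing anything. The chabun
data were not posted, so the built-in mtcars data set and the response
mpg stand in for them here; both are assumptions made only for
illustration.

```r
library(rpart)

# mtcars and mpg are stand-ins for the poster's chabun data and CHAB response
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10, cp = 0.01, xval = 10))

# The complexity table shown by printcp() lives on the fitted object
cp_table <- fit$cptable

# Row with the smallest cross-validated error, and its number of splits
best_row <- which.min(cp_table[, "xerror"])
best_nsplit <- cp_table[best_row, "nsplit"]
```

Because nothing is printed, this version can be dropped into replicate()
or a loop without filling the console.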

 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu





Peter Langfelder

2010-Oct-13 00:32 UTC


I think you want something like this:

optimal.nSplit = rep(NA, 50) # This will hold the result
for (run in 1:50)
{
  fit1 = rpart(...)
  cpTable = fit1$cptable
  bestRow = which.min(cpTable[, "xerror"]);
  optimal.nSplit[run] = cpTable[bestRow, "nsplit"]
}


In any case, look at
?rpart
?printcp
?rpart.object
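
For the last step of the original question (looking at the distribution
of nsplit over the 50 runs and picking the median), a self-contained
sketch might look like the following. It is not run on the original
data: mtcars and mpg stand in for chabun and CHAB, and the names
best_nsplit, n_runs and nsplits are made up for this illustration.

```r
library(rpart)

set.seed(1)   # only so that this illustration gives the same runs each time

n_runs <- 50

# Fit one tree and return the nsplit with the lowest cross-validated error
# (hypothetical helper; mpg ~ . stands in for CHAB ~ .)
best_nsplit <- function(data) {
  fit <- rpart(mpg ~ ., data = data, method = "anova",
               control = rpart.control(minsplit = 10, cp = 0.01, xval = 10))
  cp_table <- fit$cptable
  unname(cp_table[which.min(cp_table[, "xerror"]), "nsplit"])
}

# One selected tree size per run
nsplits <- vapply(seq_len(n_runs), function(i) best_nsplit(mtcars), numeric(1))

table(nsplits)    # frequency distribution of the selected tree sizes
median(nsplits)   # median tree size across the runs
```

The vector nsplits can also be saved with write.csv() if a data file is
wanted, and hist(nsplits) or barplot(table(nsplits)) gives a quick plot
of the distribution.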

Peter


