Dear Terry,
Thanks for pointing me to your vignette. I can't say that I fully
understand it, however. I'd be grateful for your help with a couple of
specific questions:
* In my application, each potential split will be a function of the
outcome (y), as well as two other variables (p1 and p2). These will
need to be passed down the tree, as my data is the set of {y, p1,
p2, X}. How would p1 and p2 be passed to the three functions that
you describe (init, eval, and split)?
* What does the initialize function do exactly? I see that it takes
as arguments (y), along with offsets, weights, and "parms", and
outputs y again, along with numresp, numy, and a string. I guess
I'd understand what this function was doing if I understood what
contexts it was called in. Why is it needed? What process needs
"numresp" and "numy", and why does it need them? And where is
"sfun" getting its information from? And what gets done with the
string that it outputs?
* The evaluation function makes more sense to me at present. It takes
y, along with weights and parameters, and gives back the deviance,
which is how different potential splits are evaluated. The label is
passed along to something (what is it passed to?) so that the mean
of y in that partition can be known.
o If I wanted to define a different evaluation function, rss = f(y,
p1, p2), using only data from within a particular node,
would p1 and p2 be components of the "parms" object? Is "parms"
a list?
* The splitting function makes the most sense of all to me -- its
arguments are the data (presumably only within-node data?) and
"parms", which would contain my p1 and p2 parameters (I still want
to know whether "parms" is a list or what), and it outputs a
goodness-of-fit measure and a direction.
o What is the point of the direction? Is not a split <x the same
as a split >x?
o If each splitting function evaluates goodness, what is the
deviance in the eval function doing?
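
The questions above can be made concrete with a minimal sketch of the three functions, modelled on the anova example in the rpart vignette. Several things here are assumptions rather than anything stated in the thread: the names init_fn, eval_fn, split_fn and adjust, the coefficients b1 and b2, and the criterion itself (anova on z = y - b1*p1 - b2*p2) are hypothetical. Passing p1 and p2 as extra columns of the response via cbind() is one possible approach, relying on rpart handing a multi-column y to the user functions when numy > 1.

```r
library(rpart)

# Hypothetical criterion: anova on z = y - b1*p1 - b2*p2, where b1 and b2
# arrive through parms (a plain R list) and p1, p2 arrive as columns of y.
adjust <- function(y, parms) y[, 1] - parms$b1 * y[, 2] - parms$b2 * y[, 3]

# init: called once, before any splitting, to check the response and to
# tell rpart its shape. numy = columns of y; numresp = length of each
# node's label; summary = function used by summary() to print a node.
init_fn <- function(y, offset, parms, wt) {
  if (!is.matrix(y) || ncol(y) != 3) stop("response must be cbind(y, p1, p2)")
  if (length(offset)) y[, 1] <- y[, 1] - offset
  sfun <- function(yval, dev, wt, ylevel, digits) {
    paste0("  estimate=", format(signif(yval, digits)),
           ", deviance=", format(signif(dev, digits)))
  }
  environment(sfun) <- .GlobalEnv
  list(y = y, parms = parms, numresp = 1, numy = 3, summary = sfun)
}

# eval: called per node with that node's rows only. The label becomes the
# node's fitted value; the deviance is the node's impurity, used for pruning.
eval_fn <- function(y, wt, parms) {
  z <- adjust(y, parms)
  wmean <- sum(z * wt) / sum(wt)
  list(label = wmean, deviance = sum(wt * (z - wmean)^2))
}

# split: called per node and per predictor, rows already sorted by x.
# Returns one goodness value per candidate cutpoint (larger is better)
# and a direction of -1 or +1 that only decides which child goes left.
split_fn <- function(y, wt, x, parms, continuous) {
  if (!continuous) stop("categorical predictors are not handled in this sketch")
  n <- nrow(y)
  z <- adjust(y, parms)
  z <- z - sum(z * wt) / sum(wt)          # centre the adjusted response
  csum <- cumsum(z * wt)[-n]
  left_wt <- cumsum(wt)[-n]
  right_wt <- sum(wt) - left_wt
  lmean <- csum / left_wt
  rmean <- -csum / right_wt
  list(goodness = (left_wt * lmean^2 + right_wt * rmean^2) / sum(wt * z^2),
       direction = sign(lmean))
}

# Simulated data, purely for illustration
set.seed(42)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n), p1 = rnorm(n), p2 = rnorm(n))
d$y <- 2 * (d$x1 > 0.5) + 0.5 * d$p1 - 0.3 * d$p2 + rnorm(n, sd = 0.1)

fit <- rpart(cbind(y, p1, p2) ~ x1 + x2, data = d,
             method = list(init = init_fn, eval = eval_fn, split = split_fn),
             parms = list(b1 = 0.5, b2 = -0.3),
             control = rpart.control(xval = 0))
```

In this sketch "parms" carries only constants, because it is handed to every call unchanged, whereas the rows of y are subset to the node being evaluated. That is why per-observation quantities such as p1 and p2 travel with the response rather than inside "parms".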
I appreciate your patience with my questions. I'm sure that some of
them will become painfully obvious once I'm able to share your perspective.
Best,
Andrew
On 11/12/2015 07:44 AM, Therneau, Terry M., Ph.D. wrote:
> Look at the rpart vignette "User written split functions". The code
> allows you to add your own splitting method to the code (in R, no C
> required). This has proven to be very useful for trying out new ideas.
>
> The second piece would be to do your own cross-validation. That is,
> turn off the built in cross-validation using the xval=0 option, then
> explicitly do the cross-validation yourself. Fit a new tree to some
> chosen subset of data, using your split rule of course, and then use
> predict() to get predicted values for the remaining observations.
> Again, this is all in R, and you can explicitly control your in or out
> of bag subsets.
> The xpred.rpart function may be useful to automate some of the steps.
>
> If you look up rpart on CRAN, you will see a link to the package
> source. If you were to read the C source code you will discover that
> 95% is boring bookkeeping of what observations are in what part(s) of
> the tree, sorting the data, tracking missing values, etc. If you ever
> do want to write your own code you are more than welcome to build off
> this --- I wouldn't want to write that part again.
>
> Terry Therneau
>
> On 11/12/2015 05:00 AM, r-help-request at r-project.org wrote:
>> Dear List,
>>
>> I'd like to make a few modifications to the typical CART algorithm, and
>> I'd rather not code the whole thing from scratch. Specifically I want
>> to use different in-sample and out-of-sample fit criteria in the split
>> choosing and cross-validation stages.
>>
>> I see however that the code for CART in both the rpart and the tree
>> packages is written in C.
>>
>> Two questions:
>>
>> * Where is the C code? It might be possible to get a C-fluent
>> programmer to help me with this.
>> * Is there any code for CART that is written entirely in R?
>>
>> Thanks,
>> Andrew
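
The do-it-yourself cross-validation described in Terry's reply could look roughly like the sketch below. It is only an illustration under stated assumptions: the data are simulated, method = "anova" stands in for a user-written split rule, and oos_loss (mean absolute error) is a hypothetical out-of-sample criterion chosen just to differ from the in-sample one.

```r
library(rpart)

# Simulated data, purely for illustration
set.seed(1)
n <- 150
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- ifelse(d$x1 > 0.5, 2, 0) + rnorm(n, sd = 0.3)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row
cv_pred <- numeric(n)

for (i in seq_len(k)) {
  held_out <- fold == i
  # xval = 0 turns off rpart's built-in cross-validation
  fit_i <- rpart(y ~ x1 + x2, data = d[!held_out, ], method = "anova",
                 control = rpart.control(xval = 0))
  # Predict only the rows that were left out of this fit
  cv_pred[held_out] <- predict(fit_i, newdata = d[held_out, ])
}

# Hypothetical out-of-sample criterion; any function of observed and
# predicted values could be swapped in here.
oos_loss <- function(obs, pred) mean(abs(obs - pred))
cv_error <- oos_loss(d$y, cv_pred)
```

Because the fold assignment is explicit, the same loop can be reused with any in-bag or out-of-bag scheme, and a user-written method list can replace "anova" in the rpart() call.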