Apologies for re-posting, my original message seems to have been
overlooked by the moderators.

---------- Forwarded message ----------
From: Ed <icelus2k5 at gmail.com>
Date: 11 October 2012 19:03
Subject: party for prediction
To: R-help at r-project.org

Hi there

I'm experiencing some problems using the party package (specifically
mob) for prediction. I have a real scalar y I want to predict from a
real valued vector x and an integral vector z. mob seemed the ideal
choice from the documentation.

The first problem I had was at some nodes in a partitioning tree, the
components of x may be extremely highly correlated or effectively
constant (that is x are not independent for all choices of components
of z). When the resulting fit is fed into predict() the result is NA -
this is not the same behaviour as models returned by say lm which
ignore missing coefficients. I have fixed this by defining my own
statsModel (myLinearModel - imaginative) which also ignores such
coefficients when predicting.

The second problem I have is that I get "Cholesky not positive
definite" errors at some nodes. I guess this is because of numerical
error and degeneracy in the covariance matrix? Any thoughts on how to
avoid having this happen would be welcome; it is ignorable though for
now.

The third and really big problem I have is that when I apply mob to
large datasets (say hundreds of thousands of elements) I get a
"logical subscript too long" error inside mob_fit_fluctests. It's
caught in a try(), and mob just gives up and treats the node as
terminal. This is really hurting me though; with 1% of my data I can
get a good fit and a worthwhile tree, but with the whole dataset I get
a very stunted tree with a pretty useless prediction ability.

I guess what I really want to know is:
(a) has anyone else had this problem, and if so how did they overcome it?
(b) is there any way to get a line or stack trace out of a try()
without source modification?
(c) failing all of that, does anyone know of an alternative to mob
that does the same thing; for better or worse I'm now committed to
recursive partitioning over linear models, as per mob?
(d) failing all of this, does anyone have a link to a way to rebuild,
or locally modify, an R package (preferably windows, but anything
would do)?

Sorry for the length of this post. If I should RTFM, please point me
at any relevant manual by all means. I've spent a few days on this as
you can maybe tell, but I'm far from being an R expert.

Thanks for any help you can give.

Best wishes,

Ed
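The first problem described above can be reproduced in base R without party. The following is only a minimal sketch under stated assumptions: the data are simulated, and the helper name `predict_ignoring_na` is hypothetical (it is not part of party or modeltools, and it is not the poster's myLinearModel). It shows that `lm()` reports an unidentified coefficient as NA, that multiplying by the full coefficient vector propagates the NA, and that predicting from the identified coefficients only avoids it.

```r
# Simulated node data in which x2 is an exact multiple of x1, so the
# design matrix is rank deficient (assumed data, not the poster's).
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- 2 * x1                       # perfectly collinear with x1
y  <- 1 + 3 * x1 + rnorm(n, sd = 0.1)
node <- data.frame(y = y, x1 = x1, x2 = x2)

fit <- lm(y ~ x1 + x2, data = node)
print(coef(fit))                   # the coefficient of x2 is reported as NA

# Hypothetical helper: predict using only the identified coefficients.
predict_ignoring_na <- function(fit, newdata) {
  b  <- coef(fit)
  ok <- !is.na(b)
  X  <- model.matrix(delete.response(terms(fit)), newdata)
  drop(X[, ok, drop = FALSE] %*% b[ok])
}

new <- data.frame(x1 = c(-1, 0, 1), x2 = c(-2, 0, 2))
p_manual <- predict_ignoring_na(fit, new)

# A naive prediction that multiplies by all coefficients propagates the NA.
X_new   <- model.matrix(delete.response(terms(fit)), new)
p_naive <- drop(X_new %*% coef(fit))
```

Here `p_naive` is NA for every row, whereas `p_manual` is finite. Note that dropping a coefficient only fixes prediction; it does not address how an unidentified coefficient should be treated in mob's parameter stability tests.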
On Oct 12, 2012, at 1:37 AM, Ed wrote:

> Apologies for re-posting, my original message seems to have been
> overlooked by the moderators.

No. Your original post _was_ forwarded to the list. On my machine it
appeared at October 11, 2012 11:03:08 AM PDT. No one responded. It seems
possible that its lack of data or code is the reason for that state of
affairs.

--
David.

David Winsemius, MD
Alameda, CA, USA
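A small simulated example of the kind that makes such a post easier to answer might look like the sketch below. Everything in it is an assumption made for illustration: the variable names (x1, x2, z1, z2), the sample size, and the data-generating process are hypothetical and are not taken from the original post. The `mob()` call is only attempted if party is installed; `linearModel` is provided by the modeltools package, which is attached together with party.

```r
# Simulated data: real-valued regressors x1, x2 and integer-valued
# partitioning variables z1, z2 (all names and values are hypothetical).
set.seed(123)
n <- 2000
d <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  z1 = sample(1:5, n, replace = TRUE),
  z2 = sample(0:1, n, replace = TRUE)
)
# The linear relationship between y and x changes with z1.
d$y <- ifelse(d$z1 <= 2, 1 + 2 * d$x1, -1 + 0.5 * d$x2) + rnorm(n, sd = 0.5)

# Fit the model-based tree only if party is available.
if (requireNamespace("party", quietly = TRUE)) {
  library(party)
  fit <- mob(y ~ x1 + x2 | z1 + z2, data = d, model = linearModel,
             control = mob_control(minsplit = 100))
  print(fit)
  print(head(predict(fit, newdata = d)))
}
```

Scaling `n` up in such a script until the reported error appears would give list readers something concrete to run.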
Ed:

> I'm experiencing some problems using the party package (specifically
> mob) for prediction. I have a real scalar y I want to predict from a
> real valued vector x and an integral vector z. mob seemed the ideal
> choice from the documentation.

I'm not sure what you mean by "integral vector". If you want to apply
the approach to hundreds of thousands of observations, I guess that
these are categorical (maybe even binary?) but maybe not...

> The first problem I had was at some nodes in a partitioning tree, the
> components of x may be extremely highly correlated or effectively
> constant (that is x are not independent for all choices of components of
> z). When the resulting fit is fed into predict() the result is NA - this
> is not the same behaviour as models returned by say lm which ignore
> missing coefficients. I have fixed this by defining my own statsModel
> (myLinearModel - imaginative) which also ignores such coefficients when
> predicting.

If I recall correctly, we kept linearModel as simple as we did to save
as much time as possible. This can be particularly important when one of
the partitioning variables has many possible splits and the linearModel
has to be fitted thousands of times.

Also, mob() assesses the stability of all coefficients of the model in
all nodes during partitioning. If any of the coefficients is not
identified, it would have to be excluded from all subsequent parameter
stability tests in that node (and its child nodes). This is currently
not provided for in mob().

> The second problem I have is that I get "Cholesky not positive definite"
> errors at some nodes. I guess this is because of numerical error and
> degeneracy in the covariance matrix? Any thoughts on how to avoid having
> this happen would be welcome; it is ignorable though for now.

This comes from the parameter stability tests and might be a result of
an unidentified (or close to unidentified) model fit.

> The third and really big problem I have is that when I apply mob to
> large datasets (say hundreds of thousands of elements) I get a
> "logical subscript too long" error inside mob_fit_fluctests. It's
> caught in a try(), and mob just gives up and treats the node as
> terminal. This is really hurting me though; with 1% of my data I can
> get a good fit and a worthwhile tree, but with the whole dataset I get
> a very stunted tree with a pretty useless prediction ability.

With hundreds of thousands of observations, you would need some
additional pruning strategy anyway. Significance test-based splitting
will probably overfit because tiny differences in the coefficients will
be picked up at such large sample sizes. Furthermore, computationally
the exhaustive search over all possible splits might be too burdensome
with this many observations. Hence, using some subsampling strategy
might not be the worst thing.

> I guess what I really want to know is:
> (a) has anyone else had this problem, and if so how did they overcome it?

We have had non-identified model fits in binary GLMs (with
quasi-complete separation) where we then set estfun() to all zero so
that partitioning stops. But I don't think that such a strategy helps
here.

> (b) is there any way to get a line or stack trace out of a try()
> without source modification?

Not sure, I don't know any off the top of my head.

> (c) failing all of that, does anyone know of an alternative to mob
> that does the same thing; for better or worse I'm now committed to
> recursive partitioning over linear models, as per mob?

If your partitioning variables are particularly simple (e.g., all
binary) you could exploit that and it may be easier to write a custom
function for your particular data. Then likelihood-ratio tests (rather
than LM-type tests) would also be easier to apply in case of
unidentified parameters. But if there are partitioning variables with
different measurement scales, then this will not be that simple...

> (d) failing all of this, does anyone have a link to a way to rebuild, or
> locally modify, an R package (preferably windows, but anything would
> do)?

Have a look at the "Writing R Extensions" manual and the R for Windows
FAQ.

Best,
Z

> Sorry for the length of this post. If I should RTFM, please point me
> at any relevant manual by all means. I've spent a few days on this as
> you can maybe tell, but I'm far from being an R expert.
>
> Thanks for any help you can give.
>
> Best wishes,
>
> Ed
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
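The remark above about large samples and subsampling can be illustrated with a minimal base R sketch. The assumptions are loud ones: the data are simulated, the slope difference of 0.05 between the two groups is an arbitrary choice, and an ordinary t-test on an interaction term stands in for the parameter stability tests that mob() actually uses. The helper name `p_interaction` is hypothetical.

```r
# Simulated data: the slope of x differs only slightly between z = 0 and
# z = 1 (1.00 versus 1.05). All values are assumptions for illustration.
set.seed(42)
n <- 200000
d <- data.frame(x = rnorm(n), z = rbinom(n, 1, 0.5))
d$y <- 1 + (1 + 0.05 * d$z) * d$x + rnorm(n)

# Hypothetical helper: p-value for the difference in slopes between groups,
# taken from the x:z interaction term of an ordinary linear model.
p_interaction <- function(data) {
  fit <- lm(y ~ x * z, data = data)
  coef(summary(fit))["x:z", "Pr(>|t|)"]
}

# Full data: the small slope difference is detected with a very small p-value.
p_full <- p_interaction(d)

# 1% subsample drawn without replacement.
idx   <- sample(nrow(d), size = nrow(d) %/% 100)
d_sub <- d[idx, ]
p_sub <- p_interaction(d_sub)

print(c(full = p_full, subsample = p_sub))
```

With the full data the test flags a difference that may be of little practical relevance, which is the overfitting risk described above. On the subsample the standard error is roughly ten times larger, so the same difference is much less likely to trigger a split, although the exact p-value depends on the draw.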