david.beede@mail.doc.gov
2001-Jun-06 16:11 UTC
[R] ppr, number of terms, and data ordering
Dear R listers -- I have several questions about using the ppr command in the modreg module. I discovered -- quite by accident -- that if I re-order the data, I obtain different results. The output below shows what I mean. I have two datasets (dataset1 and dataset2) that are identical (tested using proc compare in SAS) except for the fact that the records are in different order. Below I have pasted in the results from running ppr on the two data sets in their original order (first and third sets of results below) and running ppr after sorting the datasets by idnum (second and third sets of results below). At first I thought that perhaps the regression parameters are different but the underlying results are equivalent but predicted values are significantly different. I tried increasing the bass parameter, thinking that perhaps I was overfitting the data, but the differences in the regression parameters remained. Finally, I originally had lots of other RHS variables, including indicator variables; stripping those variables out did not change my findings, as shown below. My first question is: is there a recommended way to sort the data before running ppr? In the meantime, I'll try sorting by my two continuous RHS variables to see if it makes a difference -- not a definitive answer but it may be suggestive. My second question is: is my method for deciding on the number of terms in the regression okay? What I am doing is first running ppr with a large maximum number of terms, then finding the number of terms that minimizes the goodness-of-fit statistic. Looking at the cpus example in the section of MASS that deals with ppr (pp. 293-294), it is unclear why eight terms were finally chosen, when using ten terms yields a lower goodness-of-fit statistic. Finally, in the same example in MASS, where does the test.cpus() function come from? I couldn't find it in the MASS table of contents on CRAN. Thanks in advance for any help you can give! ######DATASET1, ORIGINAL ORDER> pprfile.ppr <- ppr(+ award~ + ilogemp+ilogage, + data=dataset1, nterms=1, max.terms=40, optlevel=3, bass=0 + )> pprfile.pprCall: ppr.formula(formula = award ~ ilogemp + ilogage, data = dataset1, nterms = 1, max.terms = 40, optlevel = 3, bass = 0) Goodness of fit: 1 terms 2 terms 3 terms 4 terms 5 terms 6 terms 94153840890 97278012020 86500905681 55028690369 53258726650 48006301832 7 terms 8 terms 9 terms 10 terms 11 terms 12 terms 64844079166 44746090412 4502608255 2047094369 4361229925 2811887563 13 terms 14 terms 15 terms 16 terms 17 terms 18 terms 4960791975 2717103497 2265582134 2288868136 2470815605 4989632044 19 terms 20 terms 21 terms 22 terms 23 terms 24 terms 4966101666 3993722223 4000594447 3925383715 7636913238 7714228211 25 terms 26 terms 27 terms 28 terms 29 terms 30 terms 7378463928 7035211389 7007446263 10399858547 0 0 31 terms 32 terms 33 terms 34 terms 35 terms 36 terms 0 0 0 0 0 0 37 terms 38 terms 39 terms 40 terms 0 0 0 0> numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn > 0]) > summary(update(pprfile.ppr,nterms=numterm))Call: ppr.formula(formula = award ~ ilogemp + ilogage, data = dataset1, nterms = numterm, max.terms = 40, optlevel = 3, bass = 0) Goodness of fit: 10 terms 11 terms 12 terms 13 terms 14 terms 15 terms 2047094369 4361229925 2811887563 4960791975 2717103497 2265582134 16 terms 17 terms 18 terms 19 terms 20 terms 21 terms 2288868136 2470815605 4989632044 4966101666 3993722223 4000594447 22 terms 23 terms 24 terms 25 terms 26 terms 27 terms 3925383715 7636913238 7714228211 7378463928 7035211389 7007446263 28 terms 29 terms 30 terms 31 terms 32 terms 33 terms 10399858547 0 0 0 0 0 34 terms 35 terms 36 terms 37 terms 38 terms 39 terms 0 0 0 0 0 0 40 terms 0 Projection direction vectors: term 1 term 2 term 3 term 4 term 5 term 6 ilogemp -0.67134667 -0.02873846 -0.73893911 0.18766759 -0.32913203 0.79075441 ilogage -0.74114348 0.99958697 0.67377221 -0.98223260 -0.94428391 -0.61213353 term 7 term 8 term 9 term 10 ilogemp -0.73097017 0.46223139 -0.28989935 0.43839056 ilogage 0.68240941 -0.88675935 0.95705714 0.89878458 Coefficients of ridge terms: term 1 term 2 term 3 term 4 term 5 term 6 term 7 term 8 53896.97 67906.48 33947.21 54279.37 61051.35 67225.76 65528.85 60914.63 term 9 term 10 58372.10 86302.62>######DATASET1, IDNUM ORDER> order3 <- order(dataset1$idnum) > reorder1 <- dataset1[order3,] > pprfile.ppr <- ppr(+ award~ + ilogemp+ilogage, + data=reorder1, nterms=1, max.terms=40, optlevel=3, bass=0 + )> pprfile.pprCall: ppr.formula(formula = award ~ ilogemp + ilogage, data = reorder1, nterms = 1, max.terms = 40, optlevel = 3, bass = 0) Goodness of fit: 1 terms 2 terms 3 terms 4 terms 5 terms 6 terms 104330711350 96448754204 82123267932 89241900763 62145006339 18715823257 7 terms 8 terms 9 terms 10 terms 11 terms 12 terms 13119857589 34779340331 30300680427 10845449181 25437895985 25390506630 13 terms 14 terms 15 terms 16 terms 17 terms 18 terms 25715475967 32977127206 33617404958 32855949359 31925878071 34135238643 19 terms 20 terms 21 terms 22 terms 23 terms 24 terms 0 0 0 0 0 0 25 terms 26 terms 27 terms 28 terms 29 terms 30 terms 0 0 0 0 0 0 31 terms 32 terms 33 terms 34 terms 35 terms 36 terms 0 0 0 0 0 0 37 terms 38 terms 39 terms 40 terms 0 0 0 0> numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn > 0]) > summary(update(pprfile.ppr,nterms=numterm))Call: ppr.formula(formula = award ~ ilogemp + ilogage, data = reorder1, nterms = numterm, max.terms = 40, optlevel = 3, bass = 0) Goodness of fit: 10 terms 11 terms 12 terms 13 terms 14 terms 15 terms 10845449181 25437895985 25390506630 25715475967 32977127206 33617404958 16 terms 17 terms 18 terms 19 terms 20 terms 21 terms 32855949359 31925878071 34135238643 0 0 0 22 terms 23 terms 24 terms 25 terms 26 terms 27 terms 0 0 0 0 0 0 28 terms 29 terms 30 terms 31 terms 32 terms 33 terms 0 0 0 0 0 0 34 terms 35 terms 36 terms 37 terms 38 terms 39 terms 0 0 0 0 0 0 40 terms 0 Projection direction vectors: term 1 term 2 term 3 term 4 term 5 term 6 ilogemp 0.53566733 -0.03152731 0.25486277 0.19557035 -0.35437387 0.78441650 ilogage -0.84442910 0.99950289 -0.96697723 -0.98068967 -0.93510382 -0.62023444 term 7 term 8 term 9 term 10 ilogemp -0.01470383 0.46360287 -0.27988056 -0.22789063 ilogage -0.99989189 -0.88604310 0.96003483 0.97368674 Coefficients of ridge terms: term 1 term 2 term 3 term 4 term 5 term 6 term 7 term 8 26233.34 60989.33 34132.51 64068.54 39630.44 40275.67 28783.82 27284.49 term 9 term 10 40799.08 49967.01 ######DATASET2, ORIGINAL ORDER> > > pprfile.ppr <- ppr(+ award~ + ilogemp+ilogage, + data=dataset2, nterms=1, max.terms=40, optlevel=3, bass=0 + )> pprfile.pprCall: ppr.formula(formula = award ~ ilogemp + ilogage, data = dataset2, nterms = 1, max.terms = 40, optlevel = 3, bass = 0) Goodness of fit: 1 terms 2 terms 3 terms 4 terms 5 terms 6 terms 107514555509 86364622071 76915236151 68363332758 62669868895 66108915227 7 terms 8 terms 9 terms 10 terms 11 terms 12 terms 10960310955 11415371966 8181125026 7370110042 7083942083 7695270325 13 terms 14 terms 15 terms 16 terms 17 terms 18 terms 9383626616 28487721161 25013255493 30617600484 33616699135 37629909488 19 terms 20 terms 21 terms 22 terms 23 terms 24 terms 0 0 0 0 0 0 25 terms 26 terms 27 terms 28 terms 29 terms 30 terms 0 0 0 0 0 0 31 terms 32 terms 33 terms 34 terms 35 terms 36 terms 0 0 0 0 0 0 37 terms 38 terms 39 terms 40 terms 0 0 0 0> numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn > 0]) > summary(update(pprfile.ppr,nterms=numterm))Call: ppr.formula(formula = award ~ ilogemp + ilogage, data = dataset2, nterms = numterm, max.terms = 40, optlevel = 3, bass = 0) Goodness of fit: 11 terms 12 terms 13 terms 14 terms 15 terms 16 terms 7083942083 7695270325 9383626616 28487721161 25013255493 30617600484 17 terms 18 terms 19 terms 20 terms 21 terms 22 terms 33616699135 37629909488 0 0 0 0 23 terms 24 terms 25 terms 26 terms 27 terms 28 terms 0 0 0 0 0 0 29 terms 30 terms 31 terms 32 terms 33 terms 34 terms 0 0 0 0 0 0 35 terms 36 terms 37 terms 38 terms 39 terms 40 terms 0 0 0 0 0 0 Projection direction vectors: term 1 term 2 term 3 term 4 term 5 ilogemp 0.740357413 -0.002414494 -0.549155789 0.189738006 -0.375109022 ilogage -0.672213435 0.999997085 0.835720000 -0.981834756 -0.926980702 term 6 term 7 term 8 term 9 term 10 ilogemp 0.884994686 0.144491176 0.445318057 -0.279893879 0.249729588 ilogage -0.465601123 -0.989506089 -0.895372452 0.960030946 -0.968315616 term 11 ilogemp -0.245536824 ilogage 0.969387264 Coefficients of ridge terms: term 1 term 2 term 3 term 4 term 5 term 6 term 7 term 8 19688.36 46872.70 48959.17 61038.99 50865.32 41610.74 29375.41 25582.86 term 9 term 10 term 11 51523.50 34629.45 30547.80>######DATASET2, IDNUM ORDER> order4 <- order(dataset2$idnum) > reorder2 <- dataset2[order4,] > pprfile.ppr <- ppr(+ award~ + ilogemp+ilogage, + data=reorder2, nterms=1, max.terms=40, optlevel=3, bass=0 + )> pprfile.pprCall: ppr.formula(formula = award ~ ilogemp + ilogage, data = reorder2, nterms = 1, max.terms = 40, optlevel = 3, bass = 0) Goodness of fit: 1 terms 2 terms 3 terms 4 terms 5 terms 6 terms 104330711350 96448754204 82123267932 89241900763 62145006339 18715823257 7 terms 8 terms 9 terms 10 terms 11 terms 12 terms 13119857589 34779340331 30300680427 10845449181 25437895985 25390506630 13 terms 14 terms 15 terms 16 terms 17 terms 18 terms 25715475967 32977127206 33617404958 32855949359 31925878071 34135238643 19 terms 20 terms 21 terms 22 terms 23 terms 24 terms 0 0 0 0 0 0 25 terms 26 terms 27 terms 28 terms 29 terms 30 terms 0 0 0 0 0 0 31 terms 32 terms 33 terms 34 terms 35 terms 36 terms 0 0 0 0 0 0 37 terms 38 terms 39 terms 40 terms 0 0 0 0> numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn > 0]) > summary(update(pprfile.ppr,nterms=numterm))Call: ppr.formula(formula = award ~ ilogemp + ilogage, data = reorder2, nterms = numterm, max.terms = 40, optlevel = 3, bass = 0) Goodness of fit: 10 terms 11 terms 12 terms 13 terms 14 terms 15 terms 10845449181 25437895985 25390506630 25715475967 32977127206 33617404958 16 terms 17 terms 18 terms 19 terms 20 terms 21 terms 32855949359 31925878071 34135238643 0 0 0 22 terms 23 terms 24 terms 25 terms 26 terms 27 terms 0 0 0 0 0 0 28 terms 29 terms 30 terms 31 terms 32 terms 33 terms 0 0 0 0 0 0 34 terms 35 terms 36 terms 37 terms 38 terms 39 terms 0 0 0 0 0 0 40 terms 0 Projection direction vectors: term 1 term 2 term 3 term 4 term 5 term 6 ilogemp 0.53566733 -0.03152731 0.25486277 0.19557035 -0.35437387 0.78441650 ilogage -0.84442910 0.99950289 -0.96697723 -0.98068967 -0.93510382 -0.62023444 term 7 term 8 term 9 term 10 ilogemp -0.01470383 0.46360287 -0.27988056 -0.22789063 ilogage -0.99989189 -0.88604310 0.96003483 0.97368674 Coefficients of ridge terms: term 1 term 2 term 3 term 4 term 5 term 6 term 7 term 8 26233.34 60989.33 34132.51 64068.54 39630.44 40275.67 28783.82 27284.49 term 9 term 10 40799.08 49967.01 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
On Wed, 6 Jun 2001 david.beede at mail.doc.gov wrote:> > Dear R listers -- > > I have several questions about using the ppr command in the modreg module. > > I discovered -- quite by accident -- that if I re-order the data, I obtain different > results. The output below shows what I mean. I have two datasets (dataset1 and dataset2) > that are identical (tested using proc compare in SAS) except for the fact that the records > are in different order. Below I have pasted in the results from running ppr on the two > data sets in their original order (first and third sets of results below) and running > ppr after sorting the datasets by idnum (second and third sets of results below).Because there can be multiple local minima in the fit criterion, this is not at all unusual. Your mistake is assuming that there is a single possible answer. Same with nnet, and ppr is a more general form of neural nets. Not so widely known, but true of gams too. Having said that, I think you are vastly over-fitting. The usual idea is to have fewer terms than explanatory variables, or not many more.> At first I thought that perhaps the regression parameters are different but the underlying > results are equivalent but predicted values are significantly different. I tried > increasing the bass parameter, thinking that perhaps I was overfitting the data, > but the differences in the regression parameters remained. Finally, I originally had lots > of other RHS variables, including indicator variables; stripping those variables out > did not change my findings, as shown below. > > My first question is: is there a recommended way to sort the data before running ppr? > In the meantime, I'll try sorting by my two continuous RHS variables to see if it makes > a difference -- not a definitive answer but it may be suggestive.No. But using a smoother other than supsmu (like smoothing splines) is usually a better idea. If there is a sensible model then most orderings (or machines or compiler levels of optimization ...) will give similar fits. Seeing instability is probably a sign that the model is not good.> My second question is: is my method for deciding on the number of terms in the regression > okay? What I am doing is first running ppr with a large maximum number of terms, then > finding the number of terms that minimizes the goodness-of-fit statistic. Looking at the > cpus example in the section of MASS that deals with ppr (pp. 293-294), it is unclear why > eight terms were finally chosen, when using ten terms yields a lower goodness-of-fit > statistic.Because in theory the goodness of fit decreases (numerically) with the number of terms, although the optimization is inexact. One is looking for a break in slope in the decrease.> Finally, in the same example in MASS, where does the test.cpus() function come from? I > couldn't find it in the MASS table of contents on CRAN.Because it is in MASS the book, not MASS the package. Try the scripts in the MASS package, specifically ch06.R or ch09.R. Probably you need to read a lot more about the background, although the papers are scattered and often inaccessible. -- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272860 (secr) Oxford OX1 3TG, UK Fax: +44 1865 272595 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._