Hi,

Standard correlations (Pearson's, Spearman's, Kendall's tau) do not accurately reflect how closely the model (a GAM) fits the data. I was told that the fit can be measured more accurately using a root mean square deviation (RMSD) calculation on binned data.

For example, let 'o' be the real, observed data and 'm' be the model data. I believe I can calculate the root mean square deviation as:

    sqrt( mean( (o - m)^2 ) )

However, this does not bin the data into mean sets. What I would like to do is:

    oangry <- c( mean(o[1:5]), mean(o[6:10]), ... )
    mangry <- c( mean(m[1:5]), mean(m[6:10]), ... )

Then:

    sqrt( mean( (oangry - mangry)^2 ) )

I would like to simplify that calculation into something like:

    sqrt( mean( (bin(o, 5) - bin(m, 5))^2 ) )

I have read the help for ?cut, ?table, ?hist, and ?split, but am stumped as to which one to use in this case, if any.

How do you calculate c( mean(o[1:5]), mean(o[6:10]), ... ) for a vector of arbitrary length, using an appropriate number of bins (fixed at 5, or perhaps calculated using Sturges' formula)?

I have also posted a more detailed version of this question on StackOverflow:

http://stackoverflow.com/questions/3073365/root-mean-square-deviation-on-binned-gam-results-using-r

Many thanks.

Dave
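For illustration, a minimal sketch of the kind of bin() helper being asked for here, assuming index-based bins of 5 consecutive values. The bin() name and the sample data are hypothetical, not from the thread:

    # Hypothetical helper: mean of every consecutive group of `size` elements.
    # Any trailing elements that do not fill a whole bin are averaged together.
    bin <- function(x, size = 5) {
      groups <- ceiling(seq_along(x) / size)   # 1,1,1,1,1,2,2,...
      tapply(x, groups, mean)
    }

    o <- runif(67, 0, 9)           # made-up "observed" data
    m <- o + rnorm(67, sd = 0.5)   # made-up stand-in for the GAM predictions

    rmsd_binned <- sqrt(mean((bin(o, 5) - bin(m, 5))^2))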
Hi,

To calculate the means of binned data from a vector 'd1' of arbitrary length, the following works:

    d1 <- runif( 67, 0, 9 )
    while ( length(d1) %% 5 != 0 ) { d1 <- d1[-length(d1)] }
    dmean1 <- apply( matrix(d1, nrow = 5), 2, mean )

Unfortunately, this means dropping (two) data points from the end before I can execute:

    sqrt( mean( (dmean1 - dmean2)^2 ) )

where 'dmean2' is a vector of values constructed by running the GAM on 'dmean1'.

What is a better way to do this?

Many thanks!

Dave
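One possible way to keep the trailing points instead of dropping them is to pad the vector with NA up to a multiple of 5 and take column means that ignore the padding. This is a sketch only, not an answer given in the thread:

    d1 <- runif(67, 0, 9)
    n  <- 5 * ceiling(length(d1) / 5)            # next multiple of 5
    padded <- c(d1, rep(NA, n - length(d1)))     # pad the tail with NA
    # na.rm = TRUE means the two padding NAs are simply ignored in the last bin
    dmean1 <- colMeans(matrix(padded, nrow = 5), na.rm = TRUE)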
Don't know about the correlations (never used them in a GAM context, actually...), but you can "bin" the means like this:

    > x <- 1:100
    > tapply( x, cut(x, 5), mean )
    (0.901,20.7]  (20.7,40.6]  (40.6,60.4]  (60.4,80.3]   (80.3,100]
            10.5         30.5         50.5         70.5         90.5

Cheers
Joris
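A sketch connecting this suggestion back to the RMSD question in the first post. Cutting the index (seq_along) rather than the values groups the observations by position, roughly five per bin; that positional grouping is an assumption about what was wanted, and the data here are made up:

    o <- runif(67, 0, 9)            # made-up observed data
    m <- o + rnorm(67, sd = 0.5)    # made-up stand-in for the GAM predictions

    bins <- cut(seq_along(o), ceiling(length(o) / 5))
    obar <- tapply(o, bins, mean)   # binned means of the observations
    mbar <- tapply(m, bins, mean)   # binned means of the model values
    rmsd <- sqrt(mean((obar - mbar)^2))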
On Jun 18, 2010, at 7:54 PM, David Jarvis wrote:

> Standard correlations (Pearson's, Spearman's, Kendall's tau) do not
> accurately reflect how closely the model (a GAM) fits the data. I was told
> that the fit can be measured more accurately using a root mean square
> deviation (RMSD) calculation on binned data.

By whom? ... and with what theoretical basis?

> I would like to simplify that calculation into something like:
>
>     sqrt( mean( (bin(o, 5) - bin(m, 5))^2 ) )

I doubt that your strategy offers any statistical advantage, but if you want to play around with it, then consider:

    binned.x <- round( (x + 2.5) / 5 )

--
David.
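A sketch of how that one-liner might be used: it assigns each value an integer index for a 5-unit-wide value bin, which can then drive tapply(). The data and the follow-on RMSD step are assumptions for illustration, not part of the reply:

    x <- runif(67, 0, 30)              # made-up observed values
    y <- x + rnorm(67)                 # made-up model values
    binned.x <- round((x + 2.5) / 5)   # index of the 5-unit-wide bin each value falls in
    xbar <- tapply(x, binned.x, mean)  # per-bin means of the observations
    ybar <- tapply(y, binned.x, mean)  # per-bin means of the model values
    sqrt(mean((xbar - ybar)^2))        # RMSD on the binned means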
Just for the record, if you have NA's in it, you do:

    tapply( d, cut(d, round(length(d)/5)), mean, na.rm = TRUE )

tapply applies a function over a vector, by groups defined by another vector. In this case, it applies the function mean (with the argument na.rm = TRUE) over the vector d, by the groups defined by the cut function. cut splits a numeric vector into intervals of equal width; here the vector is d and the number of bins is round(length(d)/5).

Cheers
Joris

On Sun, Jun 20, 2010 at 1:24 AM, Joris Meys <jorismeys at gmail.com> wrote:
> On Sat, Jun 19, 2010 at 4:29 AM, David Jarvis <thangalin at gmail.com> wrote:
>> Hi, Joris.
>>
>> Thanks again; I don't get it. Reading the help pages for R reminds me of
>> reading the manual pages for Unix: great for people who already know what
>> they mean.
>
> Just read them, from top to bottom, and take a look at the examples. If you
> shy away from them, forget about ever finding your way around R. Never skip
> the details, and run the examples at the bottom. Then you can see what's
> going on; it often clarifies things a whole lot.
>
>> I can see how cut is dividing the data into 14 rows, and I can take the
>> factor results from cut:
>>
>>     tapply( d, cut(d, round(length(d)/5)), mean )
>>
>> But the results are ... well, negative?
>
> That is explained in the help file. The left side is not included in the
> interval: (0,1] is equivalent to ]0,1]. To include the extreme values, the
> lower limit is extended by 0.1% of the range.
>
>> > tapply( d, cut(d, round(length(d)/5)), mean )
>> (-0.009,0.685]   (0.685,1.38]    (1.38,2.07]    (2.07,2.77]    (2.77,3.46]
>>              0              1              2             NA              3
>>    (3.46,4.15]    (4.15,4.85]    (4.85,5.54]    (5.54,6.23]    (6.23,6.93]
>>              4             NA              5              6             NA
>>    (6.93,7.62]    (7.62,8.32]    (8.32,9.01]
>>              7              8              9
>>
>> I don't see how rounding up with ceiling would apply.
>
> Well: 67/5 = 13.4. round gives 13 bins, ceiling gives 14 bins. It's a
> matter of choice.
>
>> I appreciate your patience; I think this might be beyond my capacity to
>> understand.
>
> You ain't stupid. Lazy maybe, but definitely not stupid ;)
>
> Cheers
> Joris
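A short, self-contained sketch of the recipe above, on made-up data (not from the thread): cut() divides the range of d into round(length(d)/5) equal-width intervals, tapply() averages the values that fall into each interval, and na.rm = TRUE is passed through to mean() as in the post above.

    d <- runif(67, 0, 9)                                        # made-up data
    dmean <- tapply(d, cut(d, round(length(d) / 5)), mean, na.rm = TRUE)
    dmean   # one mean per interval; intervals that catch no values show NA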