Hi,

We have a question about interpreting R-squared.

The actual outputs for a test dataset are X=(x1,x2, ..., xn).
Model 1 predicted the outputs as Y1=(y11,y12,..., y1n)
Model 2 predicted the outputs as Y2=(y21,y22,..., y2n)
...
Model m predicted the outputs as Ym=(ym1,ym2,..., ymn)

Now we have two ways to calculate R squared to evaluate the average performance of the committee model.

(a) Calculate R squared between (X, Y1), (X, Y2), ..., (X,Ym), and then average the R squared values.
(b) Calculate the average Y=(Y1+Y2+ ... +Ym)/m, and then calculate the R squared between (X, Y).

We found that the R squared calculated in (b) seems to be 'always' higher than that in (a).

Does this result depend on the test dataset, or did it happen by chance? Can you suggest any reference on this issue?

Many thanks in advance!

Kan
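The two calculations, (a) and (b), can be sketched as follows. This is only a minimal illustration in Python, assuming that "R squared between (X, Y)" means the squared Pearson correlation. The data values and the helper names `r_squared`, `avg_of_r2` and `r2_of_avg` are made up for the example and do not come from the original post.

```python
from math import fsum

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = fsum((a - mx) ** 2 for a in x)
    syy = fsum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)  # assumes neither sequence is constant

def avg_of_r2(x, preds):
    """Method (a): R squared of each model, then the mean of those values."""
    return fsum(r_squared(x, p) for p in preds) / len(preds)

def r2_of_avg(x, preds):
    """Method (b): average the predictions, then compute one R squared."""
    m = len(preds)
    y_bar = [fsum(col) / m for col in zip(*preds)]
    return r_squared(x, y_bar)

# Illustrative data only (hypothetical, not from the post).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = [
    [1.5, 1.5, 3.5, 3.5, 5.5],  # model 1
    [0.5, 2.5, 2.5, 4.5, 4.5],  # model 2
]
a = avg_of_r2(x, preds)
b = r2_of_avg(x, preds)
print(f"(a) = {a:.4f}, (b) = {b:.4f}")
```

In this made-up data the two models' errors have opposite signs and cancel in the average, so (b) comes out higher than (a), as in the question. That is a property of these particular numbers, not a general result.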
Suppose m=2, Y1=Y and Y2= -Y. Then the average in (b) is identically zero, so (b) is zero, and (a) must be greater than or equal to (b). Thus (b) is not necessarily greater than (a).

kan Liu <kan_liu1 <at> yahoo.com> writes:

: We found it seemed that R squared calculated in (b) is 'always' higher than that in (a).
:
: ______________________________________________
: R-help <at> stat.math.ethz.ch mailing list
: https://www.stat.math.ethz.ch/mailman/listinfo/r-help
: PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
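The counterexample with Y2 = -Y1 can be checked numerically. The sketch below is an illustration in Python, assuming R squared means the squared Pearson correlation. Because the averaged prediction is constant (all zeros), its correlation with X is strictly undefined; the code follows the message and treats that case as 0, which is a convention chosen for this example. The data and the helper name `r_squared` are made up.

```python
from math import fsum

def r_squared(x, y):
    """Squared Pearson correlation; returns 0.0 if either input is constant.

    The zero-variance convention is an assumption made for this example:
    strictly, the correlation with a constant vector is undefined.
    """
    n = len(x)
    mx, my = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = fsum((a - mx) ** 2 for a in x)
    syy = fsum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy * sxy / (sxx * syy)

# Illustrative data only (hypothetical).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y1 = [1.2, 1.9, 3.4, 3.8, 5.1]  # a reasonable model
y2 = [-v for v in y1]           # its exact negation, as in the example

# Method (a): mean of the per-model R squared values.
a = (r_squared(x, y1) + r_squared(x, y2)) / 2
# Method (b): R squared of the averaged prediction, which is all zeros.
y_bar = [(u + v) / 2 for u, v in zip(y1, y2)]
b = r_squared(x, y_bar)

print(f"(a) = {a:.4f}, (b) = {b:.4f}")
```

Negating a prediction leaves its squared correlation with X unchanged, so (a) equals the R squared of Y1 alone, while (b) is 0 under the stated convention.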
The Y1, Y2, etc. that Kan mentioned are predicted values for test set data from models that supposedly were fitted to the same (or similar) data. It's hard for me to imagine the outcome would be as `severe' as Y1 = -Y2. That said, I do not think that the R-squared (or q-squared as some call it) of the aggregate model is necessarily larger than or equal to the average R-squared of the component models. It obviously depends on how the component models are generated.

As a hypothetical example (I haven't actually tried it, just speculating): Suppose the data are generated from a step function, the sort that would be perfect for regression trees. If one grows several well-pruned trees, I'd guess that the average R-squared of the individual trees has a chance of being larger than the R-squared of the averaged model.

Best,
Andy

> From: Gabor Grothendieck
>
> Suppose m=2, Y1=Y and Y2= -Y. Then (b) is zero so (a) must be
> greater or equal to (b). Thus (b) is not necessarily greater
> than (a).