Jean Bréfort wrote:
> One other totally unrelated thing. We recently got a bug report about an
> incorrect R squared in the Gnumeric regression code
> (http://bugzilla.gnome.org/show_bug.cgi?id=534659). R (version 2.7.0)
> gives the same result as Gnumeric, as can be seen below:
>
>
>> mydata <- read.csv(file="data.csv",sep=",")
>> mydata
>>
>     X  Y
> 1   1  2
> 2   2  4
> 3   3  5
> 4   4  8
> 5   5  0
> 6   6  7
> 7   7  8
> 8   8  9
> 9   9 10
>
>> summary(lm(mydata$Y~mydata$X))
>>
>
> Call:
> lm(formula = mydata$Y ~ mydata$X)
>
> Residuals:
> Min 1Q Median 3Q Max
> -5.8889 0.2444 0.5111 0.7111 2.9778
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 1.5556 1.8587 0.837 0.4303
> mydata$X 0.8667 0.3303 2.624 0.0342 *
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 2.559 on 7 degrees of freedom
> Multiple R-squared: 0.4958, Adjusted R-squared: 0.4238
> F-statistic: 6.885 on 1 and 7 DF, p-value: 0.03422
>
>
>> summary(lm(mydata$Y~mydata$X-1))
>>
>
> Call:
> lm(formula = mydata$Y ~ mydata$X - 1)
>
> Residuals:
> Min 1Q Median 3Q Max
> -5.5614 0.1018 0.3263 1.6632 3.5509
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> mydata$X 1.1123 0.1487 7.481 7.06e-05 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 2.51 on 8 degrees of freedom
> Multiple R-squared: 0.8749, Adjusted R-squared: 0.8593
> F-statistic: 55.96 on 1 and 8 DF, p-value: 7.056e-05
>
> I am unable to figure out what this 0.8749 value might represent. If it
> is intended to be the Pearson moment, it should be 0.4958, and if it is
> the coefficient of determination, I think the correct value would be
> 0.4454, as given by Excel. It's of course nice to have the same result
> in R and Gnumeric, but it would be better if this result were accurate
> (if it is, we need some documentation fix). Btw, I am not a statistics
> expert at all.
>
This horse has been flogged multiple times on the list.
It is of course mainly a matter of convention, but the convention used
by R has been around at least since Genstat in the mid-1970s. In the
no-intercept case, you get the _uncentered_ version of R-squared; that
is, the proportion of the total sum of squares explained by the model
(as opposed to the sum of squares of _deviations_ from the mean, as in
the usual case). The rationale is that R^2 should be based on the
reduction in residual variation between two nested models, and if
there's no intercept, the only well-determined nested submodel is the
one where mydata$Y has mean zero for all x, corresponding to all-zero
regression coefficients. The
resulting R^2 is directly related to the F statistic, which you'll see
is also larger and more significant when the intercept is removed.
BTW: lm(mydata$Y~mydata$X) is bad practice; use lm(Y~X, data=mydata).
Use of predict() will demonstrate why.
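
[Editor's note: the three competing values in this thread (R's 0.8749, the
Pearson-based 0.4958, and Excel's 0.4454) can all be reproduced from the
posted data with a few lines of arithmetic. The sketch below uses Python
rather than R, purely so it is self-contained; the formulas are the ones
described in the reply above.]

```python
# Reproduce the three R^2 values from the thread, using the posted data.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 4, 5, 8, 0, 7, 8, 9, 10]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Model WITH intercept: ordinary (centered) R^2.
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b1 = sxy / sxx                     # slope     -> 0.8667
b0 = my - b1 * mx                  # intercept -> 1.5556
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
syy = sum((yi - my) ** 2 for yi in y)        # centered total SS
r2_centered = 1 - sse / syy        # -> 0.4958, as in the first summary()

# Model WITHOUT intercept: R reports the UNCENTERED R^2, i.e. the residual
# SS compared against the raw sum of squares of y (the nested submodel is
# "mean zero for all x", as explained above).
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)  # -> 1.1123
sse0 = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
r2_uncentered = 1 - sse0 / sum(yi ** 2 for yi in y)  # -> 0.8749

# Excel's 0.4454 mixes the conventions: the no-intercept fit's residual SS
# against the CENTERED total sum of squares.
r2_excel_style = 1 - sse0 / syy    # -> 0.4454

print(round(r2_centered, 4), round(r2_uncentered, 4), round(r2_excel_style, 4))
# prints: 0.4958 0.8749 0.4454
```

This also makes the F-statistic connection concrete: the uncentered R^2
compares against a much larger total sum of squares, which is why both R^2
and F come out larger when the intercept is dropped.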
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907