thr3ads.net - R devel - [Rd] scale(x, center=FALSE) (PR#14219) [Feb 2010]

mrizzo at bgsu.edu

2010-Feb-22 03:30 UTC

[Rd] scale(x, center=FALSE) (PR#14219)

Full_Name: Maria Rizzo
Version: 2.10.1 (2009-12-14) 
OS: Windows XP SP3
Submission from: (NULL) (72.241.75.222)


platform       i386-pc-mingw32              
arch           i386                         
os             mingw32                      
system         i386, mingw32                
status                                      
major          2                            
minor          10.1                         
year           2009                         
month          12                           
day            14                           
svn rev        50720                        
language       R                            
version.string R version 2.10.1 (2009-12-14)

scale returns incorrect values when center=FALSE and scale=TRUE.

When center=FALSE, scale=TRUE, the "scale" used is not the square root of
sample variance; the "scale" attribute is equal to sqrt(sum(x^2)/(n-1)).

Example:

x <- runif(10)
n <- length(x)

scaled <- scale(x, center=FALSE, scale=TRUE)
scaled
s.bad <- attr(scaled, "scale")
s.bad  #wrong
sd(x)  #correct

#compute the sd as if data has already been centered
#that is, compute the variance as sum(x^2)/(n-1)

sqrt(sum(x^2)/(n-1))

Ben Bolker

2010-Feb-25 14:30 UTC

<mrizzo <at> bgsu.edu> writes:
> scale returns incorrect values when center=FALSE and scale=TRUE.
> 
> When center=FALSE, scale=TRUE, the "scale" used is not the square root of
> sample variance; the "scale" attribute is equal to sqrt(sum(x^2)/(n-1)).
> 
> Example:
> 
> x <- runif(10)
> n <- length(x)
> 
> scaled <- scale(x, center=FALSE, scale=TRUE)
> scaled
> s.bad <- attr(scaled, "scale")
> s.bad  #wrong
> sd(x)  #correct
> 
> #compute the sd as if data has already been centered
> #that is, compute the variance as sum(x^2)/(n-1)
> 
> sqrt(sum(x^2)/(n-1))
> 
> 
  Are you sure this is a bug? I agree that the way the function
behaves is (to me) mildly confusing, but the documentation says:

* The value of 'scale' determines how column scaling is performed
* (after centering).  If 'scale' is a numeric vector with length
* equal to the number of columns of 'x', then each column of 'x' is
* divided by the corresponding value from 'scale'.  If 'scale' is
* 'TRUE' then scaling is done by dividing the (centered) columns of
* 'x' by their standard deviations, and if 'scale' is 'FALSE', no
* scaling is done.

* The standard deviation for a column is obtained by computing the
* square-root of the sum-of-squares of the non-missing values in the
* column divided by the number of non-missing values minus one
* (whether or not centering was done).

  If you read the first clause of the last sentence of the first
paragraph in isolation, you would have the expectation that the
columns would be scaled by sd(x).  However, the second paragraph
clearly states that the 'standard deviation' is defined here
as the root-mean-square over (n-1), that is, sqrt(sum(x^2)/(n-1)) ...
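The relationship between the two quantities can be checked numerically. The following is only a minimal sketch, assuming a plain numeric vector with no missing values; the names rms and n are illustrative, and the seed is arbitrary. It relies on the identity sum(x^2) = sum((x - mean(x))^2) + n*mean(x)^2, so the quantity used by scale() exceeds sd(x) whenever the mean is nonzero.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- runif(10)
n <- length(x)

# the quantity the documentation describes for center=FALSE, scale=TRUE
rms <- sqrt(sum(x^2) / (n - 1))

# rms^2 = var(x) + n/(n-1) * mean(x)^2
all.equal(rms^2, var(x) + n / (n - 1) * mean(x)^2)

# it matches the attribute stored by scale(); the full attribute name
# is "scaled:scale"
s <- scale(x, center = FALSE, scale = TRUE)
all.equal(rms, as.numeric(attr(s, "scaled:scale")))
```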

  This does seem like a funny choice, but it is probably stuck
that way without an extremely compelling argument to the contrary.
If you want to scale columns by sd() instead you can say

scale(x,center=FALSE,scale=apply(x,2,sd))
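Note that apply(x,2,sd) needs a matrix (or data frame); for a bare vector such as the one in the original report, passing scale=sd(x) should do the same job. A small sketch of the matrix case, assuming a hypothetical 10 x 3 matrix X whose columns are deliberately far from zero mean:

```r
set.seed(2)                                          # arbitrary seed
X <- matrix(runif(30, min = 5, max = 6), ncol = 3)   # hypothetical data

# divide each column by its sd() without centering
Xs <- scale(X, center = FALSE, scale = apply(X, 2, sd))

apply(Xs, 2, sd)                                     # column sds after scaling
all.equal(attr(Xs, "scaled:scale"), apply(X, 2, sd)) # scale values as supplied

# vector case: a length-one scale for the single column
xs <- scale(X[, 1], center = FALSE, scale = sd(X[, 1]))
sd(xs)
```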

  Would you like to submit a patch for the documentation that
would preserve the sense, clarify the behavior, and not be
much longer than the current version ... ?

  cheers
    Ben Bolker

Ben Bolker

2010-Feb-26 17:44 UTC

[cc'ing back to r-devel]

Maria Rizzo wrote:
> Ben,
> 
> I receive the digest version of r-devel - so I do not have the 
> individual messages to reply to. In reply to yours:
> 
> I think this is a bug for the following reasons. While it is true 
> that one can define a scale factor differently for different 
> purposes, one would hope that within a given function the definition 
> does not vary. If we agree that we want to divide by standard 
> deviation, which scales data to sd=1, then why would we choose to 
> divide by square root of 1/(n-1) times sum of squares of the data 
> when data is not centered? This does not scale the data to sd=1.
  This is really a disagreement with the way the function is implemented
(and I happen to agree with you), but I would argue that it is *not* a
bug in the strict sense -- I would call it a "misfeature".
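The practical consequence discussed here can be seen directly. This is a rough sketch, assuming a numeric vector with nonzero mean and no missing values; y and z are illustrative names:

```r
set.seed(3)                 # arbitrary seed
x <- runif(10)
n <- length(x)

# default scaling without centering
y <- as.vector(scale(x, center = FALSE, scale = TRUE))
sd(y)                       # differs from 1 when mean(x) is not 0
sqrt(sum(y^2) / (n - 1))    # the quantity that is normalised to 1

# with centering, the result does have sd 1
z <- as.vector(scale(x, center = TRUE, scale = TRUE))
sd(z)
```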

  From the R FAQ:
> Finally, a command's intended definition may not be best for 
> statistical analysis. This is a very important sort of problem, but
> it is also a matter of judgment. 

 [snip]
>> Are you sure this is a bug? I agree that the way the function 
>> behaves is (to me) mildly confusing, but the documentation says:
>> 
>> * The value of 'scale' determines how column scaling is performed
>> * (after centering).  If 'scale' is a numeric vector with length
>> * equal to the number of columns of 'x', then each column of 'x' is
>> * divided by the corresponding value from 'scale'.  If 'scale' is
>> * 'TRUE' then scaling is done by dividing the (centered) columns of
>> * 'x' by their standard deviations, and if 'scale' is 'FALSE', no
>> * scaling is done.
>> 
>> * The standard deviation for a column is obtained by computing the
>> * square-root of the sum-of-squares of the non-missing values in the
>> * column divided by the number of non-missing values minus one
>> * (whether or not centering was done).
>> 
>> If you read the first clause of the last sentence of the first 
>> paragraph in isolation, you would have the expectation that the 
>> columns would be scaled by sd(x).  However, the second paragraph 
>> clearly states that the 'standard deviation' is defined here as the
>> root-mean-square over (n-1), that is, sqrt(sum(x^2)/(n-1)) ...
> 
> This conflicts with the paragraph above it. What I see is that the 
> (centered) columns are divided by their standard deviations, where 
> (centered) is inserted or not before "columns" depending on whether
> center=TRUE or center=FALSE. Why modify the definition of "standard 
> deviation"? Why compute the standard deviation of the centered data 
> when data is not centered? This measures standard deviation with 
> respect to the origin rather than measuring dispersion about the 
> mean.
>> This does seem like a funny choice, but it is probably stuck that 
>> way without an extremely compelling argument to the contrary. If 
>> you want to scale columns by sd() instead you can say
>> 
>> scale(x,center=FALSE,scale=apply(x,2,sd))
> 
> Of course, I know how to achieve the result of scaling my data to 
> sd=1. The problem is that a function called scale with options of 
> center=TRUE or center=FALSE, should apply the same definition of 
> scale if scale=TRUE in both cases.
  "should" according to you ...
> 
>> Would you like to submit a patch for the documentation that would 
>> preserve the sense, clarify the behavior, and not be much longer 
>> than the current version ... ?
> 
> For me the problem is deeper than an issue with the documentation. In
>  any case, I think that it is a potential source of confusion and 
> errors on the part of users.
> 
> regards, Maria
> 
  Again, I agree with you that the behavior is not optimal, but it is
very hard to make changes in R when the behavior is sub-optimal rather
than actually wrong (by some definition).  R-core is very conservative
about changes that break backward compatibility; I would like it if they
chose to change the function to use standard deviation rather than
root-mean-square, but I doubt it will happen (and it would break things
for any users who are relying on the current definition).

  It turns out that the documentation for this function was changed on
25 Nov 2009 to clarify this issue, but I think the change (which among
other minor changes modified the previous use of "root mean square" to
"standard deviation") didn't help that much ...  I have attached a patch
file (and append the information below as well) that changes "standard
deviation" back to "root mean square" and is much more explicit about
this issue ... I hope R-core will jump in, critique it, and possibly use
it in some form to improve (?) the documentation ...

  [PS: I have written that the scaling is equivalent to sd() "if and
only if" centering was done.  Technically it would also be equivalent if
the column already had zero mean ...]
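The zero-mean caveat in the PS can be illustrated with a short sketch, assuming a vector that has been centered by hand before calling scale(); x0 is an illustrative name:

```r
set.seed(4)                 # arbitrary seed
x  <- runif(10)
x0 <- x - mean(x)           # column with (numerically) zero mean

# no centering requested, but the data already have mean zero
s <- scale(x0, center = FALSE, scale = TRUE)

# the scale actually used agrees with sd(x0) up to rounding
all.equal(as.numeric(attr(s, "scaled:scale")), sd(x0))
```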

===================================================================
--- scale.Rd	(revision 51180)
+++ scale.Rd	(working copy)
@@ -41,13 +41,18 @@
   equal to the number of columns of \code{x}, then each column of
   \code{x} is divided by the corresponding value from \code{scale}.  If
   \code{scale} is \code{TRUE} then scaling is done by dividing the
-  (centered) columns of \code{x} by their standard deviations, and if
+  (centered) columns of \code{x} by their root-mean-squares, and if
   \code{scale} is \code{FALSE}, no scaling is done.
-
-  The standard deviation for a column is obtained by computing the
-  square-root of the sum-of-squares of the non-missing values in the
-  column divided by the number of non-missing values minus one (whether
-  or not centering was done).
+
+  The root-mean-square for a (possibly centered)
+  column is defined as
+  \eqn{\sqrt{\sum(x^2)/(n-1)}}{sqrt(sum(x^2)/(n-1))},
+  where \eqn{x} is a vector of the non-missing values
+  and \eqn{n} is the number of non-missing values.
+  If (and only if) centering was done,
+  this is equivalent to \code{sd(x,na.rm=TRUE)}.
+  (To scale by the standard deviations without centering,
+  use \code{scale(x,center=FALSE,scale=apply(x,2,sd,na.rm=TRUE))}.)
 }
 \references{
   Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scale.Rd.patch
Type: text/x-patch
Size: 1340 bytes
Desc: not available
URL:
<https://stat.ethz.ch/pipermail/r-devel/attachments/20100226/b83e9d8b/attachment.bin>

Ben Bolker

2010-Mar-12 18:29 UTC

I'm resending this after a week ... I really don't want to nag, but
I also would not like to see this sink below the waves.

  Is there a preferred protocol for requesting comments without nagging
too much?   I would add a comment to 14219 (and was curious to see
whether it was rejected) ... I went to bugzilla, and bug 14219 doesn't
seem to exist any more -- either as open or as closed -- don't know if
it got lost, or thrown away, when the bug system migrated?

   cheers
     Ben Bolker


 [re: behavior of scale() when center=FALSE and scale=TRUE]
>   Again, I agree with you that the behavior is not optimal, but it is
> very hard to make changes in R when the behavior is sub-optimal rather
> than actually wrong (by some definition).  R-core is very conservative
> about changes that break backward compatibility; I would like it if they
> chose to change the function to use standard deviation rather than
> root-mean-square, but I doubt it will happen (and it would break things
> for any users who are relying on the current definition).
[snip]
>  I have attached a patch
> file (and append the information below as well) that changes "standard
> deviation" back to "root mean square" and is much more explicit about
> this issue ... I hope R-core will jump in, critique it, and possibly use
> it in some form to improve (?) the documentation ...
>
>   [PS: I have written that the scaling is equivalent to sd() "if and
> only if" centering was done.  Technically it would also be equivalent if
> the column already had zero mean ...]
>
===================================================================
--- scale.Rd	(revision 51180)
+++ scale.Rd	(working copy)
@@ -41,13 +41,18 @@
   equal to the number of columns of \code{x}, then each column of
   \code{x} is divided by the corresponding value from \code{scale}.  If
   \code{scale} is \code{TRUE} then scaling is done by dividing the
-  (centered) columns of \code{x} by their standard deviations, and if
+  (centered) columns of \code{x} by their root-mean-squares, and if
   \code{scale} is \code{FALSE}, no scaling is done.
-
-  The standard deviation for a column is obtained by computing the
-  square-root of the sum-of-squares of the non-missing values in the
-  column divided by the number of non-missing values minus one (whether
-  or not centering was done).
+
+  The root-mean-square for a (possibly centered)
+  column is defined as
+  \eqn{\sqrt{\sum(x^2)/(n-1)}}{sqrt(sum(x^2)/(n-1))},
+  where \eqn{x} is a vector of the non-missing values
+  and \eqn{n} is the number of non-missing values.
+  If (and only if) centering was done,
+  this is equivalent to \code{sd(x,na.rm=TRUE)}.
+  (To scale by the standard deviations without centering,
+  use \code{scale(x,center=FALSE,scale=apply(x,2,sd,na.rm=TRUE))}.)
 }
 \references{
   Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)

 (Bump re: suggested update to scale.Rd .  Is this under
consideration? I'll stop pestering if it's considered
unacceptable, just don't want it to vanish without a trace ...)


-- 
Ben Bolker
Associate professor, Biology Dep't, Univ. of Florida
bolker at ufl.edu / people.biology.ufl.edu/bolker
GPG key: people.biology.ufl.edu/bolker/benbolker-publickey.asc

