thr3ads.net - R help - [R] How hist() decides breaks? [May 2008]

(Ted Harding)

2008-May-19 09:31 UTC

[R] How hist() decides breaks?

Hi Folks,
I'd like to know how hist() decides how many cells to use
when it ignores my "suggestion" to use say
'hist(...,breaks=50)'.

More specifically, I have the results of 10000 simulations,
each returning an 8-vector, therefore 8 variables each with
10000 values. Some of these 8 have somewhat skew distributions.
Say one of these 8 variables is X.

I ask for H <- hist(X,breaks=50), and get a histogram which
usually has a different number of cells than what I intended.

For instance, for one of these simulations, the 8 different
values of length(H$breaks) are:

  70, 44, 38, 68, 50, 40, 46, 45

?hist tells me

A)
  breaks: one of:
    *  a vector giving the breakpoints between histogram
       cells,
    *  a single number giving the number of cells for the
       histogram,
    *  a character string naming an algorithm to compute the
       number of cells (see Details),
    *  a function to compute the number of cells.

    In the last three cases the number is a suggestion only. 

B)
  The default for 'breaks' is '"Sturges"': see
'nclass.Sturges'.

If I look at the code for nclass.Sturges() I see

  function (x) ceiling(log2(length(x)) + 1)

and, for length(X) = 10000, this gives 15. This is not related
to any of the numbers of breaks I actually got, in any way obvious
to me.

So:
Question 1: hist() has apparently ignored my "suggestion" of
  "breaks=50". Why? What is the criterion for ignoring?

Question 2: Presumably, if it ignores the "suggestion", it
  does something else, of its choice. I would then, perhaps,
  expect it to fall back to its default, which is (allegedly)
  Sturges. But the result from nclass.Sturges looks different
  from what it actually did. So what did it actually do, and
  how did it decide on this?

With thanks,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 19-May-08                                       Time: 10:31:20
------------------------------ XFMail ------------------------------

jim holtman

2008-May-19 09:38 UTC


[R] How hist() decides breaks?

Why don't you specifically tell hist what breaks to use:

hist(x, breaks=seq(min(x), max(x), length.out=50), include.lowest=TRUE)
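
A minimal sketch of this suggestion, to show that an explicit vector of breakpoints is used as given rather than treated as a suggestion. The simulated data, the seed and the names `brk` and `h` are assumptions for illustration only. Note that a vector of 51 breakpoints gives 50 cells, so the number to pass depends on whether you are counting cells or breakpoints.

```r
set.seed(1)                      # assumption: arbitrary seed, for reproducibility
x <- rexp(10000)                 # assumption: stand-in for one skewed variable X

# 51 equally spaced breakpoints spanning the data define exactly 50 cells
brk <- seq(min(x), max(x), length.out = 51)
h <- hist(x, breaks = brk, include.lowest = TRUE, plot = FALSE)

length(h$breaks)                 # number of breakpoints
length(h$counts)                 # number of cells
```

The cells here start and end at the sample minimum and maximum, so the breakpoints will generally not be round numbers.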


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?


Peter Dalgaard

2008-May-19 10:00 UTC


[R] How hist() decides breaks?

No, it is not ignoring you.

Try

hist(rnorm(10000))
length(hist(rnorm(10000),breaks=50)$breaks)

and repeat a dozen times or so. Chances are that you'll mostly see
lengths around 40, but definitely more than the 17 or so that you'll see
without the breaks=50. Next, try

diff(hist(rnorm(10000),breaks=50)$breaks)

and notice that this is usually 0.2, although if you repeat enough
times, you might get a couple of cases with 0.1 and a length of 75(-ish).

Get it? Otherwise look at help(pretty) since this is what is doing the work.
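
A minimal sketch of the point about pretty(), comparing the breakpoints hist() chooses for breaks=50 with a direct call to pretty() on the range of the data. The seed and the names `h` and `p` are assumptions for illustration, and the `min.n = 1` argument reflects my reading of the source of hist.default, so check it against your own R version.

```r
set.seed(42)                     # assumption: arbitrary seed, for reproducibility
x <- rnorm(10000)

h <- hist(x, breaks = 50, plot = FALSE)

# pretty() picks a "round" step (1, 2 or 5 times a power of 10) giving roughly n intervals
p <- pretty(range(x), n = 50, min.n = 1)   # min.n = 1: assumption about hist.default

isTRUE(all.equal(h$breaks, p))   # do the two sets of breakpoints agree?
unique(round(diff(h$breaks), 10))  # the single cell width that was chosen
length(h$breaks)                 # number of breakpoints actually used
```

Because the step has to be a round number, the count of breakpoints depends on the range of the data, which is why the eight variables end up with different lengths for the same breaks=50.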

    -p

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
