thr3ads.net - R help - [R] what does cut(data, breaks=n) actually do? [Dec 2007]

melissa cline

2007-Dec-13 01:04 UTC

[R] what does cut(data, breaks=n) actually do?

Hello,

I'm trying to bin a quantity into 2-3 bins for calculating entropy and
mutual information.  One of the approaches I'm exploring is the cut()
function, which is what the mutualInfo function in binDist uses.  When it's
called in the format cut(data, breaks=n), it somehow splits the data into n
distinct bins.  Can anyone tell me how cut() decides where to cut?

Thanks,

Melissa



---------------------------------------------------------------
Melissa Cline, Independent Investigator
MCD Biology, UCSC

Peter Dalgaard

2007-Dec-13 08:32 UTC

melissa cline wrote:
> Hello,
>
> I'm trying to bin a quantity into 2-3 bins for calculating entropy and
> mutual information.  One of the approaches I'm exploring is the cut()
> function, which is what the mutualInfo function in binDist uses.  When it's
> called in the format cut(data, breaks=n), it somehow splits the data into n
> distinct bins.  Can anyone tell me how cut() decides where to cut?

This is one case where reading the actual R code is easier than
explaining what it does.  From cut.default:

    if (length(breaks) == 1) {
        if (is.na(breaks) | breaks < 2)
            stop("invalid number of intervals")
        nb <- as.integer(breaks + 1)
        dx <- diff(rx <- range(x, na.rm = TRUE))
        if (dx == 0)
            dx <- rx[1]
        breaks <- seq.int(rx[1] - dx/1000, rx[2] + dx/1000, length.out = nb)
    }

so basically it takes the range, extends it a bit and splits it into
<breaks> equally long segments.
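
As a rough illustration of the excerpt above, the sketch below recomputes the break points by hand and passes them back to cut(). The helper name equal_width_breaks and the sample vector are made up for this example, and later versions of R may compute the breaks slightly differently from the 2007 source quoted here.

```r
# Hypothetical helper mirroring the quoted branch of cut.default:
# extend the data range by 0.1% on each side, then place n + 1
# equally spaced break points across it.
equal_width_breaks <- function(x, n) {
  rx <- range(x, na.rm = TRUE)
  dx <- diff(rx)
  if (dx == 0) dx <- rx[1]   # same fallback as the excerpt
  seq(rx[1] - dx / 1000, rx[2] + dx / 1000, length.out = n + 1)
}

x <- c(2, 5, 7, 11, 20)          # made-up sample data
b <- equal_width_breaks(x, 3)
diff(b)                          # three equal widths

bins_manual <- cut(x, breaks = b)
bins_auto   <- cut(x, breaks = 3)
# Compare bin membership only; the labels may be formatted differently.
identical(as.integer(bins_manual), as.integer(bins_auto))
```

The comparison uses the integer bin codes rather than the labels because cut() rounds the labels for display.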

(For the sometimes more attractive option of splitting into groups of 
roughly equal size, there is cut2 in the Hmisc package, or use quantile())
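
A minimal sketch of the quantile() route mentioned above, using made-up data with no ties. The argument include.lowest = TRUE is needed here so that the smallest value, which coincides with the first break, is not dropped into NA.

```r
x <- c(1, 3, 4, 8, 15, 16, 23, 42, 50)     # made-up sample data, no ties

# Break points at the 0%, 33.3%, 66.7% and 100% quantiles.
qs <- quantile(x, probs = (0:3) / 3)

# include.lowest = TRUE closes the first interval on the left,
# so min(x) lands in the first bin.
bins <- cut(x, breaks = qs, include.lowest = TRUE)
table(bins)                                # counts per bin
```

With heavily tied data the quantiles can coincide, in which case this approach fails as written.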

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Domenico Vistocco

2007-Dec-13 09:17 UTC

cut(data, breaks=n)
splits the data into n bins of (approximately) equal width.

The width used is obtained as:
(max(data) - min(data)) / n

 > x=rnorm(x)
 > cut(x,breaks=3)
 [1] (1.79,9.97]  (-6.39,1.79] (9.97,18.2]  (9.97,18.2]  (-6.39,1.79]
 [6] (1.79,9.97]  (-6.39,1.79] (1.79,9.97]  (-6.39,1.79] (-6.39,1.79]
Levels: (-6.39,1.79] (1.79,9.97] (9.97,18.2]

Then you have:
 > 18.2-9.97
[1] 8.23
 > 9.97-1.79
[1] 8.18
 > 1.79+6.39
[1] 8.18
 >

 > (max(x)-min(x))/3
[1] 8.164187

I don't know the reason for the small differences (I am still wondering about them).
I hope it is useful.
domenico
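
One likely explanation for the small differences: cut() rounds the interval labels for display (3 significant digits by default, controlled by the dig.lab argument), and, per the cut.default code quoted earlier in the thread, the range is widened slightly before it is split. The sketch below uses made-up data, not the values from the post above, and only shows that asking for more digits changes the labels while the bins stay the same.

```r
x <- c(-6.37, 0.5, 3.2, 9.1, 12.4, 18.12)   # hypothetical data

coarse <- cut(x, breaks = 3)                # default dig.lab = 3
fine   <- cut(x, breaks = 3, dig.lab = 8)   # same breaks, more digits shown

levels(coarse)
levels(fine)

# Bin membership is unaffected by the label precision.
identical(as.integer(coarse), as.integer(fine))
```

Subtracting the rounded end points read off the labels therefore gives only approximate widths.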

Tony Plate

2007-Dec-15 05:45 UTC

Peter Dalgaard wrote:
> (For the sometimes more attractive option of splitting into groups of
> roughly equal size, there is cut2 in the Hmisc package, or use quantile())

It can be a bit dangerous to use quantile() to provide breaks for cut(),
because quantiles can be non-unique, which cut() doesn't like:

> x1 <- c(1,1,1,1,1,1,1,1,1,2)
> cut(x1, breaks=quantile(x1, (0:2)/2))
Error in cut.default(x1, breaks = quantile(x1, (0:2)/2)) :
   'breaks' are not unique

However, cut2() in Hmisc handles this situation gracefully:

> library(Hmisc)
Attaching package: 'Hmisc'
        The following object(s) are masked from package:base :
          format.pval,
          round.POSIXt,
          trunc.POSIXt,
          units
> cut2(x1, g=2)
 [1] 1 1 1 1 1 1 1 1 1 2
Levels: 1 2

(Additionally, a potentially dangerous peculiarity of quantile() for 
this kind of purpose is that its return values can be out of order 
(i.e., diff(quantile(...))<0, at rounding error level), but this doesn't 
actually upset cut() in R because cut() sorts the breaks prior to using 
them.)

-- Tony Plate
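
One possible base-R workaround for the non-unique breaks problem, not proposed in the thread itself, is to deduplicate the quantiles before handing them to cut(). This is only a sketch: it avoids the error, but it silently produces fewer bins than requested, which may or may not be acceptable.

```r
x1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2)

qs  <- quantile(x1, (0:2) / 2)     # 1, 1, 2: the median equals the minimum
brk <- unique(qs)                  # drop the duplicated break point

# include.lowest = TRUE keeps the values equal to the first break.
bins <- cut(x1, breaks = brk, include.lowest = TRUE)
table(bins)                        # one bin is left, not two

# cut() sorts its breaks, so the order they are given in does not matter.
identical(cut(1:9, breaks = c(10, 0, 5)), cut(1:9, breaks = c(0, 5, 10)))
```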
