thr3ads.net - R help - [R] [FORGED] Re: merging-binning data [Nov 2015]

Alaios

2015-Nov-04 16:14 UTC

[R] merging-binning data

Thanks for your comments. Actually, only the last group has a single element. The
first group is always "full" of members, so it works fine.
Some constant spacing between the groups would be good as well, so I will
check quantiles.
Thanks for the great support and the time invested in this.
Regards
Alex
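
A possible explanation for the lone element in the last group, assuming the groups came from the floor-based rescaling quoted further down in this thread: the maximum rescales to exactly 1, so floor(1 * nGroups) + 1 evaluates to nGroups + 1, an extra group that holds only the maximum. The sketch below reproduces this with the data from the original post. The pmin() step is an assumption added here for illustration and was not part of the advice given in the thread.

```r
dBin <- c(238.95162, 143.0859, 88.50923, 177.67884, 277.54116, 342.94689,
          241.60905, 177.81969, 211.25559, 279.72702, 381.95738, 483.76363,
          480.98841, 369.75241, 267.7365, 138.55959, 137.93181, 184.752,
          254.64359, 328.87785, 273.15577, 252.5283, 252.5283, 252.5283,
          262.20084, 314.93064, 366.02996, 442.77467, 521.20323, 465.33071,
          366.60582, 13.6954)

nGroups <- 5
scaled <- (dBin - min(dBin)) / (max(dBin) - min(dBin))  # rescale to [0, 1]
groups <- floor(scaled * nGroups) + 1                   # the maximum maps to nGroups + 1
table(groups)                                           # the extra group holds only the maximum

# Illustrative fix (an assumption): fold the maximum back into the top group
groupsClamped <- pmin(groups, nGroups)
table(groupsClamped)
```

With the clamp, the top interval is closed on both ends while the others stay left-closed and right-open.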



     On Wednesday, November 4, 2015 3:34 PM, Boris Steipe <boris.steipe at utoronto.ca> wrote:
   

Whatever approach is "best" to define subsets depends completely on
the semantics of the data. Your approach (a fixed number of equally spaced
breaks) is the right one if the absolute ranges of the data are important. It
should be obvious that either the top or the bottom group could contain only a
single element, and also that any or all of the intermediate groups could be
empty.

If you want to control the number of elements in your groups, use quantiles
instead.
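
A minimal sketch of the quantile suggestion, assuming five groups and the data from the original post: quantile() supplies the break points and cut() with include.lowest = TRUE keeps the minimum in the first bin. The names x, nGroups, breaks and f are illustrative only.

```r
x <- c(238.95162, 143.0859, 88.50923, 177.67884, 277.54116, 342.94689,
       241.60905, 177.81969, 211.25559, 279.72702, 381.95738, 483.76363,
       480.98841, 369.75241, 267.7365, 138.55959, 137.93181, 184.752,
       254.64359, 328.87785, 273.15577, 252.5283, 252.5283, 252.5283,
       262.20084, 314.93064, 366.02996, 442.77467, 521.20323, 465.33071,
       366.60582, 13.6954)

nGroups <- 5
# Break points at the 0%, 20%, ..., 100% quantiles of the data
breaks <- quantile(x, probs = seq(0, 1, length.out = nGroups + 1))
# include.lowest = TRUE closes the first interval on the left so min(x) is kept
f <- cut(x, breaks = breaks, include.lowest = TRUE)
table(f)  # group sizes are roughly equal; ties can shift a few values
```

Note that cut() stops with an error if two quantiles coincide, which can happen with heavily tied data; the breaks would then need to be made unique first.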

Your application may require you to define the breaks in other ways. The code I have
given you doesn't generalize well, as it depends on the equal spacing of
breaks. As I mentioned earlier, I would not store the groups at all - but would
define a function that returns a vector of elements in the group, and in the
function body I would clearly and explicitly define the conditions for group
membership (and comment it). That is how you make code for a task like this
explicit and _maintainable_.
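
One way such a function could look, as a sketch only: the helper names groupMembers and groupIndices are hypothetical, and the bounds 110 and 223 are taken from the [110,223) example mentioned elsewhere in this thread.

```r
x <- c(238.95162, 143.0859, 88.50923, 177.67884, 277.54116, 342.94689,
       241.60905, 177.81969, 211.25559, 279.72702, 381.95738, 483.76363,
       480.98841, 369.75241, 267.7365, 138.55959, 137.93181, 184.752,
       254.64359, 328.87785, 273.15577, 252.5283, 252.5283, 252.5283,
       262.20084, 314.93064, 366.02996, 442.77467, 521.20323, 465.33071,
       366.60582, 13.6954)

# Hypothetical helper: values of x that belong to the group [lower, upper)
groupMembers <- function(x, lower, upper) {
  # Membership condition: lower <= value < upper (left-closed, right-open)
  x[x >= lower & x < upper]
}

# Hypothetical helper: positions in x of the same group members
groupIndices <- function(x, lower, upper) {
  which(x >= lower & x < upper)
}

groupMembers(x, 110, 223)
groupIndices(x, 110, 223)
```

Because the membership condition is spelled out in one place, changing the interval convention later means editing a single line.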


Cheers,
Boris


On Nov 4, 2015, at 9:19 AM, Alaios <alaios at yahoo.com> wrote:
> Thanks, everything is solved and I was even able to plot boxplots as needed.
> The only minor issue is that the max element falls in the last category and is
> the only element there. Perhaps this comes from the way my data look.
> Regards
> Alex
> 
> 
> 
> On Wednesday, November 4, 2015 3:06 PM, Boris Steipe <boris.steipe at utoronto.ca> wrote:
> 
> 
> The breaks are just the min() and max() in your groups. Something like
> 
>   sprintf("[%5.2f,%5.2f]", min(dBin[groups==2]), max(dBin[groups==2]))
> 
> ... should achieve what you need.
> 
> 
> B.
> 
> 
> 
> On Nov 4, 2015, at 8:45 AM, Alaios <alaios at yahoo.com> wrote:
> 
> > You are right.
> > By labels I mean the "categories" or "breaks" that my data fall in.
> > To be part of group 2, for example, a value has to be in the range
> > [110,223). I need to keep those for my plots.
> > 
> > Did I describe it more precisely now?
> > Alex
> > 
> > 
> > 
> > On Wednesday, November 4, 2015 2:09 PM, Boris Steipe <boris.steipe at utoronto.ca> wrote:
> > 
> > 
> > I don't understand:
> > - where does the "label" come from? (It's not an element of your data that I see.)
> > - what do you want to do with this "label", i.e. how does it need to be associated with the data?
> > 
> > 
> > B.
> > 
> > 
> > 
> > On Nov 4, 2015, at 7:57 AM, Alaios <alaios at yahoo.com> wrote:
> > 
> > > Thanks, it works great and gives me group numbers as integers, so
> > > I can select the elements of each group as needed with which(groups == 2).
> > > 
> > > The question, though, is how to also keep the labels for each group,
> > > for example that my first group is [13,206).
> > > 
> > > Regards
> > > Alex
> > > 
> > > 
> > > 
> > > On Wednesday, November 4, 2015 1:00 PM, Boris Steipe <boris.steipe at utoronto.ca> wrote:
> > > 
> > > 
> > > I would transform the original numbers into integers which you
> > > can use as group labels. The row numbers of the group labels are the
> > > indexes of your values.
> > > 
> > > Example: assume your input vector is dBin
> > > 
> > > nGroups <- 5  # number of groups
> > > groups <- (dBin - min(dBin)) / (max(dBin) - min(dBin))  # rescale to the range [0,1]
> > > groups <- floor(groups * nGroups) + 1  # discretize to integers (the maximum lands in group nGroups + 1)
> > > 
> > > Now you can e.g. get the indices for group 2
> > > 
> > > which(groups == 2)
> > > 
> > > Depending on the nature of your input data, it may be better to
> > > keep these groups in a column adjacent to your values, rather than in a
> > > separate vector, or even better to just calculate the groups on the fly in
> > > your downstream analysis with the approach given above in a function, rather
> > > than storing them at all. These are simple operations that should not add
> > > perceptibly to execution time.
> > > 
> > > Cheers,
> > > Boris
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > On Nov 4, 2015, at 6:40 AM, Alaios via R-help <r-help at r-project.org> wrote:
> > > 
> > > > Thanks for the answer. split() does not give me the indexes, though,
> > > > only the group each value falls in. I also need the index of the group:
> > > > is it the first, the second, ... group?
> > > > Alex
> > > > 
> > > > 
> > > > 
> > > > On Tuesday, November 3, 2015 5:05 PM, Ista Zahn <istazahn at gmail.com> wrote:
> > > > 
> > > > 
> > > > Probably
> > > > 
> > > > split(binDistance, test).
> > > > 
> > > > Best,
> > > > Ista
> > > > 
> > > > On Tue, Nov 3, 2015 at 10:47 AM, Alaios via R-help <r-help at r-project.org> wrote:
> > > >> Dear all, I am not exactly sure what the proper name is for what I am trying to do.
> > > >> I have a vector that looks like
> > > >> binDistance
> > > >>            [,1]
> > > >>  [1,] 238.95162
> > > >>  [2,] 143.08590
> > > >>  [3,]  88.50923
> > > >>  [4,] 177.67884
> > > >>  [5,] 277.54116
> > > >>  [6,] 342.94689
> > > >>  [7,] 241.60905
> > > >>  [8,] 177.81969
> > > >>  [9,] 211.25559
> > > >> [10,] 279.72702
> > > >> [11,] 381.95738
> > > >> [12,] 483.76363
> > > >> [13,] 480.98841
> > > >> [14,] 369.75241
> > > >> [15,] 267.73650
> > > >> [16,] 138.55959
> > > >> [17,] 137.93181
> > > >> [18,] 184.75200
> > > >> [19,] 254.64359
> > > >> [20,] 328.87785
> > > >> [21,] 273.15577
> > > >> [22,] 252.52830
> > > >> [23,] 252.52830
> > > >> [24,] 252.52830
> > > >> [25,] 262.20084
> > > >> [26,] 314.93064
> > > >> [27,] 366.02996
> > > >> [28,] 442.77467
> > > >> [29,] 521.20323
> > > >> [30,] 465.33071
> > > >> [31,] 366.60582
> > > >> [32,]  13.69540
> > > >> so the numbers start from 13 and go up to a maximum of 522
> > > >> (I also have many other similar sets). I want to put these numbers
> > > >> into 5 categories, and thus I have tried cut:
> > > >> 
> > > >> 
> > > >> Browse[2]> test <- cut(binDistance, seq(min(binDistance)-0.00001, max(binDistance), length.out=scaleLength+1))
> > > >> Browse[2]> test
> > > >>  [1] (217,318]  (115,217]  (13.7,115] (115,217]  (217,318]  (318,420]
> > > >>  [7] (217,318]  (115,217]  (115,217]  (217,318]  (318,420]  (420,521]
> > > >> [13] (420,521]  (318,420]  (217,318]  (115,217]  (115,217]  (115,217]
> > > >> [19] (217,318]  (318,420]  (217,318]  (217,318]  (217,318]  (217,318]
> > > >> [25] (217,318]  (217,318]  (318,420]  (420,521]  (420,521]  (420,521]
> > > >> [31] (318,420]  (13.7,115]
> > > >> Levels: (13.7,115] (115,217] (217,318] (318,420] (420,521]
> > > >> 
> > > >> 
> > > >> I then want the numbers of my initial vector that fall within the
> > > >> same "category", let's say (318,420], to be collected in a vector.
> > > >> To rephrase: I want the indexes of my initial vector that have a value
> > > >> between 318 and 420 to be put in the same vector, which I can then
> > > >> process as I want.
> > > >> How can I do that effectively in R?
> > > >> Thank you in advance for your reply.
> > > >> Regards
> > > >> Alex
> > > >> 
> > > >>         [[alternative HTML version deleted]]
> > > >> 
> > > >> ______________________________________________
> > > >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > > >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > >> and provide commented, minimal, self-contained, reproducible code.
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 

  

Rolf Turner

2015-Nov-04 22:20 UTC


[R] [FORGED] Re: merging-binning data

I have been vaguely following this thread and have become very confused 
given the complications that seem to have appeared.

The original question was:
>>>>> On Tue, Nov 3, 2015 at 10:47 AM, Alaios via R-help <r-help at r-project.org> wrote:
>>>>>> Dear all, I am not exactly sure on what is the proper name of what I am trying to do.
>>>>>> I have a vector that looks like
Actually you appear to have a 32 x 1 *matrix* (NOT the same thing!) that 
looks like:
>>>>>>    binDistance
>>>>>>              [,1]
>>>>>>    [1,] 238.95162
>>>>>>    [2,] 143.08590
>>>>>>    [3,]  88.50923
>>>>>>    [4,] 177.67884
>>>>>>    [5,] 277.54116
>>>>>>    [6,] 342.94689
>>>>>>    [7,] 241.60905
>>>>>>    [8,] 177.81969
>>>>>>    [9,] 211.25559
>>>>>> [10,] 279.72702
>>>>>> [11,] 381.95738
>>>>>> [12,] 483.76363
>>>>>> [13,] 480.98841
>>>>>> [14,] 369.75241
>>>>>> [15,] 267.73650
>>>>>> [16,] 138.55959
>>>>>> [17,] 137.93181
>>>>>> [18,] 184.75200
>>>>>> [19,] 254.64359
>>>>>> [20,] 328.87785
>>>>>> [21,] 273.15577
>>>>>> [22,] 252.52830
>>>>>> [23,] 252.52830
>>>>>> [24,] 252.52830
>>>>>> [25,] 262.20084
>>>>>> [26,] 314.93064
>>>>>> [27,] 366.02996
>>>>>> [28,] 442.77467
>>>>>> [29,] 521.20323
>>>>>> [30,] 465.33071
>>>>>> [31,] 366.60582
>>>>>> [32,]  13.69540
A later addendum to the question indicated that the OP wanted labels for 
the result consisting of the endpoints of the intervals into which the 
data were subdivided.  Unless I am misunderstanding, this is trivial to 
accomplish using cut() and split():

x <- c(238.95162, 143.0859, 88.50923, 177.67884, 277.54116, 342.94689,
241.60905, 177.81969, 211.25559, 279.72702, 381.95738, 483.76363,
480.98841, 369.75241, 267.7365, 138.55959, 137.93181, 184.752,
254.64359, 328.87785, 273.15577, 252.5283, 252.5283, 252.5283,
262.20084, 314.93064, 366.02996, 442.77467, 521.20323, 465.33071,
366.60582, 13.6954)

f <- cut(x,5)

y <- split(x,f)

y

$`(13.2,115]`
[1] 88.50923 13.69540

$`(115,217]`
[1] 143.0859 177.6788 177.8197 211.2556 138.5596 137.9318 184.7520

$`(217,318]`
 [1] 238.9516 277.5412 241.6090 279.7270 267.7365 254.6436 273.1558 252.5283
 [9] 252.5283 252.5283 262.2008 314.9306

$`(318,420]`
[1] 342.9469 381.9574 369.7524 328.8779 366.0300 366.6058

$`(420,522]`
[1] 483.7636 480.9884 442.7747 521.2032 465.3307


Is this not the result that you want?  If not, what *is* the result that 
you want?

cheers,

Rolf Turner

-- 
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
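
Since the original question also asked for the indexes of the values in each interval, here is a small sketch under the same cut(x, 5) setup: splitting seq_along(x) instead of x yields the positions, and the list names carry the interval labels. The name idx is illustrative only.

```r
x <- c(238.95162, 143.0859, 88.50923, 177.67884, 277.54116, 342.94689,
       241.60905, 177.81969, 211.25559, 279.72702, 381.95738, 483.76363,
       480.98841, 369.75241, 267.7365, 138.55959, 137.93181, 184.752,
       254.64359, 328.87785, 273.15577, 252.5283, 252.5283, 252.5283,
       262.20084, 314.93064, 366.02996, 442.77467, 521.20323, 465.33071,
       366.60582, 13.6954)

f <- cut(x, 5)                  # factor whose levels are the interval labels
idx <- split(seq_along(x), f)   # positions in x, grouped by interval

names(idx)      # the interval labels, usable for plotting
idx[[4]]        # positions of the values in the fourth interval
x[idx[[4]]]     # the values themselves
```

Indexing x with a group's positions gives back the same values that split(x, f) returns for that group.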

Alaios

2015-Nov-05 09:49 UTC


[R] [FORGED] Re: merging-binning data

Thanks. That is what I want. It is more that I do not know how to read the factors
that these two functions return:
Browse[1]> y
$`13.6954016405008`
[1] (13.2,115]
Levels: (13.2,115] (115,217] (217,318] (318,420] (420,522]

$`88.5092280867206`
[1] (13.2,115]
Levels: (13.2,115] (115,217] (217,318] (318,420] (420,522]

$`137.931810364616`
[1] (115,217]
Levels: (13.2,115] (115,217] (217,318] (318,420] (420,522]

str(y)
List of 30
 $ 13.6954016405008: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 1
 $ 88.5092280867206: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 1
 $ 137.931810364616: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 2
 $ 138.559590072838: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 2
 $ 143.085897171535: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 2
 $ 177.678839068735: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 2
 $ 177.819693807561: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 2
 $ 184.752000138622: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 2
 $ 211.255591076421: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 2
 $ 238.951618624679: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 3
 $ 241.609050762905: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 3
 $ 252.528297510773: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 3 3 3
 $ 254.643586371518: Factor w/ 5 levels "(13.2,115]","(115,217]",..: 3

I need to be able to keep the items within their groups and at the same time
keep the label of each group, so that I can use it for plotting purposes.
How can I do that?
Regards
Alex
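
The list printed above, keyed by the data values and holding factor entries, looks consistent with the arguments to split() having been given in the order split(f, x) rather than split(x, f); this is a guess from the output, not something confirmed in the thread. The sketch below contrasts the two calls and shows one possible way to keep values and labels together in a data frame. The names v, byValue, byGroup and d are illustrative only.

```r
x <- c(238.95162, 143.0859, 88.50923, 177.67884, 277.54116, 342.94689,
       241.60905, 177.81969, 211.25559, 279.72702, 381.95738, 483.76363,
       480.98841, 369.75241, 267.7365, 138.55959, 137.93181, 184.752,
       254.64359, 328.87785, 273.15577, 252.5283, 252.5283, 252.5283,
       262.20084, 314.93064, 366.02996, 442.77467, 521.20323, 465.33071,
       366.60582, 13.6954)

binDistance <- matrix(x, ncol = 1)  # a 32 x 1 matrix, as in the original post
v <- as.vector(binDistance)         # drop the matrix structure
f <- cut(v, 5)

byValue <- split(f, v)  # groups the labels by value: one entry per distinct value
byGroup <- split(v, f)  # groups the values by label: one entry per interval

length(byValue)   # number of distinct values
names(byGroup)    # the interval labels

# Keep each value next to its group label
d <- data.frame(value = v, group = f)
head(d)
```

With the data frame in hand, a call such as boxplot(value ~ group, data = d) should label each box with its interval.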
 


