Sorry I missed the boat the first time, and while it looks like Peter is
getting closer, I suspect that solution is not quite there either, given
the requirement that T2 runs be considered separately from T3 runs.
Here is another stab at it:
library(dplyr)
# first approach is broken apart to show the progression of the innards
resultStep1 <- ( teste
  %>% group_by( ID )
  %>% mutate( Group = as.character( Group )
            , transitionT2 = diff( c( FALSE, "T2" == Group ) )
            , transitionT3 = diff( c( FALSE, "T3" == Group ) )
            , groupseqT2 = cumsum( abs( transitionT2 ) )
            , groupseqT3 = cumsum( abs( transitionT3 ) )
            , isT2 = 1 == groupseqT2
            , isT3 = 1 == groupseqT3
            )
  %>% as.data.frame
)
resultStep1
# notice how the groupseq columns number the groups of consecutive similar
# values, and you are only interested in the groups numbered 1.
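# To see the run-numbering mechanics in isolation, here is the same
# diff/cumsum idiom applied to a toy character vector (separate from
# teste, just for illustration):

```r
# toy vector of group labels, to show the run-numbering idiom by itself
x <- c( "T2", "T2", "T3", "T3", "T2" )
transition <- diff( c( FALSE, "T2" == x ) )  # +1 entering a T2 run, -1 leaving one
groupseq <- cumsum( abs( transition ) )      # bumps by 1 at every run boundary
groupseq                                     # 1 1 2 2 3
"T2" == x & 1 == groupseq                    # TRUE TRUE FALSE FALSE FALSE
```

# The first T2 run is exactly the positions that are T2 AND have run
# number 1; the second T2 run gets number 3, so it drops out.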
# more compactly
result <- ( teste
  %>% group_by( ID )
  %>% mutate( Group = as.character( Group )
            , keep = ( 1 == cumsum( abs( diff( c( FALSE, "T2" == Group ) ) ) )
                     | 1 == cumsum( abs( diff( c( FALSE, "T3" == Group ) ) ) )
                     )
            )
  %>% filter( keep )
  %>% select( -keep )
  %>% as.data.frame
)
#####> resultStep1
ID Group Var transitionT2 transitionT3 groupseqT2 groupseqT3 isT2 isT3
1 3 T2 0.32 1 0 1 0 TRUE FALSE
2 4 T3 1.59 0 1 0 1 FALSE TRUE
3 1 T2 2.94 1 0 1 0 TRUE FALSE
4 1 T2 3.23 0 0 1 0 TRUE FALSE
5 1 T2 1.40 0 0 1 0 TRUE FALSE
6 1 T2 1.62 0 0 1 0 TRUE FALSE
7 1 T2 2.43 0 0 1 0 TRUE FALSE
8 1 T2 2.53 0 0 1 0 TRUE FALSE
9 1 T2 2.25 0 0 1 0 TRUE FALSE
10 1 T3 1.66 -1 1 2 1 FALSE TRUE
11 1 T3 2.86 0 0 2 1 FALSE TRUE
12 1 T3 0.53 0 0 2 1 FALSE TRUE
13 1 T3 1.66 0 0 2 1 FALSE TRUE
14 1 T3 3.24 0 0 2 1 FALSE TRUE
15 1 T3 1.34 0 0 2 1 FALSE TRUE
16 1 T2 1.86 1 -1 3 2 FALSE FALSE
17 1 T2 3.03 0 0 3 2 FALSE FALSE
18 1 T3 3.63 -1 1 4 3 FALSE FALSE
19 1 T3 2.78 0 0 4 3 FALSE FALSE
20 1 T3 1.49 0 0 4 3 FALSE FALSE
21 2 T2 2.00 1 0 1 0 TRUE FALSE
22 2 T2 2.39 0 0 1 0 TRUE FALSE
23 2 T2 1.65 0 0 1 0 TRUE FALSE
24 2 T2 2.05 0 0 1 0 TRUE FALSE
25 2 T2 2.75 0 0 1 0 TRUE FALSE
26 2 T2 2.23 0 0 1 0 TRUE FALSE
27 2 T2 1.39 0 0 1 0 TRUE FALSE
28 2 T2 2.66 0 0 1 0 TRUE FALSE
29 2 T2 1.05 0 0 1 0 TRUE FALSE
30 2 T3 2.52 -1 1 2 1 FALSE TRUE
31 2 T2 2.49 1 -1 3 2 FALSE FALSE
32 2 T2 2.97 0 0 3 2 FALSE FALSE
33 2 T2 0.43 0 0 3 2 FALSE FALSE
34 2 T2 1.36 0 0 3 2 FALSE FALSE
35 2 T3 0.79 -1 1 4 3 FALSE FALSE
36 2 T3 1.71 0 0 4 3 FALSE FALSE
37 2 T3 1.95 0 0 4 3 FALSE FALSE
38 2 T2 2.73 1 -1 5 4 FALSE FALSE
39 2 T2 2.73 0 0 5 4 FALSE FALSE
40 2 T2 2.39 0 0 5 4 FALSE FALSE
41 2 T2 2.17 0 0 5 4 FALSE FALSE
42 2 T2 2.34 0 0 5 4 FALSE FALSE
43 2 T3 2.42 -1 1 6 5 FALSE FALSE
44 2 T3 1.75 0 0 6 5 FALSE FALSE
45 2 T3 0.66 0 0 6 5 FALSE FALSE
46 2 T3 1.64 0 0 6 5 FALSE FALSE
47 2 T2 0.24 1 -1 7 6 FALSE FALSE
48 2 T3 2.11 -1 1 8 7 FALSE FALSE
49 2 T3 2.11 0 0 8 7 FALSE FALSE
50 2 T3 1.18 0 0 8 7 FALSE FALSE
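For reference, the same first-run selection can also be built on base R's
rle(), which labels the maximal runs directly. This is just a sketch of
the idea on a toy vector -- firstRun is a made-up helper name, and it has
not been run against the full teste data:

```r
# firstRun() returns TRUE exactly on the first maximal run of `target` in `g`
firstRun <- function( g, target ) {
  r <- rle( as.character( g ) == target )
  # cumsum over the run values counts how many target-runs have started;
  # rep() expands that count back out to one value per observation
  runNumber <- rep( cumsum( r$values ), r$lengths )
  runNumber == 1 & as.character( g ) == target
}

g <- c( "T2", "T2", "T3", "T2", "T3" )
firstRun( g, "T2" )  # TRUE TRUE FALSE FALSE FALSE
firstRun( g, "T3" )  # FALSE FALSE TRUE FALSE FALSE
```

Applied per ID it would slot into the split()/unsplit() framework from
Peter's keep() function below.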
On Sun, 11 Oct 2015, peter dalgaard wrote:
> These situations where the desired results depend on the order of
> observations in a dataset do tend to get a little tricky (this is one
> kind of problem that is easier to handle in a SAS DATA step with its
> sequential processing paradigm). I think this will do it:
>
> keep <- function(d)
> with(d, {
> n <- length(Group)
> i <- c(TRUE,Group[-n] != Group[-1])
> unsplit(lapply(split(i,Group), cumsum), Group) == 1
> })
> kp <- unsplit(lapply(split(teste, teste$ID), keep), teste$ID)
> teste[kp,]
>
> I.e. keep() is a function applied to each ID-subset of the data frame,
> returning a logical vector of the observations that you want to keep.
>
> i is an indicator that an observation is the first in a sequence.
> Splitting by group and cumsum'ing gives 1 for the first sequence, 2 for
> the next, etc. The observations to keep are the ones for which this
> value is 1.
>
> -pd
>
>> On 10 Oct 2015, at 22:27 , Cacique Samurai <caciquesamurai at gmail.com> wrote:
>>
>> Hello Jeff!
>>
>> Thanks very much for your prompt reply, but this is not exactly what I
>> need. I need the first sequence of records. In the example that I sent,
>> I need the first seven lines of group "T2" in ID "1" (lines 3 to 9) and
>> six more lines of group "T3" in ID "1" (lines 10 to 15). I have to
>> discard lines 16 to 20, which represent repeated sequential records of
>> those groups in the same ID.
>>
>> For other IDs (I sent just a small piece of my data) I have many more
>> sequential lines of records of each group in each ID, and many
>> sequential records that should be discarded. In some cases, I have
>> just one record of a group in an ID.
>>
>> As I said, I tried to use a labeling variable that marks lines 3 to 9
>> as 1 (first sequence of T2 in ID 1), lines 10 to 15 as 1 (first
>> sequence of T3 in ID 1), lines 16 and 17 as 2 (second sequence of T2
>> in ID 1) and lines 18 to 20 as 2 (second sequence of T3 in ID 1), and
>> so on... Then it would be easy to take just the first sequence for
>> each ID. But the code that I wrote was a long, long loop that in the
>> end did not work as I wanted.
>>
>> Once more, thanks in advance for your attention and help,
>>
>> Raoni
>>
>> 2015-10-10 13:13 GMT-03:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>>> ?aggregate
>>>
>>> in base R. Make a short function that returns the first element of
>>> a vector and give that to aggregate.
>>>
>>> Or...
>>>
>>> library(dplyr)
>>> ( test %>% group_by( ID, Group ) %>% summarise( Var = first( Var ) ) %>% as.data.frame )
>>>
>>>
>>> On October 10, 2015 8:38:00 AM PDT, Cacique Samurai
>>> <caciquesamurai at gmail.com> wrote:
>>>> Hello R-Helpers!
>>>>
>>>> I have a data-frame as below (dput at the end of the mail) and need
>>>> to select just the first sequence of occurrence of each "Group" in
>>>> each "ID".
>>>>
>>>> For example, for ID "1" I have two sequential occurrences of T2 and
>>>> two sequential occurrences of T3:
>>>>
>>>>> teste[teste$ID == 1, ]
>>>> ID Group Var
>>>> 3 1 T2 2.94
>>>> 4 1 T2 3.23
>>>> 5 1 T2 1.40
>>>> 6 1 T2 1.62
>>>> 7 1 T2 2.43
>>>> 8 1 T2 2.53
>>>> 9 1 T2 2.25
>>>> 10 1 T3 1.66
>>>> 11 1 T3 2.86
>>>> 12 1 T3 0.53
>>>> 13 1 T3 1.66
>>>> 14 1 T3 3.24
>>>> 15 1 T3 1.34
>>>> 16 1 T2 1.86
>>>> 17 1 T2 3.03
>>>> 18 1 T3 3.63
>>>> 19 1 T3 2.78
>>>> 20 1 T3 1.49
>>>>
>>>> As output, I need just the first group of T2 and T3 for this ID,
>>>> like:
>>>>
>>>> ID Group Var
>>>> 3 1 T2 2.94
>>>> 4 1 T2 3.23
>>>> 5 1 T2 1.40
>>>> 6 1 T2 1.62
>>>> 7 1 T2 2.43
>>>> 8 1 T2 2.53
>>>> 9 1 T2 2.25
>>>> 10 1 T3 1.66
>>>> 11 1 T3 2.86
>>>> 12 1 T3 0.53
>>>> 13 1 T3 1.66
>>>> 14 1 T3 3.24
>>>> 15 1 T3 1.34
>>>>
>>>> For other IDs I have just one occurrence or one sequence of
>>>> occurrences of each Group.
>>>>
>>>> I tried to use a labeling variable, but cannot figure out how to do
>>>> this without many many loops..
>>>>
>>>> Thanks in advance,
>>>>
>>>> Raoni
>>>>
>>>> dput (teste)
>>>> structure(list(ID = structure(c(3L, 4L, 1L, 1L, 1L, 1L, 1L, 1L,
>>>> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
>>>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>>>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("1", "2",
>>>> "3", "4"), class = "factor"), Group = structure(c(1L, 2L, 1L,
>>>> 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
>>>> 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
>>>> 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L),
>>>> .Label = c("T2", "T3"), class = "factor"), Var = c(0.32, 1.59,
>>>> 2.94, 3.23, 1.4, 1.62, 2.43, 2.53, 2.25, 1.66, 2.86, 0.53, 1.66,
>>>> 3.24, 1.34, 1.86, 3.03, 3.63, 2.78, 1.49, 2, 2.39, 1.65, 2.05,
>>>> 2.75, 2.23, 1.39, 2.66, 1.05, 2.52, 2.49, 2.97, 0.43, 1.36, 0.79,
>>>> 1.71, 1.95, 2.73, 2.73, 2.39, 2.17, 2.34, 2.42, 1.75, 0.66, 1.64,
>>>> 0.24, 2.11, 2.11, 1.18)), .Names = c("ID", "Group", "Var"),
>>>> row.names = c(NA, 50L), class = "data.frame")
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Raoni Rosa Rodrigues
>> Research Associate of Fish Transposition Center CTPeixes
>> Universidade Federal de Minas Gerais - UFMG
>> Brasil
>> rodrigues.raoni at gmail.com
>>
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>      Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#..       Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------