Jim Lemon
2018-Sep-27 22:35 UTC
[R] subset only if f.e a column is successive for more than 3 values
Hi Knut,
As Bert said, you can start with diff and work from there. I can
easily get the text for the subset, but despite fooling around with
"parse", "eval" and "expression", I couldn't get it to work:
# use a bigger subset to test whether multiple runs can be extracted
kkdf<-subset(airquality,Temp > 77,select=c("Ozone","Temp"))
kkdf$index<-as.numeric(rownames(kkdf))
# get the run length encoding
seqindx<-rle(diff(kkdf$index)==1)
# get a logical vector flagging the runs of at least 3 successive steps of 1
runsel<-seqindx$lengths >= 3 & seqindx$values
# get the indices for the starts of the runs
starts<-cumsum(seqindx$lengths)[runsel[-1]]+1
# and the ends
ends<-cumsum(seqindx$lengths)[runsel]+1
# the character representation of the subset as indices is
paste0("c(",paste(starts,ends,sep=":",collapse=","),")")
I expect there will be a lightning response from someone who knows
about converting the resulting string into whatever is needed.
Jim
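The rle(diff(...)==1) step is the core of the approach, so a small illustration may help. This is only a sketch on a made-up vector x (not data from the thread); it shows what the run-length encoding returns and how cumsum() of the lengths gives the last diff position of each run:

```r
# Made-up vector: one run of four consecutive values, then a run of two
x <- c(3, 4, 5, 6, 10, 11, 20)

# TRUE wherever a value is exactly 1 larger than its predecessor
d <- diff(x) == 1

# Run-length encoding of the TRUE/FALSE steps
r <- rle(d)
r$lengths          # how long each run of identical steps is
r$values           # whether each run is a run of TRUE or of FALSE

# Last position (within d) of each run; adding 1 gives the position in x
run_end_in_x <- cumsum(r$lengths) + 1
run_end_in_x
```

A run of n TRUE steps covers n + 1 values of x, which is why the thread's threshold of lengths >= 3 selects runs of at least four successive values.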
On Fri, Sep 28, 2018 at 1:13 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:
> 1. I assume the values are integers, not floats/numerics (which would make
> it more complicated).
>
> 2. Strategy: Take differences (e.g. see ?diff) and look for >3 1's in a
> row.
>
> I don't have time to work out details, but perhaps that helps.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>
>
> On Thu, Sep 27, 2018 at 7:49 AM Knut Krueger <rhelp at krueger-family.de>
> wrote:
>
> > Hi to all
> >
> > I need a subset of rows only where, for example, 3 values in a
> > column of a data frame are successive.
> > Example from the subset help page:
> >
> > subset(airquality, Temp > 80, select = c(Ozone, Temp))
> >    Ozone Temp
> > 29    45   81
> > 35    NA   84
> > 36    NA   85
> > 38    29   82
> > 39    NA   87
> > 40    71   90
> > 41    39   87
> > 42    NA   93
> > 43    NA   92
> > 44    23   82
> > .....
> >
> > I would like to get only
> >
> > ...
> > 40    71   90
> > 41    39   87
> > 42    NA   93
> > 43    NA   92
> > 44    23   82
> > ....
> >
> > because the left column (the row number) ascends without a gap for
> > more than, for example, three rows.
> >
> > Any hints for a package, or do I need to build my own function?
> >
> > Kind Regards Knut
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
Jim Lemon
2018-Sep-27 22:43 UTC
[R] subset only if f.e a column is successive for more than 3 values
Bugger! It's
eval(parse(text=paste0("kkdf[c(",paste(starts,ends,sep=":",collapse=","),"),]")))
What a mess!
Jim
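The eval(parse()) step can probably be avoided, because the row numbers can be built directly from the start and end positions. A minimal sketch, assuming starts and ends are equal-length vectors of first and last row positions; the helper name rows_from_runs and the small data frame demo are made up for illustration:

```r
# Hypothetical helper: expand paired start/end positions into row numbers
rows_from_runs <- function(starts, ends) {
  stopifnot(length(starts) == length(ends))
  # seq() on each start/end pair, then flatten into one vector
  unlist(Map(seq, starts, ends))
}

# Made-up example data with two runs, at rows 2 to 5 and rows 8 to 10
demo <- data.frame(index = c(1, 5, 6, 7, 8, 20, 30, 41, 42, 43))
rows <- rows_from_runs(starts = c(2, 8), ends = c(5, 10))

# Ordinary indexing replaces the pasted string and eval(parse())
demo[rows, , drop = FALSE]
```

The length check is there because paste() silently recycles a shorter starts vector, whereas this version stops with an error.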
Knut Krueger
2018-Sep-28 15:08 UTC
[R] subset only if f.e a column is successive for more than 3 values
Hi Jim,
thanks, it works with the given example,
but what is different when using
testdata=data.frame(
  TIME=c("17:11:20", "17:11:21", "17:11:22", "17:11:23", "17:11:24",
         "17:11:25", "17:11:26", "17:11:27", "17:11:28", "17:21:43",
         "17:22:16", "17:22:19", "18:04:48", "18:04:49", "18:04:50",
         "18:04:51", "18:04:52", "19:50:09", "00:59:27", "00:59:28",
         "00:59:29", "04:13:40", "04:13:43", "04:13:44"),
  index=c(8960, 8961, 8962, 8963, 8964, 8965, 8966, 8967, 8968, 9583,
          9616, 9619, 12168, 12169, 12170, 12171, 12172, 18489, 37047,
          37048, 37049, 48700, 48701, 48702))
seqindx<-rle(diff(testdata$index)==1)
runsel<-seqindx$lengths >= 3 & seqindx$values
# get the indices for the starts of the runs
starts<-cumsum(seqindx$lengths)[runsel[-1]]+1
# and the ends
ends<-cumsum(seqindx$lengths)[runsel]+1
eval(parse(text=paste0("testdata[c(",paste(starts,ends,sep=":",collapse=","),"),]")))
The result (index) is
12168, 9619, 9616, 9583, 8968, 12168, 12169, 12170, 12171, 12172
Maybe it is caused by the gaps between ... 8967, 8968, 9583, 9616, 9619, 12168, 12169 ...?
Regards Knut
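A possible explanation, checked against the numbers above: in testdata the very first run (8960 to 8968) qualifies, so runsel[1] is TRUE. starts is built from runsel[-1], which drops that first element, so only one start (13) is found while ends holds two values (9 and 17). paste() then recycles the single start and produces 13:9 and 13:17, which gives exactly the reported order 12168, 9619, 9616, 9583, 8968, 12168, ... The gaps themselves do not seem to be the cause. A minimal sketch of one way around it, assuming the start of a run can be taken as its end minus its run length (this is a suggested change, not code from the thread):

```r
# Knut's index column from the message above
index <- c(8960, 8961, 8962, 8963, 8964, 8965, 8966, 8967, 8968, 9583,
           9616, 9619, 12168, 12169, 12170, 12171, 12172, 18489, 37047,
           37048, 37049, 48700, 48701, 48702)

seqindx <- rle(diff(index) == 1)
runsel <- seqindx$lengths >= 3 & seqindx$values

# Last position of each selected run (diff position + 1)
ends <- cumsum(seqindx$lengths)[runsel] + 1
# First position: step back by the run length; also valid for a run at row 1
starts <- ends - seqindx$lengths[runsel]

# Build the row numbers directly instead of pasting and parsing a string
rows <- unlist(Map(seq, starts, ends))
index[rows]
```

With these starts and ends, the selected rows are 1 to 9 and 13 to 17. The three-value runs (37047 to 37049 and 48700 to 48702) are left out because they contain only two steps of 1; using lengths >= 2 would include them, if that is what is wanted.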