thr3ads.net - R help - [R] Tricky filtering [Oct 2019]

Cacique Samurai

2019-Oct-31 02:54 UTC

[R] Tricky filtering

Hi all,

I have a fish telemetry dataset with more than 11 million lines. It contains
some false records that I have to eliminate. I can solve this using a loop,
but I think dplyr::filter could be faster and more elegant. I just can't
figure out how to do it.

So far I have summarized the raw data into something like this (dput at the
end of this e-mail):

Date        Station  Antenna  Mean_power  N_records  Action needed (manually inserted)
29/03/2019  ANT01    1        108         1704       Remove
29/03/2019  ANT01    2        94          1219       Remove
29/03/2019  ANT02    1        220         3029       Keep
29/03/2019  ANT02    2        219         2711       Keep
30/03/2019  ANT01    1        204         2289       Keep
30/03/2019  ANT01    2        172         1477       Keep
30/03/2019  ANT02    1        88          913        Remove
30/03/2019  ANT02    2        72          1080       Remove
30/03/2019  ETE01    AH0      87          1          Keep

The problem occurs between stations ANT01 and ANT02. Within the same day, I
have to keep the pair of records that has the higher Mean_power and more
N_records. In this example, I have to keep the records from station ANT02 on
29/03 and from ANT01 and ETE01 on 30/03. If I never had more than ANT01 and
ANT02 on the same day, it would be a simple question.

I have to do this for each marked fish, which is identified by a Code,
suppressed here for brevity.

Thanks in advance,

Raoni


structure(list(
  Date = structure(c(17984, 17984, 17984, 17984, 17985, 17985, 17985,
                     17985, 17985), class = "Date"),
  Station = c("ANT01", "ANT01", "ANT02", "ANT02", "ANT01", "ANT01",
              "ANT02", "ANT02", "ETE01"),
  Antenna = c("1", "2", "1", "2", "1", "2", "1", "2", "AH0"),
  Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
  N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L, 1L)),
  row.names = c(NA, -9L),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  groups = structure(list(
    Date = structure(c(17984, 17984, 17985, 17985, 17985), class = "Date"),
    Station = c("ANT01", "ANT02", "ANT01", "ANT02", "ETE01"),
    .rows = list(1:2, 3:4, 5:6, 7:8, 9L)),
    row.names = c(NA, -5L),
    class = c("tbl_df", "tbl", "data.frame"),
    .drop = TRUE))

--
Raoni Rosa Rodrigues
Research Associate of Fish Transposition Center CTPeixes
Universidade Federal de Minas Gerais - UFMG
Brasil
rodrigues.raoni at gmail.com


Bert Gunter

2019-Oct-31 04:17 UTC

[R] Tricky filtering

Thanks for the nice dput example, but your specification confuses me.
What if the 2 records with the largest Mean_power are not the same as the two
with the largest N_records? Do you want to keep all four records? Or some
combination that would keep 3 records? And will you always have two records
on a date, or could you have just one? And if the 2 records with the largest
Mean_power always also have the largest N_records, then you only need to
choose the two with the largest Mean_power and can ignore N_records, right?

Once you have answered these questions -- or someone else has a better
understanding than I -- it should be easy. It will require a loop of one
form or another, however, and therefore might take a while.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



PIKAL Petr

2019-Oct-31 07:29 UTC

[R] Tricky filtering

Hi.

Bert's questions should be clarified. But from your question I understand
that ANT01 and ANT02 are the only stations you want to filter, and that you
want to keep all others regardless of the condition. If this is true, I would
add a new column with one value for the ANT stations and a different value
for all others (if you have more than one). Then you could set a flag marking
the largest value in each day. After that you could add back, for each day,
the non-ANT stations that you want to keep.

I named your data test and changed it to a data frame, as I am not familiar
with tibbles.

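The code is like this:

# mean N_records for each Date x Station combination
test$m <- ave(test$N_records, interaction(test$Date, test$Station), FUN = mean)
# 1 where the station has the largest mean on that day, 0 otherwise
test$flag <- ave(test$m, test$Date, FUN = function(x) max(x) == x)
# also keep ETE01, which is not subject to the filter
test$keep <- test$flag + (test$Station == "ETE01") * 1

But you need to think about the questions asked by Bert.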

Cheers
Petr
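
For reference, a minimal self-contained sketch of the steps described in this message, in base R. It rebuilds the sample data from the original post as a plain data frame named test and then keeps the rows whose keep value is greater than zero. The final subsetting step and the name kept are assumptions added for illustration; they are not part of the posted code.

```r
# Sample data from the original post, rebuilt as a plain data frame.
test <- data.frame(
  Date = as.Date(c(17984, 17984, 17984, 17984, 17985, 17985, 17985, 17985, 17985),
                 origin = "1970-01-01"),
  Station = c("ANT01", "ANT01", "ANT02", "ANT02", "ANT01", "ANT01",
              "ANT02", "ANT02", "ETE01"),
  Antenna = c("1", "2", "1", "2", "1", "2", "1", "2", "AH0"),
  Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
  N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L, 1L),
  stringsAsFactors = FALSE
)

# Mean N_records for each Date x Station combination.
test$m <- ave(test$N_records, interaction(test$Date, test$Station), FUN = mean)

# 1 where the station has the largest mean on that day, 0 otherwise.
test$flag <- ave(test$m, test$Date, FUN = function(x) max(x) == x)

# ETE01 is not subject to the filter, so it is always kept.
test$keep <- test$flag + (test$Station == "ETE01") * 1

# Assumed final step: drop the rows that were not flagged.
kept <- test[test$keep > 0, c("Date", "Station", "Antenna", "Media_power", "N_records")]
print(kept)
```

On the sample data this keeps ANT02 on 29/03 and ANT01 plus ETE01 on 30/03, matching the manually inserted Keep/Remove column. Note that this version ranks stations by N_records, not by power.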

Cacique Samurai

2019-Oct-31 09:23 UTC

[R] Tricky filtering

Hi Bert, thanks for your reply, and sorry for not being clearer. Let's try:

> What if the 2 records with largest Mean_power are not the same as the two
> with largest N_records. Do you want to keep all four records?

In the sample data that I used to understand what is going on, this never
happened. But if it does, I should ignore N_records and use just Mean_power.

> Or various combinations of this question that would keep 3 records.

No. At this moment, I just need to keep one record. Maybe in the future I
will need to filter the raw data, but for now I just need to have one record,
in ANT01 OR ANT02, per day.

> And will you always have two records on a date, or could you have just one?

Yes, I can have just one record. It will probably be at the ANT station that
has the lower Mean_power.

> And if the 2 records with largest Mean_power always also have the largest
> N_records, then you only need to choose the two with largest Mean_power and
> can ignore the N_records, right?

Right, exactly that!

Thanks for your attention and help!

Raoni
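
One possible dplyr sketch of the rule as clarified in this message: within each day, compare only ANT01 and ANT02, keep the one with the higher power, and keep every other station unconditionally. It assumes that a station's power on a given day can be summarized as the mean of Media_power (the column name used in the dput) over its antennas, which the thread does not confirm. The names fish, ant_stations, is_ant and station_power are hypothetical.

```r
library(dplyr)

# Sample data from the original post, rebuilt as a plain data frame.
fish <- data.frame(
  Date = as.Date(c(17984, 17984, 17984, 17984, 17985, 17985, 17985, 17985, 17985),
                 origin = "1970-01-01"),
  Station = c("ANT01", "ANT01", "ANT02", "ANT02", "ANT01", "ANT01",
              "ANT02", "ANT02", "ETE01"),
  Antenna = c("1", "2", "1", "2", "1", "2", "1", "2", "AH0"),
  Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
  N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: only these two stations compete with each other.
ant_stations <- c("ANT01", "ANT02")

filtered <- fish %>%
  ungroup() %>%
  mutate(is_ant = Station %in% ant_stations) %>%
  # Assumed summary: mean power over all antennas of a station on a given day.
  group_by(Date, Station) %>%
  mutate(station_power = mean(Media_power)) %>%
  # Within each day, keep non-ANT stations and the strongest ANT station.
  # The extra -Inf avoids a warning on days that have no ANT station at all.
  group_by(Date) %>%
  filter(!is_ant | station_power == max(station_power[is_ant], -Inf)) %>%
  ungroup() %>%
  select(-is_ant, -station_power)

print(filtered)
```

For the full data, adding the fish identifier (the Code column mentioned in the original post, which is not part of the sample) to both group_by() calls should apply the same rule per fish. If two ANT stations tie on power, this sketch keeps both.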


