Hi all,

I have a fish telemetry data set with more than 11 million lines. It contains some false records that I have to eliminate. I can solve this with a loop, but I think that dplyr::filter could be faster and more elegant. I just can't figure out how to do it.

So far I have summarized the raw data into something like this (dput at the end of this e-mail; the Action column was inserted manually to show what is needed):

Date        Station  Antenna  Mean_power  N_records  Action
29/03/2019  ANT01    1        108         1704       Remove
29/03/2019  ANT01    2         94         1219       Remove
29/03/2019  ANT02    1        220         3029       Keep
29/03/2019  ANT02    2        219         2711       Keep
30/03/2019  ANT01    1        204         2289       Keep
30/03/2019  ANT01    2        172         1477       Keep
30/03/2019  ANT02    1         88          913       Remove
30/03/2019  ANT02    2         72         1080       Remove
30/03/2019  ETE01    AH0       87            1       Keep

The problem occurs between stations ANT01 and ANT02. Within the same day, I have to keep the pair of records that has the larger Mean_power and more N_records. In this example, I have to keep the records from station ANT02 on 29/03 and from ANT01 and ETE01 on 30/03. If I had only ANT01 and ANT02 on the same day, this would be a simple problem.

I have to do this for each marked fish, which is identified by a Code that I omitted here for brevity.

Thanks in advance,

Raoni

structure(list(
  Date = structure(c(17984, 17984, 17984, 17984, 17985, 17985, 17985, 17985, 17985), class = "Date"),
  Station = c("ANT01", "ANT01", "ANT02", "ANT02", "ANT01", "ANT01", "ANT02", "ANT02", "ETE01"),
  Antenna = c("1", "2", "1", "2", "1", "2", "1", "2", "AH0"),
  Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
  N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L, 1L)),
  row.names = c(NA, -9L),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  groups = structure(list(
    Date = structure(c(17984, 17984, 17985, 17985, 17985), class = "Date"),
    Station = c("ANT01", "ANT02", "ANT01", "ANT02", "ETE01"),
    .rows = list(1:2, 3:4, 5:6, 7:8, 9L)),
    row.names = c(NA, -5L),
    class = c("tbl_df", "tbl", "data.frame"),
    .drop = TRUE))

--
Raoni Rosa Rodrigues
Research Associate of Fish Transposition Center CTPeixes
Universidade Federal de Minas Gerais - UFMG
Brasil
rodrigues.raoni at gmail.com

Thanks for the nice dput example, but your specification confuses me.

What if the 2 records with largest Mean_power are not the same as the two with largest N_records? Do you want to keep all four records? Or various combinations of this question that would keep 3 records. And will you always have two records on a date, or could you have just one? And if the 2 records with largest Mean_power always also have the largest N_records, then you only need to choose the two with largest Mean_power and can ignore the N_records, right?

Once you have answered these questions -- or someone else has a better understanding than I -- it should be easy. It will require a loop of one form or another, however, and therefore might take a while.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
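
Bert's question about whether the two criteria can disagree can be checked directly on the summarized data. The following is only a sketch in base R, not part of any reply in this thread. It assumes the posted sample rebuilt as a plain data frame named test (with the column name Media_power from the dput), and it compares, for each date, which ANT station holds the single record with the highest power against which holds the single record with the most detections. The names ant, by_day, best_power, best_records and agreement are made up for illustration.

```r
# Sample data from the original post, rebuilt as a plain data frame.
test <- data.frame(
  Date = as.Date(rep(c("2019-03-29", "2019-03-30"), times = c(4, 5))),
  Station = c("ANT01", "ANT01", "ANT02", "ANT02",
              "ANT01", "ANT01", "ANT02", "ANT02", "ETE01"),
  Antenna = c("1", "2", "1", "2", "1", "2", "1", "2", "AH0"),
  Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
  N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L, 1L),
  stringsAsFactors = FALSE
)

# Only the ANT stations compete with each other.
ant <- test[test$Station %in% c("ANT01", "ANT02"), ]

# One data frame per date.
by_day <- split(ant, ant$Date)

# Station holding the record with the highest power / most detections per date.
best_power <- vapply(by_day, function(d) d$Station[which.max(d$Media_power)],
                     character(1))
best_records <- vapply(by_day, function(d) d$Station[which.max(d$N_records)],
                       character(1))

# TRUE in 'agree' means both criteria point to the same station on that date.
agreement <- data.frame(
  Date = names(by_day),
  best_power = unname(best_power),
  best_records = unname(best_records),
  agree = unname(best_power == best_records),
  stringsAsFactors = FALSE
)
print(agreement)
```

Any date where agree is FALSE would need a tie-breaking rule before filtering. In the posted sample both dates agree.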

Hi.

Bert's questions should be clarified. But from your question I understand that ANT01 and ANT02 are the only stations you want to filter, and that you want to keep all others regardless of the condition.

If this is true, I would add a new column that has one value for the ANT stations and a different one for all others (if you have more than one). Then you could set a flag marking the largest value on each day. After that, you could add back, for each day, the stations other than ANT that you want to keep.

I named your data test and changed it to a data frame, as I am not familiar with tibbles. The code looks like this:

# mean N_records for each Date/Station combination
test$m <- ave(test$N_records, interaction(test$Date, test$Station), FUN=mean)
# 1 where the Date/Station mean is the largest of its day, 0 otherwise
test$flag <- ave(test$m, test$Date, FUN=function(x) max(x) == x)
# also keep the non-ANT station
test$keep <- test$flag + (test$Station == "ETE01")*1

But you need to think about the questions asked by Bert.

Cheers
Petr
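
The flagging idea in Petr's message can be written as a self-contained script. This is a hedged variation, not Petr's exact code. It assumes that every station other than ANT01 and ANT02 is always kept, and it compares the Date/Station means only among the ANT rows, so that a busy non-ANT station cannot outrank them. The names ant_stations and is_ant are introduced here for illustration.

```r
# Sample data from the original post, rebuilt as a plain data frame.
test <- data.frame(
  Date = as.Date(rep(c("2019-03-29", "2019-03-30"), times = c(4, 5))),
  Station = c("ANT01", "ANT01", "ANT02", "ANT02",
              "ANT01", "ANT01", "ANT02", "ANT02", "ETE01"),
  Antenna = c("1", "2", "1", "2", "1", "2", "1", "2", "AH0"),
  Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
  N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: only these two stations compete; all others are always kept.
ant_stations <- c("ANT01", "ANT02")
is_ant <- test$Station %in% ant_stations

# Mean N_records for each Date/Station combination.
test$m <- ave(as.numeric(test$N_records), test$Date, test$Station, FUN = mean)

# Among the ANT rows only, flag the station with the largest mean on each day.
# ave() returns 1/0 here, so compare with 1 to get a logical vector back.
test$flag <- FALSE
test$flag[is_ant] <- ave(test$m[is_ant], test$Date[is_ant],
                         FUN = function(x) x == max(x)) == 1

# Keep the winning ANT station plus every non-ANT station.
test$keep <- test$flag | !is_ant

print(test[, c("Date", "Station", "Antenna", "keep")])
```

On the posted sample the keep column reproduces the manually inserted Keep/Remove actions. If two ANT stations tie on a day, both are flagged.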

Hi Bert,

Thanks for your reply, and sorry for not being clear. Let's try:

> What if the 2 records with largest Mean_power are not the same as the two
> with largest N_records. Do you want to keep all four records?

In the sample data that I used to understand what is going on, this never happened. But if it does, I should ignore N_records and use just Mean_power.

> Or various combinations of this question that would keep 3 records.

No. At this moment I just need to keep one record. Maybe in the future I will need to filter the raw data, but for now I just need to have one record in ANT01 OR ANT02 per day.

> And will you always have two records on a date, or could you have just one?

Yes, I can have just one record. It will probably be from the ANT station that has the lower Mean_power.

> And if the 2 records with largest Mean_power always also have the largest
> N_records, then you only need to choose the two with largest Mean_power and
> can ignore the N_records, right?

Right, exactly that!

Thanks for your attention and help!

Raoni
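
With the rule clarified (decide by power only, keep every station other than ANT01 and ANT02), a dplyr version of the filter might look like the sketch below. It is an illustration under stated assumptions, not a tested solution for the full 11-million-line data set. It uses the dput column name Media_power, it keeps all antenna rows of the winning ANT station (which matches the Keep/Remove column of the original post), and it ignores the fish identifier. For the real data, the fish Code column, whose actual name is not shown in the thread, would be added to both group_by() calls. The helper columns station_power and is_ant are made up here.

```r
library(dplyr)

# Sample data from the original post, rebuilt as a plain data frame.
test <- data.frame(
  Date = as.Date(rep(c("2019-03-29", "2019-03-30"), times = c(4, 5))),
  Station = c("ANT01", "ANT01", "ANT02", "ANT02",
              "ANT01", "ANT01", "ANT02", "ANT02", "ETE01"),
  Antenna = c("1", "2", "1", "2", "1", "2", "1", "2", "AH0"),
  Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
  N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: only these two stations compete; all others are always kept.
ant_stations <- c("ANT01", "ANT02")

kept <- test %>%
  # Mean power of each station on each day (averaged over its antennas).
  group_by(Date, Station) %>%
  mutate(station_power = mean(Media_power)) %>%
  # Regroup so that ANT rows are compared only with other ANT rows of that day.
  group_by(Date, is_ant = Station %in% ant_stations) %>%
  # Non-ANT rows always pass; ANT rows pass only for the strongest station.
  filter(!is_ant | station_power == max(station_power)) %>%
  ungroup() %>%
  select(-station_power, -is_ant)

print(kept)
```

Grouping by is_ant avoids calling max() on an empty vector on days without any ANT record. If both ANT stations have exactly the same mean power on a day, both are kept, so a tie-breaker would still be needed in that case.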