thr3ads.net - R help - [R] aggregating data with quality control [Aug 2024]


Stefano Sofia

2024-Aug-31 11:15 UTC

[R] aggregating data with quality control

Dear R-list users,

I deal with semi-hourly data from automatic meteorological stations.

They have to pass a manual validation; suppose that status = "C"
stands for correct and status = "D" for discarded.

Here is a simple example with "Snow height" (HS):


mydf <- data.frame(data_POSIX = seq(
  as.POSIXct("2024-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT-1"),
  as.POSIXct("2024-01-02 23:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT-1"),
  by = "30 min"))

mydf$hs <- round(runif(96, 0, 100))

mydf$status <- c(rep("C", 50), "S", rep("C", 45))


Evaluating the daily mean independently of the status is very easy:

aggregate(mydf$hs,
          by = list(format(mydf$data_POSIX, "%Y"),
                    format(mydf$data_POSIX, "%m"),
                    format(mydf$data_POSIX, "%d")),
          my.mean)


Things become more complicated when I also need to export the status: this
should be "C" when all 48 values of the day have status equal to "C", and
"D" when at least one value has status = "D".


I have no clue on how to do that in an efficient way.

Could some of you give me some clues on how to do that?


Thank you for your usual support

Stefano Sofia


         (oo)
--oOO--( )--OOo--------------------------------------
Stefano Sofia PhD
Civil Protection - Marche Region - Italy
Meteo Section
Snow Section
Via del Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.sofia at regione.marche.it
---Oo---------oO----------------------------------------

________________________________

IMPORTANT NOTICE: This e-mail message is intended to be received only by persons
entitled to receive the confidential information it may contain. E-mail messages
to clients of Regione Marche may contain information that is confidential and
legally privileged. Please do not read, copy, forward, or store this message
unless you are an intended recipient of it. If you have received this message in
error, please forward it to the sender and delete it completely from your
computer system.


Ivan Krylov

2024-Aug-31 11:25 UTC


[R] aggregating data with quality control

On Sat, 31 Aug 2024 11:15:10 +0000,
Stefano Sofia <stefano.sofia at regione.marche.it> wrote:

> Evaluating the daily mean independently from the status is very easy:
>
> aggregate(mydf$hs, by=list(format(mydf$data_POSIX, "%Y"),
> format(mydf$data_POSIX, "%m"), format(mydf$data_POSIX, "%d")),
> my.mean)
>
> Things become more complicated when I need to export also the status:
> this should be "C" when all 48 data have status equal to "C", and
> status "D" when at least one value has status ="D".
>
> I have no clue on how to do that in an efficient way.

You can make the status into an ordered factor:

# come up with some statuses
status <- sample(c('C', 'D'), 42, TRUE, c(.9, .1))

# convert them into factors, specifying that D is "more than" C
status <- ordered(status, c('C', 'D'))

Since the factor is ordered and can be subject to comparison like
status[1] < status[2], you can now use max() on your groups. If the
sample contains any 'D's, max() will return a 'D', because it's larger
than any 'C's. If the sample contains only 'C's, that's the maximal
value by default.
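
A minimal sketch of how the ordered-factor idea could be applied per day.
It assumes the data frame layout from the original post; the helper column
`day`, the seed, and the single "D" placed in row 51 (on the second day) are
assumptions made purely for illustration.

```r
# Data in the shape of the original post; the single "D" in row 51
# (which falls on the second day) is an assumption for illustration.
mydf <- data.frame(
  data_POSIX = seq(as.POSIXct("2024-01-01 00:00:00", tz = "Etc/GMT-1"),
                   as.POSIXct("2024-01-02 23:30:00", tz = "Etc/GMT-1"),
                   by = "30 min")
)
set.seed(1)
mydf$hs <- round(runif(96, 0, 100))
mydf$status <- c(rep("C", 50), "D", rep("C", 45))

# Hypothetical helper column: one key per calendar day
mydf$day <- format(mydf$data_POSIX, "%Y-%m-%d")

# Ordered factor: "D" compares greater than "C", so max() returns "D"
# whenever at least one "D" is present in the group
mydf$status <- ordered(mydf$status, levels = c("C", "D"))

daily_mean   <- aggregate(hs ~ day, data = mydf, FUN = mean)
daily_status <- aggregate(status ~ day, data = mydf,
                          FUN = function(s) as.character(max(s)))

# One row per day with the mean and the aggregated status
daily <- merge(daily_mean, daily_status, by = "day")
print(daily)
```

Note that any status value not listed in `levels` (such as the "S" in the
original example) becomes NA when converted, so the set of levels would need
to cover every code that can occur in the real data.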

-- 
Best regards,
Ivan

Rui Barradas

2024-Aug-31 13:41 UTC


[R] aggregating data with quality control

Hello,

The aggregate.formula method has a subset argument that you can use to 
extract only the rows matching a condition. The condition below tells if 
there is any "D" and aggregates based on it.
I create a variable subset_condition in order to make the code more 
readable.

First data with no "D"


set.seed(2024)
mydf <- data.frame(data_POSIX = seq(
  as.POSIXct("2024-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT-1"),
  as.POSIXct("2024-01-02 23:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT-1"),
  by = "30 min"))
mydf$hs <- round(runif(96, 0, 100))
mydf$status <- c(rep("C", 50), "S", rep("C", 45))

my.mean <- function(x, na.rm = TRUE) mean(x, na.rm = na.rm)

aggregate(hs ~ format(mydf$data_POSIX, "%Y-%m-%d"), mydf, my.mean)
#>   format(mydf$data_POSIX, "%Y-%m-%d")       hs
#> 1                          2024-01-01 52.37500
#> 2                          2024-01-02 45.64583

subset_condition <- if(any(mydf$status == "D")) mydf$status == "D" else TRUE

aggregate(hs ~ format(mydf$data_POSIX, "%Y-%m-%d") + status, mydf, 
my.mean, subset = subset_condition)
#>   format(mydf$data_POSIX, "%Y-%m-%d") status       hs
#> 1                          2024-01-01      C 52.37500
#> 2                          2024-01-02      C 46.48936
#> 3                          2024-01-02      S  6.00000



Now data with "D"'s.


my.mean <- function(x, na.rm = TRUE) mean(x, na.rm = na.rm)

status_with_D <- sample(c('C', 'D'), 45, TRUE, c(.9, .1))
mydf$status <- c(rep("C", 50), "S", status_with_D)

subset_condition <- if(any(mydf$status == "D")) mydf$status == "D" else TRUE

aggregate(hs ~ format(data_POSIX, "%Y-%m-%d") + status, mydf, my.mean,
subset = subset_condition)
#>   format(data_POSIX, "%Y-%m-%d") status   hs
#> 1                     2024-01-02      D 51.2

# the formats in the OP but extracted from the date/time and used in the
# formula that follows.
year <- format(mydf$data_POSIX, "%Y")
month <- format(mydf$data_POSIX, "%m")
day <- format(mydf$data_POSIX, "%d")

aggregate(hs ~ year + month + day, mydf, my.mean)
#>   year month day       hs
#> 1 2024    01  01 52.37500
#> 2 2024    01  02 45.64583
aggregate(hs ~ year + month + day + status, mydf, my.mean, subset = 
subset_condition)
#>   year month day status   hs
#> 1 2024    01  02      D 51.2



Hope this helps,

Rui Barradas



@vi@e@gross m@iii@g oii gm@ii@com

2024-Aug-31 22:17 UTC


[R] aggregating data with quality control

Stefano,

I see you already have an answer that works for you.

Sometimes you want to step back and see if some modification makes a problem
easier to solve.

I often simply switch to using tools in the tidyverse such as dplyr for parts of
the job albeit much of the same can be done using functions built-in to R.

In your case, there are many possible solutions besides taking the max in some
way as in a factor column.

You seem to expect exactly 48 measurements. Currently you encode them as one of
two character strings but if this is really a binary choice, you could have used
a 0/1 or TRUE/FALSE column instead, or make one. This lets you do things like
take the sum and compare it to 48 to see if all are true, or to zero to check if
all are false. You could take the product to check if at least one is false or
use a negation for another perspective. If the number of rows may not be 48, you
can compare to a calculation of the actual number of rows in that subset.
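
A rough sketch of the TRUE/FALSE idea in base R. The vectors `status` and
`day` and the position of the single "D" are toy values assumed for
illustration; they are not taken from the original data.

```r
# Toy status vector for two days of 48 half-hourly values each; the single
# "D" in position 51 is an assumption made for illustration.
status <- c(rep("C", 50), "D", rep("C", 45))
day    <- rep(c("2024-01-01", "2024-01-02"), each = 48)

# Logical flag: TRUE where the value passed validation
ok <- status == "C"

n_ok   <- tapply(ok, day, sum)      # number of valid values per day
n_rows <- tapply(ok, day, length)   # actual number of rows per day

# A day is "C" only if every value is valid, otherwise "D"
daily_status <- ifelse(n_ok == n_rows, "C", "D")
print(daily_status)
```

Comparing against `n_rows` rather than a fixed 48 keeps the check valid for
days with missing records. `tapply(ok, day, all)` would express the same test
more directly.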

If your data were placed into wide format, say based on your hs field being
unique for each test site, similar ideas apply: take a subset of the columns
and apply row-wise functions such as rowSums().
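
One possible reading of the wide-format idea, sketched with rowSums(). The
layout used here (one row per day, one column per half-hour slot) is a
hypothetical choice, and the code assumes complete days stored in time order.

```r
# Toy data: 2 days x 48 half-hourly statuses; assumes complete days in
# time order. The single "D" in position 51 is an illustrative assumption.
status <- c(rep("C", 50), "D", rep("C", 45))

# One row per day, one column per half-hour slot (hypothetical wide layout)
status_wide <- matrix(status, nrow = 2, byrow = TRUE,
                      dimnames = list(c("2024-01-01", "2024-01-02"), NULL))

# Count discarded values in each row, then derive the daily status
n_discarded  <- rowSums(status_wide == "D")
daily_status <- ifelse(n_discarded > 0, "D", "C")
print(daily_status)
```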

Again, some things I commonly use in dplyr, such as group_by() and the way it
affects other operations including reports, make this a little different. Most
of it can be done with careful use of base R; the exceptions are areas where
dplyr supports more abstract ways to specify what you want, which your example
does not need.

Just FYI, you did not share what your function my.mean() is. 

I won't share the code unless you are interested, but it looks like part of
what you are doing is to bundle by a version of the date/time truncated to
just a day. I am not sure your method is optimal. You make a list of three
different things containing parts of a date. That can work, but as dates
already look like 2024-01-02, which sorts and compares well alphabetically, I
wonder if you could group by that instead.
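
A small sketch comparing the two groupings, assuming the same half-hourly
time stamps as in the original post. The seed and the names `by_parts` and
`by_day` are made up for the example, and plain mean() stands in for the
unshown my.mean().

```r
mydf <- data.frame(
  data_POSIX = seq(as.POSIXct("2024-01-01 00:00:00", tz = "Etc/GMT-1"),
                   as.POSIXct("2024-01-02 23:30:00", tz = "Etc/GMT-1"),
                   by = "30 min")
)
set.seed(1)
mydf$hs <- round(runif(96, 0, 100))

# Three separate keys, as in the original post
by_parts <- aggregate(mydf$hs,
                      by = list(year  = format(mydf$data_POSIX, "%Y"),
                                month = format(mydf$data_POSIX, "%m"),
                                day   = format(mydf$data_POSIX, "%d")),
                      FUN = mean)

# One ISO-style key; it sorts chronologically as plain text
by_day <- aggregate(mydf$hs,
                    by = list(day = format(mydf$data_POSIX, "%Y-%m-%d")),
                    FUN = mean)

print(by_day)
```

Both calls yield the same daily means; the single key simply gives one
grouping column instead of three. format() uses the time zone stored in the
POSIXct object, so day boundaries follow "Etc/GMT-1" here.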


