Bert Gunter
2022-Jun-02 20:04 UTC
[R] rbind of multiple data frames by column name, when each data frames can contain different columns
Well, it seems better to me to put all the data frames in long format first and then rbind them, rather than the other way round, which results in the piles of NAs you see. I note also, FWIW, that this accords with the so-called "tidy" format that many advocate these days. From the long format you can, of course, always subset rows by station, date, etc. as needed.

Because of the regularity of your data frames, it is easy to do this. Here is a little base R function that "reforms" each data frame (I suspect Rui may well provide a more elegant version, though):

reform <- function(dat){
   nm <- names(dat)
   ids <- gsub("[^[:digit:]]", "", nm)   ## station number of each column ("" for the date column)
   stanums <- unique(ids[-1])            ## station numbers present
   ## exact match on the station number, so that "1" does not also pick up "10", "11", ...
   z <- do.call(rbind, lapply(stanums,
           \(i) structure(dat[, ids == i], names = c("Hs", "Hn", "flag"))))
   data.frame(POSIX = rep(dat[,1], length(stanums)),
              Station = rep(stanums, each = nrow(dat)),
              z)
}

e.g.

> reform(df2)
        POSIX Station Hs Hn flag
1  2001-12-01       1 50 20    0
2  2001-12-02       1 60 20    0
3  2001-12-03       1 70 20    0
4  2001-12-04       1 NA NA   NA
5  2001-12-05       1 NA NA   NA
6  2001-12-01       3 20  0    0
7  2001-12-02       3 20  0    0
8  2001-12-03       3 30 10    0
9  2001-12-04       3 30  0    1
10 2001-12-05       3  0  5    0

A call to rbind() of the following form then gives you all your data in long form (you may wish to use some shortcuts to form the list of frames):

> do.call(rbind, lapply(list(df1, df2, df3), reform))
        POSIX Station Hs Hn flag
1  2000-12-01       1 30 10    0
2  2000-12-02       1 40 20    0
3  2000-12-03       1 50 10    0
4  2000-12-04       1 NA NA   NA
5  2000-12-05       1 55  5    0
6  2000-12-01       2 20  0    0
7  2000-12-02       2 20  0    0
8  2000-12-03       2 30 10    0
9  2000-12-04       2 30  0    1
10 2000-12-05       2  0  5    0
11 2001-12-01       1 50 20    0
12 2001-12-02       1 60 20    0
13 2001-12-03       1 70 20    0
14 2001-12-04       1 NA NA   NA
15 2001-12-05       1 NA NA   NA
16 2001-12-01       3 20  0    0
17 2001-12-02       3 20  0    0
18 2001-12-03       3 30 10    0
19 2001-12-04       3 30  0    1
20 2001-12-05       3  0  5    0
21 2002-12-01       2 50 20    0
22 2002-12-02       2 60 20    0
23 2002-12-03       2 70 20    0
24 2002-12-04       2 NA NA   NA
25 2002-12-05       2 NA NA   NA
26 2002-12-01       3 20  0    0
27 2002-12-02       3 20  0    0
28 2002-12-03       3 30 10    0
29 2002-12-04       3 30  0    1
30 2002-12-05       3  0  5    0

Cheers,
Bert Gunter

On Wed, Jun 1, 2022 at 11:13 PM Stefano Sofia <stefano.sofia at regione.marche.it> wrote:
> Dear R-list users,
>
> for each winter season from 2000 to 2022 I have a data frame collecting
> for different weather stations snowpack height (Hs), snowfall in the last
> 24h (Hn) and a validation flag.
>
> Suppose I have these three following data frames
>
> df1 <- data.frame(
>   data_POSIX = seq(as.POSIXct("2000-12-01", format="%Y-%m-%d", tz="Etc/GMT-1"),
>                    as.POSIXct("2000-12-05", format="%Y-%m-%d", tz="Etc/GMT-1"),
>                    by="1 days"),
>   Station1_Hs = c(30, 40, 50, NA, 55),
>   Station1_Hn = c(10, 20, 10, NA, 5),
>   Station1_flag = c(0, 0, 0, NA, 0),
>   Station2_Hs = c(20, 20, 30, 30, 0),
>   Station2_Hn = c(0, 0, 10, 0, 5),
>   Station2_flag = c(0, 0, 0, 1, 0))
>
> df2 <- data.frame(
>   data_POSIX = seq(as.POSIXct("2001-12-01", format="%Y-%m-%d", tz="Etc/GMT-1"),
>                    as.POSIXct("2001-12-05", format="%Y-%m-%d", tz="Etc/GMT-1"),
>                    by="1 days"),
>   Station1_Hs = c(50, 60, 70, NA, NA),
>   Station1_Hn = c(20, 20, 20, NA, NA),
>   Station1_flag = c(0, 0, 0, NA, NA),
>   Station3_Hs = c(20, 20, 30, 30, 0),
>   Station3_Hn = c(0, 0, 10, 0, 5),
>   Station3_flag = c(0, 0, 0, 1, 0))
>
> df3 <- data.frame(
>   data_POSIX = seq(as.POSIXct("2002-12-01", format="%Y-%m-%d", tz="Etc/GMT-1"),
>                    as.POSIXct("2002-12-05", format="%Y-%m-%d", tz="Etc/GMT-1"),
>                    by="1 days"),
>   Station2_Hs = c(50, 60, 70, NA, NA),
>   Station2_Hn = c(20, 20, 20, NA, NA),
>   Station2_flag = c(0, 0, 0, NA, NA),
>   Station3_Hs = c(20, 20, 30, 30, 0),
>   Station3_Hn = c(0, 0, 10, 0, 5),
>   Station3_flag = c(0, 0, 0, 1, 0))
>
> As you can see, each data frame can have different stations loaded.
>
> I would need to call rbind matching data frames by column name (i.e. by
> station name), keeping in mind that the number of stations loaded in each
> data frame may differ.
> The result should be
>
> data_POSIX Station1_Hs Station1_Hn Station1_flag Station2_Hs Station2_Hn Station2_flag Station3_Hs Station3_Hn Station3_flag
> 2000-12-01          30          10             0          20           0             0          NA          NA            NA
> 2000-12-02          40          20             0          20           0             0          NA          NA            NA
> 2000-12-03          50          10             0          30          10             0          NA          NA            NA
> 2000-12-04          NA          NA            NA          30           0             0          NA          NA            NA
> 2000-12-05          55           5             0           0           5             0          NA          NA            NA
> 2001-12-01          50          20             0          NA          NA            NA          20           0             0
> 2001-12-02          60          20             0          NA          NA            NA          20           0             0
> 2001-12-03          70          20             0          NA          NA            NA          30          10             0
> 2001-12-04          NA          NA            NA          NA          NA            NA          30           0             1
> 2001-12-05          NA          NA            NA          NA          NA            NA           0           5             0
> 2002-12-01          NA          NA            NA          50          20             0          20           0             0
> 2002-12-02          NA          NA            NA          60          20             0          20           0             0
> 2002-12-03          NA          NA            NA          70          20             0          30          10             0
> 2002-12-04          NA          NA            NA          NA          NA            NA          30           0             1
> 2002-12-05          NA          NA            NA          NA          NA            NA           0           5             0
>
> I tried this code
>
> df_list <- list(df1, df2, df3)
> allNms <- unique(unlist(lapply(df_list, names)))
> do.call(rbind, c(lapply(df_list, function(x) data.frame(c(x,
>   sapply(setdiff(allNms, names(x)), function(y) NA)))), make.row.names=FALSE))
>
> but I get this error:
>
> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
>   arguments imply differing number of rows
>
> Could someone please help me?
>
> Thank you for your attention
>
> Stefano
>
> Stefano Sofia PhD
> Civil Protection - Marche Region - Italy
> Meteo Section
> Snow Section
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
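For readers who do need the wide layout asked for in the question, one base R possibility is to pad each data frame with NA columns for the stations it lacks and then rbind the padded frames. This is only a sketch under stated assumptions: `days()` and `pad_columns()` are hypothetical helper names introduced here, not functions from the thread or from any package, and the data are the three example frames from the question.

```r
# Hypothetical helper: five consecutive days, as in the question's example data.
days <- function(from, to) {
  seq(as.POSIXct(from, format = "%Y-%m-%d", tz = "Etc/GMT-1"),
      as.POSIXct(to,   format = "%Y-%m-%d", tz = "Etc/GMT-1"),
      by = "1 day")
}

# Example data copied from the question.
df1 <- data.frame(data_POSIX = days("2000-12-01", "2000-12-05"),
                  Station1_Hs = c(30, 40, 50, NA, 55),
                  Station1_Hn = c(10, 20, 10, NA, 5),
                  Station1_flag = c(0, 0, 0, NA, 0),
                  Station2_Hs = c(20, 20, 30, 30, 0),
                  Station2_Hn = c(0, 0, 10, 0, 5),
                  Station2_flag = c(0, 0, 0, 1, 0))
df2 <- data.frame(data_POSIX = days("2001-12-01", "2001-12-05"),
                  Station1_Hs = c(50, 60, 70, NA, NA),
                  Station1_Hn = c(20, 20, 20, NA, NA),
                  Station1_flag = c(0, 0, 0, NA, NA),
                  Station3_Hs = c(20, 20, 30, 30, 0),
                  Station3_Hn = c(0, 0, 10, 0, 5),
                  Station3_flag = c(0, 0, 0, 1, 0))
df3 <- data.frame(data_POSIX = days("2002-12-01", "2002-12-05"),
                  Station2_Hs = c(50, 60, 70, NA, NA),
                  Station2_Hn = c(20, 20, 20, NA, NA),
                  Station2_flag = c(0, 0, 0, NA, NA),
                  Station3_Hs = c(20, 20, 30, 30, 0),
                  Station3_Hn = c(0, 0, 10, 0, 5),
                  Station3_flag = c(0, 0, 0, 1, 0))

# Hypothetical helper: add an all-NA column for every name in `all_names`
# that `dat` lacks, then return the columns in the order given by `all_names`.
pad_columns <- function(dat, all_names) {
  for (nm in setdiff(all_names, names(dat))) {
    dat[[nm]] <- NA_real_
  }
  dat[all_names]
}

df_list   <- list(df1, df2, df3)
all_names <- unique(unlist(lapply(df_list, names)))

# Every padded frame now has identical columns, so rbind() can stack them.
wide <- do.call(rbind, lapply(df_list, pad_columns, all_names = all_names))
dim(wide)  # 15 rows, 10 columns
```

The padding step is what makes rbind() applicable, since rbind() on data frames expects the same set of column names in each argument. The long format discussed in the reply avoids the NA blocks altogether, so this wide version is mainly useful when a downstream tool expects one column per station and variable.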
CALUM POLWART
2022-Jun-03 07:48 UTC
[R] rbind of multiple data frames by column name, when each data frames can contain different columns
Bert! It sounds like you are warming to the tidyverse! ;-)

I completely agree with your analysis: this data would be best served long, as you have shown. If the OP is going to use the tidyverse to manipulate it, they can do the same with the tidyr::pivot_ functions (pivot_longer and pivot_wider). The result will be the same, but understanding how it happened may be easier!

On Thu, 2 Jun 2022, 21:04 Bert Gunter, <bgunter.4567 at gmail.com> wrote:
> Well, it seems better to me to put all the data frames in long format and
> then rbind them instead of the other way round, which results in the piles
> of NA's you see. I note also, FWIW, that this accords with the so-called
> "tidy" format that many advocate these days. You can always subset (rows)
> and choose by station, date, etc. as needed, of course from the long
> format.
>
> [...]
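The tidyr route mentioned above might look roughly like the following. It is a sketch, not code posted in the thread: it assumes the tidyr package is installed, that every measurement column is named `Station<number>_<variable>` as in the question, and `to_long()` and `days()` are hypothetical helper names introduced only for this illustration.

```r
library(tidyr)  # assumption: tidyr is installed

# Hypothetical helper: five consecutive days, as in the question's example data.
days <- function(from, to) {
  seq(as.POSIXct(from, format = "%Y-%m-%d", tz = "Etc/GMT-1"),
      as.POSIXct(to,   format = "%Y-%m-%d", tz = "Etc/GMT-1"),
      by = "1 day")
}

# Example data copied from the question.
df1 <- data.frame(data_POSIX = days("2000-12-01", "2000-12-05"),
                  Station1_Hs = c(30, 40, 50, NA, 55),
                  Station1_Hn = c(10, 20, 10, NA, 5),
                  Station1_flag = c(0, 0, 0, NA, 0),
                  Station2_Hs = c(20, 20, 30, 30, 0),
                  Station2_Hn = c(0, 0, 10, 0, 5),
                  Station2_flag = c(0, 0, 0, 1, 0))
df2 <- data.frame(data_POSIX = days("2001-12-01", "2001-12-05"),
                  Station1_Hs = c(50, 60, 70, NA, NA),
                  Station1_Hn = c(20, 20, 20, NA, NA),
                  Station1_flag = c(0, 0, 0, NA, NA),
                  Station3_Hs = c(20, 20, 30, 30, 0),
                  Station3_Hn = c(0, 0, 10, 0, 5),
                  Station3_flag = c(0, 0, 0, 1, 0))
df3 <- data.frame(data_POSIX = days("2002-12-01", "2002-12-05"),
                  Station2_Hs = c(50, 60, 70, NA, NA),
                  Station2_Hn = c(20, 20, 20, NA, NA),
                  Station2_flag = c(0, 0, 0, NA, NA),
                  Station3_Hs = c(20, 20, 30, 30, 0),
                  Station3_Hn = c(0, 0, 10, 0, 5),
                  Station3_flag = c(0, 0, 0, 1, 0))

# Hypothetical helper: split each column name into a station number and a
# variable name; ".value" tells pivot_longer() to turn the variable part
# (Hs, Hn, flag) into separate output columns.
to_long <- function(dat) {
  pivot_longer(dat,
               cols = -data_POSIX,
               names_to = c("Station", ".value"),
               names_pattern = "Station([0-9]+)_(.*)")
}

# One long table: a row per date and station.
long <- do.call(rbind, lapply(list(df1, df2, df3), to_long))

# Going back to one column per station and variable, with NA where a
# station was not loaded in a given season.
wide <- pivot_wider(long,
                    names_from = Station,
                    values_from = c(Hs, Hn, flag),
                    names_glue = "Station{Station}_{.value}")
```

Here `long` holds the same 30 observations as the base R result quoted above, although the rows come out ordered by date within each season rather than by station. The `pivot_wider()` call shows that the NA-padded wide table can be regenerated from the long one whenever it is needed, which is one argument for storing the data long.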