ANGELO.LINARDI at bancaditalia.it
2011-Mar-28 11:56 UTC
[R] "Holes" in a data frame with time intervals
Good morning, I am facing a problem very easy to solve with a program, but not too easy (at least IMHO) with a "declarative" approach. I have a dataframe df with some information about bank branches with a validity time associated (start date/end date, format YYYY-MM-DD) to some attributes (for example number of employees assigned). The following example will clarify this description: BANK_ID BRANCH_ID EMPLOYEE # STARTDATE ENDDATE B1 A15 30 2001-01-01 2001-10-15 B1 A15 28 2001-10-16 2001-12-31 B1 A15 32 2002-01-01 2003-04-18 B1 A15 33 2003-04-19 2004-12-31 B1 A15 29 2005-03-01 2010-12-31 B1 A15 30 2011-02-12 9999-12-31 ........................................................................ ........................................................................ ..................... I would like to find the "missing time intervals" ("holes" - in the example 2005-01-01 to 2005-02-28 and from 2011-01-01 to 2011-02-11). The "programmer's way" would be: * Sort the data by "key" + start date * For each occurrence add 1 day to end date and compare the result with the start date of the following occurrence Can someone help me in finding a "declarative" way to do it ? Thank you in advance Angelo Linardi ** Le e-mail provenienti dalla Banca d'Italia sono trasmesse in buona fede e non comportano alcun vincolo ne' creano obblighi per la Banca stessa, salvo che cio' non sia espressamente previsto da un accordo scritto. Questa e-mail e' confidenziale. Qualora l'avesse ricevuta per errore, La preghiamo di comunicarne via e-mail la ricezione al mittente e di distruggerne il contenuto. La informiamo inoltre che l'utilizzo non autorizzato del messaggio o dei suoi allegati potrebbe costituire reato. Grazie per la collaborazione. -- E-mails from the Bank of Italy are sent in good faith but they are neither binding on the Bank nor to be understood as creating any obligation on its part except where provided for in a written agreement. This e-mail is confidential. If you have received it by mistake, please inform the sender by reply e-mail and delete it from your system. Please also note that the unauthorized disclosure or use of the message or any attachments could be an offence. Thank you for your cooperation. ** [[alternative HTML version deleted]]
Does this do what you want:> x <- read.table(textConnection("B1 A15 30 2001-01-01 2001-10-15+ B1 A15 28 2001-10-16 2001-12-31 + B1 A15 32 2002-01-01 2003-04-18 + B1 A15 33 2003-04-19 2004-12-31 + B1 A15 29 2005-03-01 2010-12-31 + B1 A15 30 2011-02-12 9999-12-31"))> closeAllConnections() > x$V4 <- as.Date(x$V4) > x$V5 <- as.Date(x$V5) > # create dataframe with intervals > y <- rbind(cbind(x$V4, 1), cbind(x$V5 + 1, -1)) > y <- y[order(y[,1]),] > y <- cbind(y, count = cumsum(y[,2])) > ycount [1,] 11323 1 1 [2,] 11611 1 2 [3,] 11611 -1 1 [4,] 11688 1 2 [5,] 11688 -1 1 [6,] 12161 1 2 [7,] 12161 -1 1 [8,] 12784 -1 0 [9,] 12843 1 1 [10,] 14975 -1 0 [11,] 15017 1 1 [12,] 2932897 -1 0> # find counts of zero to determine intervals > missing <- which(y[,'count'] == 0) > # remove any index equal to last one > missing <- missing[missing != nrow(y)] > # print intervals > paste(as.Date(y[missing,1], origin = '1970-1-1')+ , 'to' + , as.Date(y[missing + 1, 1] - 1, origin = '1970-1-1') + ) [1] "2005-01-01 to 2005-02-28" "2011-01-01 to 2011-02-11"> >On Mon, Mar 28, 2011 at 7:56 AM, <ANGELO.LINARDI at bancaditalia.it> wrote:> Good morning, > > > > I am facing a problem very easy to solve with a program, but not too > easy (at least IMHO) with a "declarative" approach. > > I have a dataframe df with some information about bank branches with a > validity time associated (start date/end date, format YYYY-MM-DD) to > some attributes (for example number of employees assigned). > > > > The following example will clarify this description: > > > > BANK_ID ? ? BRANCH_ID ? EMPLOYEE # ?STARTDATE ? ENDDATE > > B1 ? ? ? ? ?A15 ? ? ? ? 30 ? ? ? ? ?2001-01-01 ?2001-10-15 > > B1 ? ? ? ? ?A15 ? ? ? ? 28 ? ? ? ? ?2001-10-16 ?2001-12-31 > > B1 ? ? ? ? ?A15 ? ? ? ? 32 ? ? ? ? ?2002-01-01 ?2003-04-18 > > B1 ? ? ? ? ?A15 ? ? ? ? 33 ? ? ? ? ?2003-04-19 ?2004-12-31 > > B1 ? ? ? ? ?A15 ? ? ? ? 29 ? ? ? ? ?2005-03-01 ?2010-12-31 > > B1 ? ? ? ? ?A15 ? ? ? ? 30 ? ? ? ? ?2011-02-12 ?9999-12-31 > > ........................................................................ > ........................................................................ > ..................... > > > > I would like to find the "missing time intervals" ("holes" - in the > example 2005-01-01 to 2005-02-28 and from 2011-01-01 to 2011-02-11). > > The "programmer's way" would be: > > > > * ? ? ? ? Sort the data by "key" + start date > > * ? ? ? ? For each occurrence add 1 day to end date and compare the > result with the start date of the following occurrence > > > > Can someone help me in finding a "declarative" way to do it ? > > > > Thank you in advance > > > > Angelo Linardi > > > > > > > > > > > ** Le e-mail provenienti dalla Banca d'Italia sono trasmesse in buona fede e non > comportano alcun vincolo ne' creano obblighi per la Banca stessa, salvo che cio' non > sia espressamente previsto da un accordo scritto. > Questa e-mail e' confidenziale. Qualora l'avesse ricevuta per errore, La preghiamo di > comunicarne via e-mail la ricezione al mittente e di distruggerne il contenuto. La > informiamo inoltre che l'utilizzo non autorizzato del messaggio o dei suoi allegati > potrebbe costituire reato. Grazie per la collaborazione. > -- E-mails from the Bank of Italy are sent in good faith but they are neither binding on > the Bank nor to be understood as creating any obligation on its part except where > provided for in a written agreement. This e-mail is confidential. If you have received it > by mistake, please inform the sender by reply e-mail and delete it from your system. > Please also note that the unauthorized disclosure or use of the message or any > attachments could be an offence. Thank you for your cooperation. ** > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Data Munger Guru What is the problem that you are trying to solve?
Henrique Dallazuanna
2011-Mar-28 12:41 UTC
[R] "Holes" in a data frame with time intervals
Try this: dif <- na.omit(c(DF$STARTDATE[-1], NA) - DF$ENDDATE) list(DF$ENDDATE[dif != 1] + 1, DF$ENDDATE[dif != 1] + (dif[dif != 1] - 1)) On Mon, Mar 28, 2011 at 8:56 AM, <ANGELO.LINARDI at bancaditalia.it> wrote:> Good morning, > > > > I am facing a problem very easy to solve with a program, but not too > easy (at least IMHO) with a "declarative" approach. > > I have a dataframe df with some information about bank branches with a > validity time associated (start date/end date, format YYYY-MM-DD) to > some attributes (for example number of employees assigned). > > > > The following example will clarify this description: > > > > BANK_ID ? ? BRANCH_ID ? EMPLOYEE # ?STARTDATE ? ENDDATE > > B1 ? ? ? ? ?A15 ? ? ? ? 30 ? ? ? ? ?2001-01-01 ?2001-10-15 > > B1 ? ? ? ? ?A15 ? ? ? ? 28 ? ? ? ? ?2001-10-16 ?2001-12-31 > > B1 ? ? ? ? ?A15 ? ? ? ? 32 ? ? ? ? ?2002-01-01 ?2003-04-18 > > B1 ? ? ? ? ?A15 ? ? ? ? 33 ? ? ? ? ?2003-04-19 ?2004-12-31 > > B1 ? ? ? ? ?A15 ? ? ? ? 29 ? ? ? ? ?2005-03-01 ?2010-12-31 > > B1 ? ? ? ? ?A15 ? ? ? ? 30 ? ? ? ? ?2011-02-12 ?9999-12-31 > > ........................................................................ > ........................................................................ > ..................... > > > > I would like to find the "missing time intervals" ("holes" - in the > example 2005-01-01 to 2005-02-28 and from 2011-01-01 to 2011-02-11). > > The "programmer's way" would be: > > > > * ? ? ? ? Sort the data by "key" + start date > > * ? ? ? ? For each occurrence add 1 day to end date and compare the > result with the start date of the following occurrence > > > > Can someone help me in finding a "declarative" way to do it ? > > > > Thank you in advance > > > > Angelo Linardi > > > > > > > > > > > ** Le e-mail provenienti dalla Banca d'Italia sono trasmesse in buona fede e non > comportano alcun vincolo ne' creano obblighi per la Banca stessa, salvo che cio' non > sia espressamente previsto da un accordo scritto. > Questa e-mail e' confidenziale. Qualora l'avesse ricevuta per errore, La preghiamo di > comunicarne via e-mail la ricezione al mittente e di distruggerne il contenuto. La > informiamo inoltre che l'utilizzo non autorizzato del messaggio o dei suoi allegati > potrebbe costituire reato. Grazie per la collaborazione. > -- E-mails from the Bank of Italy are sent in good faith but they are neither binding on > the Bank nor to be understood as creating any obligation on its part except where > provided for in a written agreement. This e-mail is confidential. If you have received it > by mistake, please inform the sender by reply e-mail and delete it from your system. > Please also note that the unauthorized disclosure or use of the message or any > attachments could be an offence. Thank you for your cooperation. ** > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Henrique Dallazuanna Curitiba-Paran?-Brasil 25? 25' 40" S 49? 16' 22" O