Christian Schoder
2010-Jul-22 09:18 UTC
[R] Drop firms in unbalanced panel if not more than 5 observations in consecutive years for all variables
Dear R-user, a few weeks ago I consulted the list-serve with a similar question. However, my task changed a little but sufficiently to get lost again. So I would appreciate any help on the following issue. I use the plm package and work with firm-level data in a panel. I would like to eliminate all firms that do not fulfill the requirement of having an observation in every variable used for at least x consecutive years. For illustration of the problem assume the following data set> dataid year y z 1 a 2000 1 1 2 b 2000 NA 2 3 b 2001 3 3 4 c 1999 1 1 5 c 2000 2 2 6 c 2001 4 NA 7 c 2002 5 4 8 d 1998 6 5 9 d 1999 5 NA 10 d 2000 6 6 11 d 2001 7 7 12 d 2002 3 6 where id is the index of the firm, year the index for the year, and y and z are variables. Now, I would like to get rid of all firms with, let's say, less than 3 consecutive years in which there are observations for every variable. Hence, the procedure should yield> data.reducedid year y z 1 d 1998 6 5 2 d 1999 5 NA 3 d 2000 6 6 4 d 2001 7 7 5 d 2002 3 6 Thank you very much for any help! Cheers, Christian
Stefan Grosse
2010-Jul-22 10:28 UTC
[R] Drop firms in unbalanced panel if not more than 5 observations in consecutive years for all variables
Am 22.07.2010 11:18, schrieb Christian Schoder:> I use the plm package and work with firm-level data in a panel. I would > like to eliminate all firms that do not fulfill the requirement of > having an observation in every variable used for at least x consecutive > years.There are probably more ellegant solutions but you could do something like: test<-data.frame(year=c(2001,2001,2002,2001,2002,2003),id=c(1,2,2,3,3,3),x=runif(6)) test1<-aggregate(test,by=list(id=test$id),FUN=c("length","mean")) test2<-merge(test,test1[,c("id","x")],by.x="id",by.y="id") names(test2)<-c(names(test),"count") testsub<-subset(test2,count>2) testsub hth Stefan
Allan Engelhardt
2010-Jul-22 10:34 UTC
[R] Drop firms in unbalanced panel if not more than 5 observations in consecutive years for all variables
One option: /## Your data:/ data <- structure(list(id = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"), year = c(2000L, 2000L, 2001L, 1999L, 2000L, 2001L, 2002L, 1998L, 1999L, 2000L, 2001L, 2002L), y = c(1L, NA, 3L, 1L, 2L, 4L, 5L, 6L, 5L, 6L, 7L, 3L), z = c(1L, 2L, 3L, 1L, 2L, NA, 4L, 5L, NA, 6L, 7L, 6L)), .Names = c("id", "year", "y", "z"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")) /## Drop missing data/ Data <- na.omit(data) /## Find three years in a row with data/ n <- names(which(apply(table(Data$id, Data$year), 1, function(x) {z<-rle(x); z<-z[z$values>0]; any(z$lengths>=3)}))) /## Now get the original data for these ids/ data[data$id %in% n, ] /# id year y z # 8 d 1998 6 5 # 9 d 1999 5 NA # 10 d 2000 6 6 # 11 d 2001 7 7 # 12 d 2002 3 6/ This is probably not optimal, but I hope it helps a little. Allan On 22/07/10 10:18, Christian Schoder wrote:> Dear R-user, > > a few weeks ago I consulted the list-serve with a similar question. > However, my task changed a little but sufficiently to get lost again. So > I would appreciate any help on the following issue. > > I use the plm package and work with firm-level data in a panel. I would > like to eliminate all firms that do not fulfill the requirement of > having an observation in every variable used for at least x consecutive > years. > > For illustration of the problem assume the following data set > >> data >> > id year y z > 1 a 2000 1 1 > 2 b 2000 NA 2 > 3 b 2001 3 3 > 4 c 1999 1 1 > 5 c 2000 2 2 > 6 c 2001 4 NA > 7 c 2002 5 4 > 8 d 1998 6 5 > 9 d 1999 5 NA > 10 d 2000 6 6 > 11 d 2001 7 7 > 12 d 2002 3 6 > where id is the index of the firm, year the index for the year, and y > and z are variables. Now, I would like to get rid of all firms with, > let's say, less than 3 consecutive years in which there are observations > for every variable. Hence, the procedure should yield > >> data.reduced >> > id year y z > 1 d 1998 6 5 > 2 d 1999 5 NA > 3 d 2000 6 6 > 4 d 2001 7 7 > 5 d 2002 3 6 > > Thank you very much for any help! > > Cheers, Christian > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Dimitris Rizopoulos
2010-Jul-22 11:34 UTC
[R] Drop firms in unbalanced panel if not more than 5 observations in consecutive years for all variables
try this: Dat <- read.table(textConnection( "id year y z 1 a 2000 1 1 2 b 2000 NA 2 3 b 2001 3 3 4 c 1999 1 1 5 c 2000 2 2 6 c 2001 4 NA 7 c 2002 5 4 8 d 1998 6 5 9 d 1999 5 NA 10 d 2000 6 6 11 d 2001 7 7 12 d 2002 3 6" ), header = TRUE) closeAllConnections() n.years <- 3 # the threshold na.ind <- !rowSums(is.na(Dat[-(1:2)])) # the variables of interest ind <- ave(na.ind, Dat$id, FUN = function (x) any(cumsum(x) > n.years)) Dat[ind, ] I hope it helps. Best, Dimitris On 7/22/2010 11:18 AM, Christian Schoder wrote:> Dear R-user, > > a few weeks ago I consulted the list-serve with a similar question. > However, my task changed a little but sufficiently to get lost again. So > I would appreciate any help on the following issue. > > I use the plm package and work with firm-level data in a panel. I would > like to eliminate all firms that do not fulfill the requirement of > having an observation in every variable used for at least x consecutive > years. > > For illustration of the problem assume the following data set >> data > id year y z > 1 a 2000 1 1 > 2 b 2000 NA 2 > 3 b 2001 3 3 > 4 c 1999 1 1 > 5 c 2000 2 2 > 6 c 2001 4 NA > 7 c 2002 5 4 > 8 d 1998 6 5 > 9 d 1999 5 NA > 10 d 2000 6 6 > 11 d 2001 7 7 > 12 d 2002 3 6 > where id is the index of the firm, year the index for the year, and y > and z are variables. Now, I would like to get rid of all firms with, > let's say, less than 3 consecutive years in which there are observations > for every variable. Hence, the procedure should yield >> data.reduced > id year y z > 1 d 1998 6 5 > 2 d 1999 5 NA > 3 d 2000 6 6 > 4 d 2001 7 7 > 5 d 2002 3 6 > > Thank you very much for any help! > > Cheers, Christian > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Dimitris Rizopoulos Assistant Professor Department of Biostatistics Erasmus University Medical Center Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands Tel: +31/(0)10/7043478 Fax: +31/(0)10/7043014
Gabor Grothendieck
2010-Jul-22 11:40 UTC
[R] Drop firms in unbalanced panel if not more than 5 observations in consecutive years for all variables
On Thu, Jul 22, 2010 at 5:18 AM, Christian Schoder <schoc152 at newschool.edu> wrote:> Dear R-user, > > a few weeks ago I consulted the list-serve with a similar question. > However, my task changed a little but sufficiently to get lost again. So > I would appreciate any help on the following issue. > > I use the plm package and work with firm-level data in a panel. I would > like to eliminate all firms that do not fulfill the requirement of > having an observation in every variable used for at least x consecutive > years. > > For illustration of the problem assume the following data set >> data > ? id year ?y ?z > 1 ? a 2000 ?1 ?1 > 2 ? b 2000 NA ?2 > 3 ? b 2001 ?3 ?3 > 4 ? c 1999 ?1 ?1 > 5 ? c 2000 ?2 ?2 > 6 ? c 2001 ?4 NA > 7 ? c 2002 ?5 ?4 > 8 ? d 1998 ?6 ?5 > 9 ? d 1999 ?5 NA > 10 ?d 2000 ?6 ?6 > 11 ?d 2001 ?7 ?7 > 12 ?d 2002 ?3 ?6 > where id is the index of the firm, year the index for the year, and y > and z are variables. Now, I would like to get rid of all firms with, > let's say, less than 3 consecutive years in which there are observations > for every variable. Hence, the procedure should yield >> data.reduced > ? id year ?y ?z > 1 ? d 1998 ?6 ?5 > 2 ? d 1999 ?5 NA > 3 ? d 2000 ?6 ?6 > 4 ? d 2001 ?7 ?7 > 5 ? d 2002 ?3 ?6 >Try this: do.call(rbind, by(DF, DF$id, function(x) if (length(na.contiguous(x$y * x$z)) >= 3) x ))