Christian Schoder
2010-May-28 21:58 UTC
[R] Drop observations in unbalanced panel data set according to missing values
Dear R-users,

I use firm-level data in panel structure. I would like to drop all firms that have fewer than x observations over the time scale in any of the variables considered. I would appreciate any help that (a) indicates relevant literature or websites or (b) indicates the code that could solve the problem.

Here is a detailed illustration of my problem. My data set is of the form

> df
   id  y  z
1   a  1  1
2   b NA  2
3   b  3  3
4   c  2  2
5   c  4  4
6   c  5 NA
7   d  6 NA
8   d  5  5
9   d  6  6
10  d  7  7
11  e NA NA
12  e NA  4
13  e  3  3

where id is the index of the firm, and y and z are observations such as assets and sales. Now I would like to apply a procedure that drops all firms which have fewer than 2 observed realizations in y or z. Thus, it should give me a data.frame which looks like

> df1
  id y  z
1  c 2  2
2  c 4  4
3  c 5 NA
4  d 6 NA
5  d 5  5
6  d 6  6
7  d 7  7

Thank you very much!
Christian Schoder
David Winsemius
2010-May-28 22:28 UTC
[R] Drop observations in unbalanced panel data set according to missing values
On May 28, 2010, at 5:58 PM, Christian Schoder wrote:

> Dear R-users,
>
> I use firm-level data in panel structure. I would like to drop all
> firms that have fewer than x observations over the time scale in any
> of the variables considered. I would appreciate any help that (a)
> indicates relevant literature or websites or (b) indicates the code
> that could solve the problem.
>
> Here is a detailed illustration of my problem. My data set is of the
> form
> > df
>    id  y  z
> 1   a  1  1
> 2   b NA  2
> 3   b  3  3
> 4   c  2  2
> 5   c  4  4
> 6   c  5 NA
> 7   d  6 NA
> 8   d  5  5
> 9   d  6  6
> 10  d  7  7
> 11  e NA NA
> 12  e NA  4
> 13  e  3  3
> where id is the index of the firm, and y and z are observations such
> as assets and sales. Now I would like to apply a procedure that
> drops all firms which have fewer than 2 observed realizations in y or
> z.

I try to avoid naming objects with common function names like df, so the data.frame is called dfrm here:

> dfrm$nrecy <- ave(dfrm$y, dfrm$id, FUN = function(x) sum(!is.na(x)))
> dfrm$nrecz <- ave(dfrm$z, dfrm$id, FUN = function(x) sum(!is.na(x)))
> dfrm
   id  y  z nrecy nrecz
1   a  1  1     1     1
2   b NA  2     1     2
3   b  3  3     1     2
4   c  2  2     3     2
5   c  4  4     3     2
6   c  5 NA     3     2
7   d  6 NA     4     3
8   d  5  5     4     3
9   d  6  6     4     3
10  d  7  7     4     3
11  e NA NA     1     2
12  e NA  4     1     2
13  e  3  3     1     2
> dfrm[with(dfrm, pmin(nrecy, nrecz) > 1), ]
   id y  z nrecy nrecz
4   c 2  2     3     2
5   c 4  4     3     2
6   c 5 NA     3     2
7   d 6 NA     4     3
8   d 5  5     4     3
9   d 6  6     4     3
10  d 7  7     4     3

Note that this does not guarantee that each id has at least 2 complete observations. If you wanted a solution to that problem, you would need a better test data.frame.

> Thus, it should give me a data.frame which looks like
> > df1
>   id y  z
> 1  c 2  2
> 2  c 4  4
> 3  c 5 NA
> 4  d 6 NA
> 5  d 5  5
> 6  d 6  6
> 7  d 7  7
>
> Thank you very much!
> Christian Schoder

David Winsemius, MD
West Hartford, CT
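
For anyone who wants to run the reply's approach end to end, here is a minimal, self-contained sketch. The data are retyped from the table in the original post. The helper name keep_firms and its arguments (vars, min_obs, id) are hypothetical and do not appear in the thread. It uses the same ave() counting idea, generalized to any set of columns.

```r
# Example data retyped from the original post (named dfrm, as in the reply).
dfrm <- data.frame(
  id = c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d", "e", "e", "e"),
  y  = c(1, NA, 3, 2, 4, 5, 6, 5, 6, 7, NA, NA, 3),
  z  = c(1, 2, 3, 2, 4, NA, NA, 5, 6, 7, NA, 4, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the thread): keep only the ids that have
# at least `min_obs` non-missing values in every column named in `vars`.
keep_firms <- function(data, vars, min_obs = 2, id = "id") {
  # One column of per-id non-NA counts for each variable, aligned with rows.
  counts <- sapply(vars, function(v) {
    ave(data[[v]], data[[id]], FUN = function(x) sum(!is.na(x)))
  })
  # A row is kept if its id meets the threshold in all variables.
  data[apply(counts, 1, min) >= min_obs, , drop = FALSE]
}

res <- keep_firms(dfrm, c("y", "z"), min_obs = 2)

# Stricter variant, as one possible reading of the caveat in the reply:
# require at least 2 rows per id in which y and z are both observed.
complete_n <- ave(as.numeric(complete.cases(dfrm[, c("y", "z")])),
                  dfrm$id, FUN = sum)
res_complete <- dfrm[complete_n >= 2, ]
```

On this example both res and res_complete contain only firms c and d (original rows 4 to 10), which is why the test data cannot distinguish the two criteria.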