Johannes Habel
2009-Feb-11 16:43 UTC
[R] How to count number of year per firm in panel data?
Hello, I have an unbalanced panel dataset and would like to exclude all objects that don't appear at least x times. Therefore, I would like to include a column indicating for every line how many periods are available, e.g. id, year, number 1, 2000, 3 1, 2001, 3 1, 2002, 3 2, 2001, 1 3, ..., ... This would allow me to exclude companies by setting "subset=number>=x". However, I don't know how to create this column, i.e. how to count the years for each object and include the numbers into the dataset. Could anybody help me, please? Alternatively, is there an easier way to achieve my goal? Thank you very much. Johannes Habel -- View this message in context: http://www.nabble.com/How-to-count-number-of-year-per-firm-in-panel-data--tp21958252p21958252.html Sent from the R help mailing list archive at Nabble.com.
Marc Schwartz
2009-Feb-11 17:38 UTC
[R] How to count number of year per firm in panel data?
on 02/11/2009 10:43 AM Johannes Habel wrote:> Hello, > > I have an unbalanced panel dataset and would like to exclude all objects > that don't appear at least x times. > > Therefore, I would like to include a column indicating for every line how > many periods are available, e.g. > > id, year, number > 1, 2000, 3 > 1, 2001, 3 > 1, 2002, 3 > 2, 2001, 1 > 3, ..., ... > > This would allow me to exclude companies by setting "subset=number>=x". > However, I don't know how to create this column, i.e. how to count the years > for each object and include the numbers into the dataset. > > Could anybody help me, please? > > Alternatively, is there an easier way to achieve my goal? > > Thank you very much. > > Johannes HabelYou don't need to add the extra column. You can just create a frequency table of the unique 'id' values, get the subset of values that meet your count criteria and then use those values in subset(). Let's create a little larger dataset: set.seed(1) id <- sample(letters[1:4], 8, replace = TRUE)> id[1] "b" "b" "c" "d" "a" "d" "d" "c" years <- unlist(lapply(split(id, id), function(i) 2000:(2000 + length(i) - 1)))> yearsa b1 b2 c1 c2 d1 d2 d3 2000 2000 2001 2000 2001 2000 2001 2002 DF <- data.frame(id = sort(id), year = years)> DFid year a a 2000 b1 b 2000 b2 b 2001 c1 c 2000 c2 c 2001 d1 d 2000 d2 d 2001 d3 d 2002> subset(DF, id %in% names(which(table(DF$id) >= 3)))id year d1 d 2000 d2 d 2001 d3 d 2002 Step by step:> table(DF$id)a b c d 1 2 2 3> table(DF$id) >= 3a b c d FALSE FALSE FALSE TRUE> which(table(DF$id) >= 3)d 4> names(which(table(DF$id) >= 3))[1] "d" Then use subset() as above, filtering only those id's that are in the names from the table. If we change the requirement to >= 2:> subset(DF, id %in% names(which(table(DF$id) >= 2)))id year b1 b 2000 b2 b 2001 c1 c 2000 c2 c 2001 d1 d 2000 d2 d 2001 d3 d 2002 See ?table, ?names, ?which and ?"%in%" If you really need to add the column, you could use aggregate() to get a count of years for each id as a data frame, then use merge() to add the column to DF:> aggregate(DF$year, list(id = DF$id), length)id x 1 a 1 2 b 2 3 c 2 4 d 3> merge(DF, aggregate(DF$year, list(id = DF$id), length), by = "id")id year x 1 a 2000 1 2 b 2000 2 3 b 2001 2 4 c 2000 2 5 c 2001 2 6 d 2000 3 7 d 2001 3 8 d 2002 3 Then use subset() as you initially considered. See ?aggregate and ?merge HTH, Marc Schwartz
Possibly Parallel Threads
- how to make aggregation in R ?
- rearrange data
- How to "flatten" a multidimensional array into a dataframe?
- linear model coefficients by year and industry, fitted values, residuals, panel data
- how can I import a number of datsets in a folder in my working directory to a list in R