Hi,
I've been trying out the plm package, which seems like a great boon to
those who want to analyze panel data in R. I haven't started to use the
estimation functions themselves - for now I am just interested in having
a robust way to deal with lags in unbalanced panel data, since it is
such a royal pain to deal with all the special cases.
However, in my tests I found behavior that seems strange at a minimum,
and potentially a bug, and I would like to understand it better. I
demonstrate it with an example, which I include below.
Basically, I want the lag function to deal "correctly" with a panel that
contains a unit (unit 1 in the example) with a gap, i.e. a missing entry
for a particular point in time (year 4 in the example).
What the example demonstrates is that the outcome when the unit 1
observations are lagged is different based on whether year 4 is present
in the observations on *unit 2*.
If year 4 is present for unit 2, the lagged pseries is suitable for
binding to the original data frame, with missing values at the correct
locations. The names() for the lagged series are incorrect, but I don't
really care about them. So this is basically the behavior I had hoped to
see.
However, if year 4 is not present in unit 2, the gap is not detected. A
cbind() with the original series now gives incorrect data, although the
names() of the lagged series could now be interpreted as correct in the
strictest sense. However, if this is the expected behavior, that means
that all estimation functions will have to examine the names() of each
series, which seems like a lot of work.
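To make concrete what I mean by "correct", here is a minimal base R
sketch of the result I was hoping for. It is not how plm does it; the
helper name lag_by_time() is made up for illustration, and it assumes a
numeric time index. It looks up the previous period by its time value
instead of by row position, so a gap yields NA no matter which years the
other units contain.

```r
# Hypothetical helper (not part of plm): lag x by k periods within each
# unit, matching on the time value itself rather than on row position.
lag_by_time <- function(id, time, x, k = 1) {
  key      <- paste(id, time)      # key of each observation
  prev_key <- paste(id, time - k)  # key of the observation k periods earlier
  x[match(prev_key, key)]          # NA wherever that earlier period is absent
}

# Same data as my second case: gap at year 4 in unit 1,
# and year 4 absent from unit 2 as well.
d <- data.frame(
  i = c(rep(1, 6), rep(2, 3)),
  t = c(1:3, 5:7, 1:3),
  x = 1:9
)
d$x_lag <- lag_by_time(d$i, d$t, d$x)
d$x_lag
# NA 1 2 NA 4 5 NA 7 8
```

The fourth value is NA because unit 1 has no observation at year 4 to
serve as the lag of year 5.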
My question then is: How should I interpret these results? Are gaps in
the data disallowed? And if so, should the creation of a pdata.frame
with gaps result in an error?
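If gaps are meant to be disallowed, a check along the following lines
could flag them up front. Again this is only a base R sketch under my own
assumptions: follows_gap() is a made-up name, not a plm function, and it
assumes a numeric time index with a step of 1.

```r
# Hypothetical helper (not part of plm): TRUE for each row whose
# predecessor within the same unit is more than one period earlier.
follows_gap <- function(id, time) {
  ord    <- order(id, time)        # sort by unit, then by time
  id_s   <- id[ord]
  time_s <- time[ord]
  n      <- length(id_s)
  # Compare each sorted row with the row before it
  gap_s  <- c(FALSE, id_s[-1] == id_s[-n] & (time_s[-1] - time_s[-n]) > 1)
  gap      <- logical(n)
  gap[ord] <- gap_s                # map back to the original row order
  gap
}

d <- data.frame(
  i = c(rep(1, 6), rep(2, 3)),
  t = c(1:3, 5:7, 1:3),
  x = 1:9
)
which(follows_gap(d$i, d$t))
# 4
```

Row 4 (unit 1, year 5) is flagged because its predecessor in unit 1 is
year 3.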
MY EXAMPLE FOLLOWS:
> # Construct pdata.frame with gap in unit 1, at year 4
> pdu <- pdata.frame(
+ data.frame(
+ i=c(rep(1,6),rep(2,3)),
+ t=c(1:3,5:7,2:4),
+ x=1:9
+ )
+ )
> # Using cbind() to view the series with its lagged
> # counterpart produces the expected result
> cbind( pdu$i, pdu$t, pdu$x , lag(pdu$x))
    [,1] [,2] [,3] [,4]
1-1    1    1    1   NA
1-2    1    2    2    1
1-3    1    3    3    2
1-5    1    5    4   NA
1-6    1    6    5    4
1-7    1    7    6    5
2-2    2    2    7   NA
2-3    2    3    8    7
2-4    2    4    9    8
> # The labels of the lagged series do not seem correct
> # (the second NA should be labeled as 1-4, not as 1-3).
> lag(pdu$x)
1-1 1-2 1-3 1-5 1-6 1-7 2-2 2-3
NA 1 2 NA 4 5 NA 7 8
attr(,"class")
[1] "integer"
>
> # Again, construct pdata.frame with (the same) gap in
> # unit 1, but now that time observation (4) is
> # not present in unit 2 either
> pdu <- pdata.frame(
+ data.frame(
+ i=c(rep(1,6),rep(2,3)),
+ t=c(1:3,5:7,1:3),
+ x=1:9
+ )
+ )
> # Now the cbind() of the two series seems wrong
> cbind( pdu$i, pdu$t, pdu$x , lag(pdu$x))
    [,1] [,2] [,3] [,4]
1-1    1    1    1   NA
1-2    1    2    2    1
1-3    1    3    3    2
1-5    1    4    4    3
1-6    1    5    5    4
1-7    1    6    6    5
2-1    2    1    7   NA
2-2    2    2    8    7
2-3    2    3    9    8
> # But the labels of the lagged series could be
> # interpreted as being correct in this case
> lag(pdu$x)
1-1 1-2 1-3 1-5 1-6 1-7 2-1 2-2
NA 1 2 3 4 5 NA 7 8
attr(,"class")
[1] "integer"
>
Best regards,
Magnus