thr3ads.net - R help - [R] How to set panel data format [Jul 2013]


Rui Barradas

2013-Jul-13 16:04 UTC

[R] How to set panel data format

Hello,

It's better if you keep this on the list; the odds of getting more and
better answers are greater.

Inline.

On 13-07-2013 15:38, serenamasino at gmail.com wrote:
> Hi Rui,
> thanks for your reply.
>
> No, my problem isn't one of reshaping. It is just that I want R to know
> I have a panel and not just cross sections or time series.
>
> In other words, if I had cross-section data:
>
> COUNTRY    YEAR   GDP
> Albania    1999   3
> Barbados   1999   5
> Congo      1999   1
> Denmark    1999   11
> etc.       ..     ..
>
> My ID here is country, but every observation is a new cluster independent
> of each other, so I don't care to let R know because the ID is a unique
> identifier.
>
> Whereas if I have a panel
>
> COUNTRY    YEAR   GDP
> Albania    1999   3
> Albania    2000   3.5
> Albania    2001   3.7
> Albania    2002   4
> Albania    2003   4.5
> Barbados   1999   5
> Barbados   2000   5
> Barbados   2001   5.1
> Barbados   2002   4
> Barbados   2003   3
> Congo      1999   1
> Congo      2000   2
> Congo      2001   2
> Congo      2002   3
> Congo      2003   4
> Denmark    1999   11
> Denmark    2000   12
> Denmark    2001   13
> Denmark    2002   10
> Denmark    2003   10
> etc.       ..     ..
>
> How am I going to tell R that Albania is one same ID for all the 5 years I
> have in the panel, in other words, Albania has to be identified by the same
> number in the "factor" vector which R codes it with. Then Barbados is
> ID 2 in all its years, Congo has ID 3 and so on.

R already does that; factors are coded as integers:

as.integer(dat$COUNTRY) # Albania is 1, etc
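
For anyone who wants to try this, here is a minimal sketch. The data frame `dat` is rebuilt from the table in the question (an assumption about how the data were actually stored), with COUNTRY read in as a factor:

```r
# Hypothetical reconstruction of the panel shown in the question
dat <- data.frame(
  COUNTRY = rep(c("Albania", "Barbados", "Congo", "Denmark"), each = 5),
  YEAR    = rep(1999:2003, times = 4),
  GDP     = c(3, 3.5, 3.7, 4, 4.5,
              5, 5, 5.1, 4, 3,
              1, 2, 2, 3, 4,
              11, 12, 13, 10, 10),
  stringsAsFactors = TRUE   # make COUNTRY a factor
)

# The factor levels are the distinct countries, sorted alphabetically
levels(dat$COUNTRY)

# The integer codes: one value per row, identical within a country
id <- as.integer(dat$COUNTRY)
id
```

Every Albania row should get code 1, every Barbados row code 2, and so on, because the codes follow the order of the factor levels.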

> In STATA, you sort 'by country year' and the program knows it is a
> panel of entities observed more than once over time.  But I am not sure how to
> let R know the same.
>
> In practice the reason why it is important to define where a country ends
> and where a new one begins is because
>
> 1) if one creates lags of variables and the program doesn't know where
> the boundaries between countries are, the lag for the first year of Barbados in
> my previous example will be calculated using the last year of Albania, that is,
> the preceding country.

A way of numbering the countries, equivalent to the previous line of code if the
countries are grouped consecutively, is

cumsum(c(TRUE, dat$COUNTRY[-nrow(dat)] != dat$COUNTRY[-1L]))
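
On the lag concern in point 1, a hedged sketch in base R: `ave()` applies a function separately within each country, so the first year of each country gets NA rather than the last value of the preceding country. The data frame `dat` is rebuilt from the question and the column name `GDP_lag` is made up for illustration; the rows are assumed to be sorted by year within each country.

```r
# Hypothetical reconstruction of the panel shown in the question
dat <- data.frame(
  COUNTRY = rep(c("Albania", "Barbados", "Congo", "Denmark"), each = 5),
  YEAR    = rep(1999:2003, times = 4),
  GDP     = c(3, 3.5, 3.7, 4, 4.5,
              5, 5, 5.1, 4, 3,
              1, 2, 2, 3, 4,
              11, 12, 13, 10, 10),
  stringsAsFactors = TRUE
)

# Run-based ID as in the cumsum() line: goes up by 1 whenever COUNTRY changes
run_id <- cumsum(c(TRUE, dat$COUNTRY[-nrow(dat)] != dat$COUNTRY[-1L]))

# One-period lag of GDP, computed within each country
# (assumes rows are ordered by YEAR inside each country)
dat$GDP_lag <- ave(dat$GDP, dat$COUNTRY,
                   FUN = function(x) c(NA, head(x, -1)))

# Barbados 1999 has no lag; it does not borrow Albania's 2003 value
dat[dat$COUNTRY == "Barbados", ]
```

If an add-on package is acceptable, the plm package provides `pdata.frame()` for declaring the individual and time indexes of a panel; the sketch above only shows the base R route.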
>
> 2) I need to create country dummies that take the value of 1 whenever a
> country ID is equal to 1, so if Albania has 5 years of observations and each of
> the year observations appears with a different ID, the country dummies will not
> be created. Instead if Albania has the same country identifier (1) for all the
> years in which it is observed, the country dummy will be the same and ==1
> whenever Albania is the country observed

I doubt you need to create dummies; R does it for you when you use a
factor in a model formula. Internally, factors are coded as integers, so all
you need is to coerce them to integer like I've said earlier.
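
To illustrate the point about dummies, a small sketch, again with a hypothetical `dat` rebuilt from the question. `model.matrix()` shows the indicator columns R derives from a factor, and `lm()` builds them on its own when the factor appears in the formula (the model itself is only an illustration, not a recommendation):

```r
# Hypothetical reconstruction of the panel shown in the question
dat <- data.frame(
  COUNTRY = rep(c("Albania", "Barbados", "Congo", "Denmark"), each = 5),
  YEAR    = rep(1999:2003, times = 4),
  GDP     = c(3, 3.5, 3.7, 4, 4.5,
              5, 5, 5.1, 4, 3,
              1, 2, 2, 3, 4,
              11, 12, 13, 10, 10),
  stringsAsFactors = TRUE
)

# Explicit dummy columns, one per country ("- 1" drops the intercept
# so that no country is absorbed as the reference level)
dummies <- model.matrix(~ COUNTRY - 1, data = dat)
head(dummies)

# In a model, the factor is expanded automatically; the first level
# (Albania) becomes the reference category
fit <- lm(GDP ~ COUNTRY, data = dat)
names(coef(fit))
```

Each row has exactly one dummy equal to 1, and the Albania dummy is 1 in all five Albania rows.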

Rui Barradas
>
> Hope this makes it clearer,
> Thanks,
> Serena
>
> _____________________________________
> Sent from http://r.789695.n4.nabble.com
>

arun

2013-Jul-13 16:47 UTC

[R] How to set panel data format

Hi,

as.integer(dat$COUNTRY) # would be the easiest (Rui's solution).

Other options could also be used:
library(plyr)
as.integer(mapvalues(dat$COUNTRY, levels(dat$COUNTRY), seq_along(levels(dat$COUNTRY))))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
# or
match(dat$COUNTRY, levels(dat$COUNTRY))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4


# if `COUNTRY` is not a factor

dat$COUNTRY <- as.character(dat$COUNTRY)
as.integer(mapvalues(dat$COUNTRY, unique(dat$COUNTRY), seq_along(unique(dat$COUNTRY))))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

# or (if it is already sorted and every country has the same number of rows)
(seq_along(dat$COUNTRY) - 1) %/% as.vector(table(dat$COUNTRY)) + 1
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
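
A note on the last option: the integer division relies on every country having the same number of rows. As a hedged sketch for an unbalanced but sorted panel, the IDs can instead be taken from runs of identical values or from the order of first appearance. The vector `country` below is made up for illustration:

```r
# Hypothetical unbalanced, already sorted country column
country <- c(rep("Albania", 3), rep("Barbados", 5), rep("Congo", 2))

# IDs from runs of identical consecutive values
runs <- rle(country)
id_rle <- rep(seq_along(runs$lengths), times = runs$lengths)

# IDs from order of first appearance (does not need sorted input)
id_match <- match(country, unique(country))

id_rle
identical(id_rle, id_match)
```

Both give 1 for the three Albania rows, 2 for the five Barbados rows and 3 for the two Congo rows.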
A.K.


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
