Thanks for sharing this, Ista.
I've come to the conclusion that R doesn't have what I'm looking for,
either in the base or the packages.
Although your examples are insightful, the examples we've been
discussing are deliberately easier than what one would expect in most
serious applications. Imagine for instance that we're studying wage
structures of industries in different geographic labor markets. We
therefore might have four variables: wages, industries, occupations, and
places. We might want to see if wage differentials are more or less
constant or if they are higher in some geographic areas than in others.
Since industries, occupations, and places are typically coded
hierarchically as we've been discussing, we might want to figure out how
to examine different wage levels within industries, etc. Doing this
manually would require lots of w
whereas conceptually the
On 5/4/2010 6:00 AM,
> Message: 49
> Date: Mon, 3 May 2010 13:22:59 -0400
> From: Ista Zahn <istazahn@gmail.com>
> To: Marshall Feldman <marsh@uri.edu>
> Cc: r-help@r-project.org
> Subject: Re: [R] Hierarchical factors
> Message-ID: <x2xf55e7cf51005031022se4c46967s174efeef95331abc@mail.gmail.com>
>
> Hi Marshall,
>
> I'm not aware of any packages that implement these features as you
> described them. But most of the tasks are already fairly easy in R --
> see below.
>
> On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <marsh@uri.edu> wrote:
>> >
>> > Thanks for getting back so quickly Ista,
>> >
>> > I was actually casting about for any examples of R software that
>> > deals with this kind of structure. But your question is a good one.
>> > Here are a few things I'd like to be able to do:
>> >
>> > Store data in R at the finest level of detail but easily refer to
>> > higher levels of aggregation. If the data include such higher levels,
>> > this is trivial, but otherwise I'd like to aggregate fairly easily.
>> > The following is not functioning code, but it should give you the idea:
>> >
>> > Start with a data frame (call it d) having row.names equal to the
>> > 6-digit NAICS code and columns with various variables; assume one is
>> > named employment.
>> >     d[,"employment"]            # Would print all employment data
>> >     d["441222","employment"]    # Would print only Boat Dealer employment
>> >     d["44","employment"]        # Would print total employment for Retail Trade
>>
> d[,"employment"] # prints all employment data
> d[rownames(d) == "441222","employment"] # prints only boat dealer employment
> d[grep("^44", rownames(d)),"employment"] # prints employment for every retail
>                                          # trade row; wrap in sum() for the total
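
A minimal sketch of the prefix lookup above, written as a small helper. The data frame is invented for illustration (only 441222 comes from this thread; the other codes and all employment figures are made up), and naics_total() is a hypothetical name, not an existing function. It uses startsWith() rather than grep(), and sums the matching rows to get the sector total:

```r
# Toy data: row names are 6-digit codes; employment values are invented
d <- data.frame(employment = c(120, 80, 45, 200),
                row.names  = c("441222", "441110", "442110", "423110"))

# Hypothetical helper: total of one column over all rows whose code
# starts with the given prefix (works at any level of the hierarchy)
naics_total <- function(df, prefix, column) {
  sum(df[startsWith(rownames(df), prefix), column])
}

retail_total <- naics_total(d, "44", "employment")    # rows 441222, 441110, 442110
dealer_total <- naics_total(d, "4412", "employment")  # only row 441222
```

Because the codes are stored as character row names, the same helper serves every level of the hierarchy without any extra columns.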
>
>
>> >
>> > Recursive nesting. I'm not sure how to convey this except with
>> > examples. Suppose the data frame also has a "wages" column with
>> > average weekly wages in the industry, and the industry code is also a
>> > factor variable (industry). So a simple analysis of variance might
>> > look like:
>> >
>> >     w <- aov(wages ~ industry, d)
>> >
>> > But now what I'd like to do is to break this down within 2-digit
>> > sectors. Assuming the data frame has another variable, industry2,
>> > this would look like:
>> >
>> >     w <- aov(wages ~ industry2/industry, d)
>> >
>> > But what if we either (a) don't want to bother creating separate
>> > variables for each level of aggregation in industry or (b) want to
>> > extend the model formula language to include various nesting
>> > strategies? This might look like:
>> >
>> >     w <- aov(wages ~ industry//*)
>> >         # Nest all meaningful levels:
>> >         # industry/industry2/industry3/industry4/industry5/industry6.
>> >         # If the coding system skips some levels, R is smart enough
>> >         # to omit the skipped levels.
>> >     w <- aov(wages ~ industry//levels 2,4,6)
>> >         # I'm using "//" as a hypothetical extension to the model
>> >         # language that is followed by a "levels" keyword and then a
>> >         # list of levels within the hierarchy. This example would
>> >         # expand to aov(wages ~ industry2/industry4/industry6)
>> >
>> > One could extend this last example to include a notation allowing
>> > the analysis to be repeated at varying levels of depth (e.g.,
>> > industry||2,6 would repeat the ANOVA for industry2 and industry6).
>> >
>>
> I can see how that might be useful. But it is easy enough to split the
> variables out, for example (assuming that each level consists of two
> digits):
>
> d$ind1 <- substr(rownames(d), 1, 2)
> d$ind2 <- substr(rownames(d), 3, 4)
> d$ind3 <- substr(rownames(d), 5, 6)
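
One caveat with cutting the code into separate two-digit pieces: digits 3-4 alone do not identify a group unless they are read together with digits 1-2. A possible variation, sketched here on invented data, keeps cumulative prefixes so that each column is a complete label for its level. The column names industry2, industry4 and industry6 follow the naming used earlier in this thread; the wage figures and all codes except 441222 are made up.

```r
# Toy data: row names are 6-digit codes; wages are invented
d <- data.frame(wages     = c(500, 650, 700, 900),
                row.names = c("441222", "441110", "442110", "423110"))

code <- rownames(d)

# Cumulative prefixes: each factor fully identifies its level of the hierarchy
d$industry2 <- factor(substr(code, 1, 2))  # 2-digit sector
d$industry4 <- factor(substr(code, 1, 4))  # 4-digit industry group
d$industry6 <- factor(code)                # full 6-digit code

sector_levels <- levels(d$industry2)
```

With real data these columns could go straight into a nested formula such as wages ~ industry2/industry4/industry6; the toy data here has too few rows for that to be meaningful.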
>
>
>
>> > Since the factor hierarchy is completely nested (i.e., every 6-digit
>> > industry is below a 5-digit industry), a single function can operate
>> > on the codes recursively. Three variants come to mind. In the first,
>> > we'd use some kind of apply function to drill down to a certain level
>> > and return a list of results, one for each level:
>> >
>> >     means <- drill(wages, industry, mean)
>> >         # Would return a list. The first component would be a vector
>> >         # of mean wages for industries at the 2-digit level, the
>> >         # second, a vector for the 3-digit level, etc.
>> >     means <- drill(wages, industry, mean, maxlvl=3)
>> >         # Would stop at the 3rd level of the hierarchy (4-digit code).
>> >         # One could also imagine a maxdigits option as an alternative
>> >         # (maxdigits = y means stop at the y-digit level)
>> >
>>
> Again, I can see how this would be useful, but it's already pretty
> easy (once we have split out the grouping variables) to do something
> like
>
> grp.means<- list(
> l1 = aggregate(d$wages, list(d$ind1), mean),
> l2 = aggregate(d$wages, list(d$ind2), mean),
> l3 = aggregate(d$wages, list(d$ind3), mean)
> )
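
For comparison, here is one way the drill() idea from earlier in the thread might be sketched with base R alone. This is only an illustration under stated assumptions: drill() does not exist in R or in any package mentioned here, the digits argument is an invented stand-in for the maxlvl/maxdigits options, and the data are made up. It groups by cumulative code prefixes, so no extra columns are needed:

```r
# Hypothetical drill(): apply `fun` to `x` within each code prefix,
# once per requested prefix length, and return a named list of results
drill <- function(x, codes, fun, digits = 2:6) {
  codes <- as.character(codes)
  out <- lapply(digits, function(n) tapply(x, substr(codes, 1, n), fun))
  names(out) <- paste0("digits", digits)
  out
}

# Invented example data
wages    <- c(500, 650, 700, 900)
industry <- c("441222", "441110", "442110", "423110")

# Mean wages at the 2-digit and 3-digit levels
means <- drill(wages, industry, mean, digits = c(2, 3))
```

Each list component is the tapply() result for one level, indexed by code prefix, so means$digits3["441"] picks out a single group.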
>
> I know this wasn't what you were looking for (as I said, I'm not aware
> of any package that implements the functionality you describe). But
> the existing facilities in R are quite flexible, and handling this
> kind of data in R is already fairly straightforward.
>
> Best,
> Ista
>
>
>> > Second, suppose we have a data frame like d, only this time it's a
>> > time series (each row is a different date). Now we might want to
>> > generate vectors of the rate of change in employment at each industry
>> > level. It might look like:
>> >
>> >     rate <- function(x) { (x - lag(x)) / lag(x) }
>> >     rates <- list()
>> >     i <- 1
>> >     for (j in levels(industry)) {
>> >         # The levels function parses the hierarchical factor into
>> >         # the various levels of its coding system
>> >         rates[[i]] <- rate(employment[, level(industry) == j])
>> >         # The level function sets a particular one of these levels
>> >         i <- i + 1
>> >     }
>> >
>> > A third variant would be a genuinely recursive function that keeps
>> > on calling itself at each level of the factor until it has either
>> > reached a pre-specified depth or exhausted all levels of the factor.
>> >
>> > I hope this gives you a good idea of the sorts of things one might
>> > do with hierarchical factors.
>> >
>> >     Marsh Feldman
>> >
>> >
>> >
>> > On 5/3/2010 9:57 AM, Ista Zahn wrote:
>> >
>> > Hi Marshall,
>> > What exactly do you mean by "handles this kind of data structure"?
>> > What do you want R to do?
>> >
>> > Best,
>> > Ista
>> >
>> > On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman <marsh@uri.edu> wrote:
>> >
>> >
>> > Hello,
>> >
>> > Hierarchical factors are a very common data structure. For instance,
>> > one might have municipalities within states within countries within
>> > continents. Other examples include occupational codes, biological
>> > species, software types (R within statistical software within
>> > analytical software), etc.
>> >
>> > Such data structures commonly use hierarchical coding systems. For
>> > example, the 2007 North American Industry Classification System (NAICS)
>> > <http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007> has
>> > twenty two-digit codes (e.g., 42 = Wholesale trade), within each of
>> > these varying numbers of 3-digit codes (e.g., 423 = Merchant
>> > wholesalers, durable goods), then varying numbers of 4-digit codes
>> > (4231 = Motor Vehicle and Motor Vehicle Parts and Supplies Merchant
>> > Wholesalers), then varying numbers of five-digit codes, varying
>> > numbers of six-digit codes, etc. At the lowest level (longest code)
>> > one can readily tell all the higher levels. For example, 441222 is
>> > "Boat Dealers" who are part of 44122, "Motorcycle, Boat, and Other
>> > Motor Vehicle Dealers," which is part of 4412 (Other Motor Vehicle
>> > Dealers), which is part of 441 (Motor Vehicle and Parts Dealers),
>> > which is part of 44 (Retail Trade). (The US Census Bureau has extended
>> > the 6-digit NAICS to an even more fine-grained 10-digit system.)
>> >
>> > I haven't seen any R packages or sample code that handles this kind
>> > of data, but I don't want to reinvent the wheel and would rather
>> > stand on the shoulders of you giants. Is there any package or other
>> > R-based software out there that handles this kind of data structure?
>> >
>> >     Thanks,
>> >     Marsh Feldman
>> >
>> >
>> >
>> >
>> >
>> >
>> >     [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Dr. Marshall Feldman, PhD
>> > Director of Research and Academic Affairs
>> > Center for Urban Studies and Research
>> > The University of Rhode Island
>> > email: marsh @ uri .edu (remove spaces)
>> >
>> > Contact Information:
>> >
>> > Kingston:
>> >
>> > 202 Hart House
>> > Charles T. Schmidt Labor Research Center
>> > The University of Rhode Island
>> > 36 Upper College Road
>> > Kingston, RI 02881-0815
>> > tel. (401) 874-5953:
>> > fax: (401) 874-5511
>> >
>> > Providence:
>> >
>> > 206E Shepard Building
>> > URI Feinstein Providence Campus
>> > 80 Washington Street
>> > Providence, RI 02903-1819
>> > tel. (401) 277-5218
>> > fax: (401) 277-5464
>>
>
> --
> Ista Zahn
> Graduate student
> University of Rochester
> Department of Clinical and Social Psychology
> http://yourpsyche.org
>
--
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs
Center for Urban Studies and Research
The University of Rhode Island
email: marsh @ uri .edu (remove spaces)