thr3ads.net - R help - [R] Using factor variables with overlapping categories [Nov 2012]


andrewH

2012-Nov-27 23:40 UTC

[R] Using factor variables with overlapping categories

Dear folks,

I have a question, though it is more of a logic- or a good
practices-question than a programming question per se. I am working with
data from the American Community Survey summary file. It is mainly
categorical count data. Currently I am working with about 40 tables covering
about 35 variables, mainly in two-way tables, with some 3-way and a handful
of four-way tables. I am going to be doing a lot of analysis on these
tables, and hope to make them available in zipped format to other R users.
Right now I am keeping this data in single-state data frames, but I will
probably have to shift over to a database if I add many more variables.

Here is my problem: of my 35 variables, five of them are different versions
of age. Different tables cover different age ranges, and have different
levels of disaggregation for the age ranges they cover.

Currently I just have a factor for each with the cut-points in the labels.
But I feel uncomfortable with this. It seems to throw away a lot of
information. There is a "natural" mapping from the different age ranges to
one another, at least within universes (e.g. individuals vs. heads of
household), and my current approach does not encode that mapping in any way
that R can notice (unless I write special functions that read the labels).
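
One way to let R "notice" the mapping is to recover the numeric cut-points from the labels. The following is only a sketch: it assumes labels in the "(0,10]" style produced by cut(), and label_bounds is a hypothetical helper name, not an existing function. Real ACS labels would need their own parser.

```r
# Hypothetical helper: pull the numeric cut-points out of cut()-style
# labels such as "(0,10]" so the intervals can be compared directly.
label_bounds <- function(f) {
  lab <- levels(f)
  # Strip the brackets, then split each label on the comma
  parts <- strsplit(gsub("\\(|\\)|\\[|\\]", "", lab), ",")
  data.frame(
    label = lab,
    lower = as.numeric(sapply(parts, `[`, 1)),
    upper = as.numeric(sapply(parts, `[`, 2)),
    stringsAsFactors = FALSE
  )
}

# Made-up example factors with two levels of disaggregation
fine   <- cut(c(5, 15, 25, 35), breaks = seq(0, 40, 10))
coarse <- cut(c(5, 15, 25, 35), breaks = seq(0, 40, 20))

b_fine   <- label_bounds(fine)
b_coarse <- label_bounds(coarse)

# The fine scheme nests in the coarse one when every coarse
# cut-point is also a fine cut-point
nests <- all(c(b_coarse$lower, b_coarse$upper) %in%
             c(b_fine$lower, b_fine$upper))
```

With the bounds in numeric form, the relationship between any two age schemes can be tested without a hand-written lookup for each pair.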

One of the first things I am doing with this data is using all the
cross-tabs to produce some basic estimates of higher-dimensional tabulations
(some 10-way tables covering age, race, sex, rent/own, income, etc.)
that are consistent with all the lower-dimensional margins, using a
multi-dimensional analogue of the RAS balancing (biproportional matrix
balancing) algorithm often used to update Leontief input-output tables.
Right now the approach I am using is to sum the age variables into four
categories that let me use four of my five age variables, and throw the fifth
(which has inconsistent breakpoints and is used in only one table) away. But
this seems wasteful to me, not only of one table, but of a lot of
information on finer age sub-structure which is shared by two or more
tables.
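
For readers unfamiliar with RAS balancing, here is a minimal two-way sketch. It is not the multi-dimensional version described above; the function name ras, the uniform seed and the target margins are all made up for illustration, and it assumes a strictly positive seed table.

```r
# Minimal two-way RAS (iterative proportional fitting) sketch.
# seed: positive starting table; row_target / col_target: margins to match.
ras <- function(seed, row_target, col_target, tol = 1e-8, max_iter = 1000) {
  stopifnot(isTRUE(all.equal(sum(row_target), sum(col_target))))
  x <- seed
  for (i in seq_len(max_iter)) {
    x <- x * (row_target / rowSums(x))              # rescale rows
    x <- sweep(x, 2, col_target / colSums(x), `*`)  # rescale columns
    # Columns are exact after the last step, so only rows need checking
    if (max(abs(rowSums(x) - row_target)) < tol) break
  }
  x
}

seed <- matrix(1, nrow = 2, ncol = 3)   # made-up uniform seed
fit  <- ras(seed, row_target = c(40, 60), col_target = c(20, 30, 50))
```

The multi-way case cycles through each available margin in the same manner. Base R's loglin() fits multi-way tables by iterative proportional fitting and may be worth a look before hand-rolling the loop.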

I am guessing that this is a fairly common problem in dealing with large
data sets of count objects. Is there a "standard" approach to it, or a set
of commonly used approaches, that anyone could suggest or point me to? I'd
be happy with either coding suggestions or pointers to the methodology
literature if there is one.

Any help or suggestions would be greatly appreciated. Thanks!

andrewH

--
View this message in context:
http://r.789695.n4.nabble.com/Using-factor-variables-with-overlapping-categories-tp4651054.html
Sent from the R help mailing list archive at Nabble.com.

Jean V Adams

2012-Nov-28 14:47 UTC

[R] Using factor variables with overlapping categories

Andrew,

Interesting issue.  My tack would be to define an age key that 
incorporates all of the different cut-points that are used in your data 
tables.  Then, with the use of some simple functions, you can test which 
factors are "nested" within other factors, and you can broaden those 
categories as needed.  For example:

# define a key to the age categories
age <- 1:30
agef1 <- cut(age, breaks=seq(0, 120, 10))
agef2 <- cut(age, breaks=seq(0, 120, 20))
agef3 <- cut(age, breaks=seq(0, 120, 25))
age.key <- data.frame(age, agef1, agef2, agef3)
rm(age, agef1, agef2, agef3)

# some raw data
mydat <- structure(list(agef1 = structure(c(3L, 2L, 1L, 2L, 2L, 2L, 3L,
        2L, 1L, 3L, 1L, 2L, 2L, 3L, 2L, 3L, 1L, 3L, 3L, 1L),
        .Label = c("(0,10]", "(10,20]", "(20,30]", "(30,40]",
        "(40,50]", "(50,60]", "(60,70]", "(70,80]", "(80,90]",
        "(90,100]", "(100,110]", "(110,120]"), class = "factor"),
        x = c(92, 49, 17, 77, 4, 76, 70, 30, 37, 2, 66, 72, 32, 66,
        1, 57, 28, 6, 1, 21)), .Names = c("agef1", "x"),
        class = "data.frame", row.names = c("1", "2", "3", "4", "5",
        "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16",
        "17", "18", "19", "20"))

# a function to test if one factor is "nested" in another
# that is, is each value of f1 associated with only one unique value in f2?
is.nested <- function(f1, f2) {
        t12 <- table(f1, f2)
        max(apply(t12>0, 1, sum))==1
        }

# a function to broaden the categories of x from those used in f1 to those
# used in f2
broaden <- function(x, f1, f2) {
        if(!is.nested(f1, f2)) stop("f1 not nested within f2")
        f2[match(x, f1)]
        }

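# 10-year bands nest in 20-year bands, but not in 25-year bands
is.nested(age.key$agef1, age.key$agef2)   # TRUE
is.nested(age.key$agef1, age.key$agef3)   # FALSE: (20,30] straddles 25

broaden(mydat$agef1, age.key$agef1, age.key$agef2)
# the next call stops with "f1 not nested within f2"
broaden(mydat$agef1, age.key$agef1, age.key$agef3)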
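
Once categories can be broadened this way, counts can be collapsed onto the coarser scheme. The block below is a self-contained sketch of that step, using made-up data (key and counts are illustrative names, not objects from the code above):

```r
# Self-contained sketch; 'key' and 'counts' are made-up illustrative data
age <- 1:30
key <- data.frame(fine   = cut(age, breaks = seq(0, 120, 10)),
                  coarse = cut(age, breaks = seq(0, 120, 20)))
counts <- data.frame(fine = factor(c("(0,10]", "(10,20]", "(20,30]"),
                                   levels = levels(key$fine)),
                     x = c(5, 7, 11))

# Map each fine category to its coarse parent, then sum the counts
counts$coarse <- droplevels(key$coarse[match(counts$fine, key$fine)])
agg <- tapply(counts$x, counts$coarse, sum)
```

Here agg holds one total per coarse age band, which is the form needed to compare a finely disaggregated table against the margin of a coarser one.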

Jean



