Hello, all
I'm trying to process the names of the variables in the US Census
database, which I'm retrieving with tidycensus. My end goal is to produce
nicely formatted tables with natural labels.
The labels as downloaded from the US Census look like this:
library(tidycensus)
library(dplyr)

## Get the P1 table for block group 3 in census tract 2711.01:
bg3_race <- get_decennial(
  geography = "block group",
  state = "MD",
  county = "Baltimore city",
  table = "P1",
  cache_table = TRUE,
  year = "2020",
  sumfile = "pl") %>%
  ## GEOID characters 6-12 are the tract (271101) plus the block group (3)
  filter(substr(GEOID, 6, 12) == "2711013")
## Load the names and labels of the variables:
pl_vars <- load_variables(year = "2020", dataset = "pl",
                          cache = TRUE)

## Join the labels to the variables, and drop the zero counts
bg3_race_sum <- bg3_race %>%
  left_join(pl_vars, by = c("variable" = "name")) %>%
  filter(value > 0) %>%
  select(GEOID, value, label)

head(bg3_race_sum$label)
[1] " !!Total:"
[2] " !!Total:!!Population of one race:"
[3] " !!Total:!!Population of one race:!!White alone"
[4] " !!Total:!!Population of one race:!!Black or African American alone"
[5] " !!Total:!!Population of one race:!!American Indian and Alaska Native alone"
[6] " !!Total:!!Population of one race:!!Asian alone"
I think my algorithm for the labels is:
1. keep everything after the last "!!", up to and including the last
character
2. in what remains, replace each "!!...:" group with a single space.
This would turn the head() output into:
"Total:"
" Population of one race:"
"  White alone"
"  Black or African American alone"
"  American Indian and Alaska Native alone"
"  Asian alone"
[the indentation may not be clearly visible if not rendered in a monospaced font]
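The two steps above could be sketched in base R without any explicit loop, since sub(), gregexpr() and strrep() are all vectorized. This is only a sketch under the assumption that every label has the shape shown above (a leading space, then "!!"-separated levels); the helper name indent_label is hypothetical and not part of tidycensus:

```r
# Sample labels copied from the head() output above.
labels <- c(
  " !!Total:",
  " !!Total:!!Population of one race:",
  " !!Total:!!Population of one race:!!White alone"
)

# Hypothetical helper: indent each label by its depth in the "!!" hierarchy.
indent_label <- function(label) {
  # Step 1: keep only the text after the last "!!"
  # (the greedy ".*" runs up to the final "!!").
  leaf <- sub("^.*!!", "", label)
  # Count the "!!" separators in each label; the number of parent
  # levels is one less than that (never below zero).
  n_sep <- lengths(regmatches(label, gregexpr("!!", label, fixed = TRUE)))
  depth <- pmax(n_sep - 1L, 0L)
  # Step 2: one leading space per parent level.
  paste0(strrep(" ", depth), leaf)
}

indent_label(labels)
```

Because the helper takes and returns a character vector, it could presumably be dropped into the pipeline as mutate(label = indent_label(label)).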
I think that I need lapply here, but I'm not sure, and I don't know what
to do next. I can split each label using str_split(label, pattern = "!!")
to get a vector of strings, but I don't know how to work on the last
string separately from all the rest of the strings.
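For the split-based route, one possible sketch is to split a single label, take the last piece as the text to keep, and use the count of the remaining pieces only to decide how many spaces to prepend. It uses base strsplit() with fixed = TRUE in place of str_split() so that it runs without stringr, and vapply() rather than lapply() so the result is a plain character vector. The name indent_one is hypothetical, and the code assumes non-empty labels in the format shown above:

```r
# Sample labels copied from the head() output above.
labels <- c(
  " !!Total:",
  " !!Total:!!Population of one race:",
  " !!Total:!!Population of one race:!!Asian alone"
)

# Hypothetical helper that handles ONE label at a time.
indent_one <- function(label) {
  # Split on the literal "!!" separator; [[1]] pulls the pieces
  # for this single label out of the list strsplit() returns.
  parts <- strsplit(label, "!!", fixed = TRUE)[[1]]
  # The last piece is the text to keep.
  leaf <- parts[length(parts)]
  # parts[1] is whatever precedes the first "!!" (a single space here),
  # so the number of parent levels is length(parts) - 2.
  n_parents <- max(length(parts) - 2L, 0L)
  paste0(strrep(" ", n_parents), leaf)
}

# Apply the helper to every label; character(1) states that each
# call must return exactly one string.
indented <- vapply(labels, indent_one, character(1), USE.NAMES = FALSE)
indented
```

The design choice here is that the "rest of the strings" never need to be edited at all: only their count matters, which avoids writing a regular expression for the parent levels.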
Thank you for any suggestions to nudge me along towards a workable solution.
-Kevin