Marshall Feldman
2010-Mar-02 16:12 UTC
[R] Reading data file with both fixed and tab-delimited fields
Hello R wizards,
What is the best way to read a data file containing both fixed-width and
tab-delimited fields? (More detail follows.)
Details:
The U.S. Bureau of Labor Statistics provides local area unemployment
statistics at ftp://ftp.bls.gov/pub/time.series/la/, and the data are
documented in the file la.txt
<ftp://ftp.bls.gov/pub/time.series/la/la.txt>. Each data file has five
tab-delimited fields:
* series_id
* year
* period (codes for things like quarter or month of year)
* value
* footnote_codes
The series_id consists of five fixed-width subfields (length in
parentheses):
* survey abbreviation (2)
* seasonal code (1)
* area type code (2)
* area code (6)
* measure code (2)
So an example record might be:
LASPS36040003 1990 M01 8.8 L
I want to read in the data in one pass and convert them to a data frame with the
following columns (actual name, class in parentheses):
Survey abbreviation (survey, character)
Seasonal (seasonal, logical seasonal=T)
Area type (area_type_code, factor)
Area (area_code, factor)
Measure (measure_code, factor)
Year (year, Date)
Period (period, factor)
Value (value, numeric)
Footnote (footnote_codes, character but see note)
(Regarding the Footnote, I have to look at the data more. If there's
just one code per record, this will be a factor; if there are multiple,
it will be either character or a list. For now I'm making it
character only.)
Currently I can read the data just fine using read.table, but this makes
series_id the first variable. I want to break out the subfields as
separate columns.
Any suggestions?
Thanks.
Marsh Feldman
Chidambaram Annamalai
2010-Mar-02 17:29 UTC
[R] Reading data file with both fixed and tab-delimited fields
I tried to make the read.* functions handle both the fixed-width and the
tab-delimited fields in one go, but I don't see an obvious way to do it.
(read.fwf reads the fixed-width part properly, but the rest of the fields
would have to be processed separately. Maybe insert NULL stubs for the
remaining fields and fill them in later?)
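
A rough sketch of that "process the two parts separately" idea, in case it
helps: read the tab-delimited fields with read.table, then run read.fwf over
the series_id column alone. The sample lines, the second record, and the
object names (raw, ids, result) are made up for illustration, and
header = FALSE is an assumption about the input.

```r
# Two sample records in the layout described in the question (tabs between
# fields). The second record is invented for illustration.
lines <- c("LASPS36040003\t1990\tM01\t8.8\tL",
           "LAUPS36040003\t1990\tM02\t8.5\t")

# Step 1: read the tab-delimited fields, keeping everything as character.
raw <- read.table(text = lines, sep = "\t", header = FALSE, quote = "",
                  strip.white = TRUE, colClasses = "character",
                  col.names = c("series_id", "year", "period",
                                "value", "footnote_codes"))

# Step 2: split series_id into its fixed-width subfields (2, 1, 2, 6, 2).
tc <- textConnection(raw$series_id)
ids <- read.fwf(tc, widths = c(2, 1, 2, 6, 2), colClasses = "character",
                col.names = c("survey", "seasonal", "area_type_code",
                              "area_code", "measure_code"))
close(tc)

# Step 3: put the subfields in front of the remaining columns.
result <- cbind(ids, raw[-1])
print(result)
```

Reading everything as character first keeps codes such as "03" from losing
their leading zero; the columns can be converted to factor or numeric
afterwards.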
One way is to sidestep the issue entirely and convert the data into a
CSV file using sed (available on most *nix systems), with something like
this:
sed -r 's/^(..)(.)(..)(.{6})(..)[ \t]*([^ \t]*)[ \t]*([^ \t]*)[ \t]*([^ \t]*)[ \t]*([^ \t]*)/\1,\2,\3,\4,\5,\6,\7,\8,\9/' data | less
Check that the output looks right, then save it to a .csv file and read
that file directly in R using read.csv.
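
On the R side, reading the converted file might look roughly like this. The
sketch uses text = with the one line sed should produce for the example
record, so it runs on its own; with a real file you would pass the file name
instead. The column names follow the question, and the colClasses choices
(integer for year, in particular) are my assumption.

```r
# The line the sed command should emit for the example record.
csv_line <- "LA,S,PS,360400,03,1990,M01,8.8,L"

bls <- read.csv(text = csv_line, header = FALSE,
                col.names = c("survey", "seasonal", "area_type_code",
                              "area_code", "measure_code", "year",
                              "period", "value", "footnote_codes"),
                # Explicit classes stop R from guessing: the codes stay
                # text-like instead of turning into numbers.
                colClasses = c("character", "character", "factor",
                               "factor", "factor", "integer",
                               "factor", "numeric", "character"))
str(bls)
```

Giving colClasses explicitly matters here: left to guess, read.csv would read
the measure code "03" as the number 3.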
If that does not satisfy you, maybe the R wizards on the list can point
you to a native R way of doing this, possibly using scan? I'm not sure,
though.
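
For what it's worth, here is an untested-on-real-data sketch of what a scan
based version could look like: scan reads the five tab-delimited fields in
one pass, and substring cuts series_id at the stated widths. The sample
lines are invented, TRUE for seasonal code "S" is my reading of
"seasonal=T", and mapping year to January 1 of that year is only one
possible way to get a Date.

```r
# Two sample records (the second is invented); a real call would use
# file = "..." instead of text =.
lines <- c("LASPS36040003\t1990\tM01\t8.8\tL",
           "LAUPS36040003\t1990\tM02\t8.5\t")

# One pass over the input: 'what' gives the name and type of each field.
fields <- scan(text = lines, sep = "\t", quote = "", strip.white = TRUE,
               quiet = TRUE,
               what = list(series_id = "", year = 0L, period = "",
                           value = 0, footnote_codes = ""))

# Start and end positions of the subfields with widths 2, 1, 2, 6, 2.
starts <- c(1, 3, 4, 6, 12)
ends   <- c(2, 3, 5, 11, 13)
part <- function(i) substring(fields$series_id, starts[i], ends[i])

bls <- data.frame(
  survey         = part(1),
  seasonal       = part(2) == "S",   # assumption: "S" means seasonal = TRUE
  area_type_code = factor(part(3)),
  area_code      = factor(part(4)),
  measure_code   = factor(part(5)),
  year           = as.Date(paste0(fields$year, "-01-01")),  # assumption
  period         = factor(fields$period),
  value          = fields$value,
  footnote_codes = fields$footnote_codes,
  stringsAsFactors = FALSE
)
str(bls)
```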
Hope this helps,
Chillu