Marshall Feldman
2010-Mar-02 16:12 UTC
[R] Reading data file with both fixed and tab-delimited fields
Hello R wizards,

What is the best way to read a data file containing both fixed-width and tab-delimited fields? (More detail follows.)

Details:
The U.S. Bureau of Labor Statistics provides local area unemployment statistics at ftp://ftp.bls.gov/pub/time.series/la/, and the data are documented in the file la.txt <ftp://ftp.bls.gov/pub/time.series/la/la.txt>. Each data file has five tab-delimited fields:

* series_id
* year
* period (codes for things like quarter or month of year)
* value
* footnote_codes

The series_id consists of five fixed-width subfields (length in parentheses):

* survey abbreviation (2)
* seasonal code (1)
* area type code (2)
* area code (6)
* measure code (2)

So an example record might be:

LASPS36040003 1990 M01 8.8 L

I want to read in the data in one pass and convert them to a data frame with the following columns (actual name, class in parentheses):

Survey abbreviation (survey, character)
Seasonal (seasonal, logical, seasonal=T)
Area type (area_type_code, factor)
Area (area_code, factor)
Measure (measure_code, factor)
Year (year, Date)
Period (period, factor)
Value (value, numeric)
Footnote (footnote_codes, character but see note)

(Regarding the Footnote, I have to look at the data more. If there's just one code per record, this will be a factor; if there are multiple, it will be either character or a list. For now I'm making it character only.)

Currently I can read the data just fine using read.table, but this makes series_id the first variable. I want to break out the subfields as separate columns.

Any suggestions?

Thanks.
Marsh Feldman
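To make the layout concrete, here is a small sketch (not part of the original post) that splits the example series_id into its five subfields using the widths listed above. The variable names `widths`, `starts`, `ends`, and `parts` are made up for illustration.

```r
# Widths of the five series_id subfields, as listed in the post.
widths <- c(survey = 2, seasonal = 1, area_type_code = 2,
            area_code = 6, measure_code = 2)

# Start and end character positions of each subfield within the id.
ends   <- cumsum(widths)
starts <- ends - widths + 1

# First field of the example record from the post.
series_id <- "LASPS36040003"

# substring() is vectorised over its start/end arguments, so a single
# call returns all five pieces.
parts <- substring(series_id, starts, ends)
names(parts) <- names(widths)
print(parts)
```

Because the positions are derived from the widths, changing one width automatically shifts the fields that follow it.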
Chidambaram Annamalai
2010-Mar-02 17:29 UTC
[R] Reading data file with both fixed and tab-delimited fields
I tried to shoehorn the read.* functions into matching both the fixed-width and the variable-width fields in the data, but no obvious way presented itself. (read.fwf reads fixed-width data properly, but the rest of the fields must be processed separately. Maybe insert NULL stubs in the remaining fields and fill them in later?)

One way is to sidestep the entire issue and convert the structured data you have into a CSV file using sed (usually available on most *nix systems), with something like:

    cat data | sed -r 's/^(..)(.)(..)(.{6})(..)[ \t]*([^ \t]*)[ \t]*([^ \t]*)[ \t]*([^ \t]*)[ \t]*([^ \t]*)/\1,\2,\3,\4,\5,\6,\7,\8,\9/' | less

The first five groups capture the fixed-width subfields of series_id; the remaining four capture year, period, value and footnote_codes. Check that the output looks right, then redirect it to a file in place of less and read the resulting .csv file directly in R with read.csv.

If that does not satisfy you, maybe the R wizards on the list can point you to a native R way of doing this, possibly using scan? I'm not sure, though.

Hope this helps,
Chillu
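For a native R route, one option is to read the tab-delimited fields as usual and then cut series_id apart with substr(). The following is only a sketch under several assumptions: the file starts with a header row naming the five fields, a seasonal code of "S" means seasonally adjusted, and year is kept as an integer rather than the Date class asked for. The second sample record and the names `lines_in`, `raw`, `id`, and `dat` are invented for illustration.

```r
# Sample input in the layout described in the thread. The header row and
# the second record are assumptions made for this example. To read a real
# file, pass its path as the first argument instead of `text = lines_in`.
lines_in <- c("series_id\tyear\tperiod\tvalue\tfootnote_codes",
              "LASPS36040003\t1990\tM01\t8.8\tL",
              "LAUPS36040003\t1990\tM02\t8.6\t")

# Read every field as character so that nothing is converted prematurely;
# strip.white removes any space padding around the tab-separated fields.
raw <- read.delim(text = lines_in, colClasses = "character",
                  strip.white = TRUE)

# Break series_id into its fixed-width subfields and set the column classes.
id <- raw$series_id
dat <- data.frame(
  survey         = substr(id, 1, 2),
  seasonal       = substr(id, 3, 3) == "S",   # assumes "S" = seasonally adjusted
  area_type_code = factor(substr(id, 4, 5)),
  area_code      = factor(substr(id, 6, 11)),
  measure_code   = factor(substr(id, 12, 13)),
  year           = as.integer(raw$year),      # integer here, not Date
  period         = factor(raw$period),
  value          = as.numeric(raw$value),
  footnote_codes = raw$footnote_codes,
  stringsAsFactors = FALSE
)
str(dat)
```

This takes one read plus one vectorised pass over series_id, and it avoids an intermediate CSV file. Whether it is fast enough for the full BLS files has not been checked here.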