thr3ads.net - R help - [R] Reading data file with both fixed and tab-delimited fields [Mar 2010]


Marshall Feldman

2010-Mar-02 16:12 UTC

[R] Reading data file with both fixed and tab-delimited fields

Hello R wizards,

What is the best way to read a data file containing both fixed-width and 
tab-delimited fields? (More detail follows.)

Details:
The U.S. Bureau of Labor Statistics provides local area unemployment 
statistics at ftp://ftp.bls.gov/pub/time.series/la/, and the data are 
documented in the file la.txt 
<ftp://ftp.bls.gov/pub/time.series/la/la.txt>. Each data file has five 
tab-delimited fields:

    * series_id
    * year
    * period (codes for things like quarter or month of year)
    * value
    * footnote_codes

The series_id consists of five fixed-width subfields (length in 
parentheses):

    * survey abbreviation (2)
    * seasonal code (1)
    * area type code (2)
    * area code (6)
    * measure code (2)

So an example record might be:

LASPS36040003	1990	M01	8.8	L

I want to read in the data in one pass and convert them to a data frame with the
following columns (actual name, class in parentheses):

    Survey abbreviation (survey, character)
    Seasonal (seasonal, logical seasonal=T)
    Area type (area_type_code, factor)
    Area (area_code, factor)
    Measure (measure_code, factor)
    Year (year, Date)
    Period (period, factor)
    Value (value, numeric)
    Footnote (footnote_codes, character but see note)

(Regarding the Footnote, I have to look at the data more. If there's 
just one code per record, this will be a factor; if there are multiple, 
it will either be character or a list. For now I'm making it only 
character.)

Currently I can read the data just fine using read.table, but this makes 
series_id the first variable. I want to break out the subfields as 
separate columns.

Any suggestions?

Thanks.
     Marsh Feldman





Chidambaram Annamalai

2010-Mar-02 17:29 UTC


I tried to shoehorn the read.* functions into matching both the
fixed-width and the variable-width fields in the data, but I could not
find an obvious way to do it. (read.fwf reads the fixed-width data
properly, but the rest of the fields must be processed separately --
maybe insert NULL stubs in the remaining fields and fill them in later?)
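
For what it's worth, a minimal sketch of the "process separately" idea: read the tab-delimited fields with read.delim, then run read.fwf over the series_id column alone. The second record is made up for illustration, the column names are the ones proposed in the original post, and treating seasonal code "S" as TRUE is an assumption about the coding. Year is left as an integer here rather than converted to a Date.

```r
# Example record from the post plus one made-up record (illustrative only).
lines <- c("LASPS36040003\t1990\tM01\t8.8\tL",
           "LAUPS36040003\t1990\tM02\t8.6\t")

# Step 1: read the five tab-delimited fields, keeping series_id as text.
raw <- read.delim(text = lines, header = FALSE, strip.white = TRUE,
                  col.names = c("series_id", "year", "period",
                                "value", "footnote_codes"),
                  colClasses = c("character", "integer", "factor",
                                 "numeric", "character"))

# Step 2: split series_id into its fixed-width subfields with read.fwf.
con <- textConnection(raw$series_id)
ids <- read.fwf(con, widths = c(2, 1, 2, 6, 2),
                col.names = c("survey", "seasonal", "area_type_code",
                              "area_code", "measure_code"),
                colClasses = c("character", "character", "factor",
                               "factor", "factor"))
close(con)

# Assumption: seasonal code "S" means seasonally adjusted.
ids$seasonal <- ids$seasonal == "S"

# Step 3: put the subfields in front of the remaining columns.
dat <- cbind(ids, raw[-1])
str(dat)
```

This takes two passes over series_id rather than one, but each step uses a reader suited to its part of the record.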

One way is to sidestep the entire issue and convert the structured data
you have into a csv file using sed (usually available on most *nix
systems), with something like:

sed -r 's/^(..)(.)(..)(.{6})(..)[ \t]*([^ \t]*)[ \t]*([^ \t]*)[ \t]*([^ \t]*)[ \t]*([^ \t]*)/\1,\2,\3,\4,\5,\6,\7,\8,\9/' data | less

See if the output looks all right, then save it as a .csv file and read
that file directly into R using read.csv.
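
One caveat when reading such a file: by default read.csv guesses column types, so a measure code like 03 would come back as the number 3. A minimal sketch of pinning the types with colClasses, assuming the sed output has the nine comma-separated fields above and no header row. The column names are the ones proposed in the original post, and treating "S" as seasonal = TRUE is an assumption.

```r
# One line of the kind the sed command would produce for the example record.
csv_lines <- c("LA,S,PS,360400,03,1990,M01,8.8,L")

dat <- read.csv(text = csv_lines, header = FALSE,
                col.names = c("survey", "seasonal", "area_type_code",
                              "area_code", "measure_code", "year",
                              "period", "value", "footnote_codes"),
                # Explicit classes keep codes such as "03" from becoming 3.
                colClasses = c("character", "character", "factor",
                               "factor", "factor", "integer",
                               "factor", "numeric", "character"))

# Assumption: seasonal code "S" means seasonally adjusted.
dat$seasonal <- dat$seasonal == "S"
str(dat)
```

With a real file, the text = argument would be replaced by the file name.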

If that does not satisfy you, maybe the R wizards on the list can point
you to a native R way of doing this, possibly using scan? I'm not sure
though.
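
A rough sketch of what a scan-based version might look like, offered as one possibility rather than the recommended way: scan splits each record on tabs, and substr then cuts series_id at positions derived from the subfield widths given in the original post. The column names come from that post; the "S" means TRUE coding for seasonal is an assumption.

```r
# Example record from the original post; real input would be a file name
# passed as scan's first argument instead of text =.
lines <- c("LASPS36040003\t1990\tM01\t8.8\tL")

# Read the five tab-delimited fields into a named list of vectors.
fields <- scan(text = lines, sep = "\t", quiet = TRUE, strip.white = TRUE,
               what = list(series_id = "", year = 0L, period = "",
                           value = 0, footnote_codes = ""))

# Subfield widths as listed in the post; start/end positions follow from them.
widths <- c(survey = 2, seasonal = 1, area_type_code = 2,
            area_code = 6, measure_code = 2)
ends   <- cumsum(widths)
starts <- ends - widths + 1

# Cut series_id into its subfields, one character vector per subfield.
parts <- Map(function(s, e) substr(fields$series_id, s, e), starts, ends)
names(parts) <- names(widths)

dat <- data.frame(parts, fields[-1], stringsAsFactors = FALSE)

# Assumption: seasonal code "S" means seasonally adjusted.
dat$seasonal <- dat$seasonal == "S"
str(dat)
```

The code columns stay character here; wrapping them in factor() afterwards would give the classes asked for in the original post.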

Hope this helps,
Chillu

