I am contemplating bringing in and merging three NHANES-III datasets from the National Center for Health Statistics that are fixed format with record length=3348, line counts around 20,000 and described by SAS DATA steps. I have downloaded and linked similar datasets from the Continuous NHANES public data releases, but never ones with this many variables at once. In the prior effort I managed the task by some cut-paste-editing from the SAS code file into a corresponding read.fwf R call, but the earlier NHANES-III data is far more voluminous than the more recent "Continuous" version. I am wondering if anyone has experience with such a process and would be willing to share some advice? The SAS code can be seen here:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas

The main code file Data step starts out...

    FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
    *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
    DATA WORK;
      INFILE ADULT MISSOVER;
      LENGTH
        SEQN      7
        DMPFSEQ   5
        DMPSTAT   3
        DMARETHN  3
        DMARACER  3
        DMAETHNR  3
        HSSEX     3

The corresponding positions in the INPUT section are

    INPUT
        SEQN     1-5
        DMPFSEQ  6-10
        DMPSTAT  11
        DMARETHN 12
        DMARACER 13
        DMAETHNR 14
        HSSEX    15

The note about CRLF appears to be implying that those characters are being counted as part of the length of the first variable, SEQN, but that there are only 5 meaningful positions. I suppose I can find out by trial and error how to read such files, but it would save me some time if anyone in the audience has worked through this on this data before.

One thought would be to import the data with the SAS work-alike program, WKS, (which I have not used before) and then to read in with read.xport from the foreign library. That would obviate the need to understand the character position issue, but probably has a time commitment to get it up and running and learn how to use it.
Another thought would be to parse the fixed-width SAS DATA step code into pieces and build a data.frame from them, from which I could then extract the row.names, col.names, and colClasses.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
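The second idea (parse the DATA step into a centralized data.frame) might look something like the sketch below. It is only a sketch under assumptions: parse_input_spec() and fwf_widths() are hypothetical helpers written for this illustration, the regular expression assumes each INPUT line has the form NAME, an optional "$", then START or START-END, and the two data records are made up so the example can run without downloading adult.dat.

```r
# A few INPUT lines as quoted in the message; adult.sas itself has far more.
input_lines <- c(
  "SEQN     1-5",
  "DMPFSEQ  6-10",
  "DMPSTAT  11",
  "DMARETHN 12",
  "DMARACER 13",
  "DMAETHNR 14",
  "HSSEX    15"
)

# Hypothetical helper, not from any package. Pass it ONLY lines from the
# INPUT section: a LENGTH line such as "SEQN 7" would be misread as a position.
parse_input_spec <- function(lines) {
  pat <- "^\\s*([A-Za-z_][A-Za-z0-9_]*)\\s+(\\$\\s*)?([0-9]+)(\\s*-\\s*([0-9]+))?\\s*$"
  lines   <- lines[grepl(pat, lines, perl = TRUE)]
  start   <- as.integer(sub(pat, "\\3", lines, perl = TRUE))
  end_txt <- sub(pat, "\\5", lines, perl = TRUE)
  # A single-column field has no "-END" part, so its end equals its start.
  end     <- ifelse(nzchar(end_txt), as.integer(end_txt), start)
  is_char <- nzchar(sub(pat, "\\2", lines, perl = TRUE))
  spec <- data.frame(
    name  = sub(pat, "\\1", lines, perl = TRUE),
    start = start,
    end   = end,
    width = end - start + 1L,
    type  = ifelse(is_char, "character", "numeric"),
    stringsAsFactors = FALSE
  )
  spec[order(spec$start), ]
}

# Turn start/end positions into read.fwf widths; a negative width tells
# read.fwf to skip that many columns. Assumes fields do not overlap.
fwf_widths <- function(spec) {
  prev_end <- c(0L, head(spec$end, -1))
  gap <- spec$start - prev_end - 1L
  w <- as.vector(rbind(-gap, spec$width))
  w[w != 0]
}

spec <- parse_input_spec(input_lines)

# Two made-up 15-character records, standing in for the real adult.dat.
tmp <- tempfile(fileext = ".dat")
writeLines(c("000030000123121", "000040000221132"), tmp)

dat <- read.fwf(tmp, widths = fwf_widths(spec),
                col.names = spec$name, colClasses = spec$type)
str(dat)
```

As far as I understand SAS, the LENGTH values (7, 5, 3, ...) are storage bytes per variable rather than column counts, so only the INPUT positions should matter for the widths. Since read.fwf reads the file line by line, the two CRLF positions counted in LRECL probably do not need a width of their own, but that is worth checking against the last INPUT position in adult.sas.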
On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius <dwinsemius at comcast.net> wrote:
> One thought would be to import the data with the SAS work-alike program,
> WKS, (which I have not used before) and then to read in with read.xport from
> the foreign library. That would obviate the need to understand the character
> position issue, but probably has a time commitment to get it up and running
> and learn how to use it.
>
> Another thought would be to parse the fixed width SAS Data step code into
> pieces and build a data.frame from which I then extract the row.names,
> col.names, and colClasses from that centralized structure.

Are the data available to the public somewhere or could just a few records be made available?

I ask because I imagine there are a lot of missing data in each record (the data are arranged in the "wide" format for longitudinal data and include follow-up questions that will not apply to most respondents). The missing data indicator, if any, and the format of the other fields will be important in deciding how to split the data.
On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:

> On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>> One thought would be to import the data with the SAS work-alike program,
>> WKS, (which I have not used before) and then to read in with read.xport from
>> the foreign library. That would obviate the need to understand the character
>> position issue, but probably has a time commitment to get it up and running
>> and learn how to use it.
>>
>> Another thought would be to parse the fixed width SAS Data step code into
>> pieces and build a data.frame from which I then extract the row.names,
>> col.names, and colClasses from that centralized structure.
>
> Are the data available to the public somewhere or could just a few
> records be made available?

Yes. Just trim the file name and the CDC ftp server accepts the path specification:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/

The file that goes with that SAS code is adult.dat:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat

> The reason I ask is because I imagine there are a lot of missing data
> in each record (the data are arranged in the "wide" format for
> longitudinal data and include follow-up questions that will not apply
> to most respondents). The missing data indicator, if any, and the
> format of the other fields will be important in deciding how to split
> the data.

Thanks for that. It was not designed as a longitudinal study, but rather as a cross-sectional study that was spaced over several years. They did a re-exam of some sort, but that was not the primary purpose, nor will it be my particular interest. I have tried to determine by examination whether "." or " " is the missing value indicator, and it appears that both may be used, although there are many more spaces.
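If both blanks and "." do turn out to mark missing values, one way to cope is to read the fields as character and convert afterwards. The following is a minimal sketch under that assumption. The raw values are made up for illustration, and classify_missing() and to_numeric() are hypothetical helper names, not functions from any package.

```r
# Made-up raw field values, as they might look after a character-mode read.
raw <- c("  12", "    ", "   .", " 3.5", "")

# Label each field so the two candidate missing-value markers can be counted.
classify_missing <- function(x) {
  x <- trimws(x)
  ifelse(x == "", "blank", ifelse(x == ".", "dot", "value"))
}

# Treat both all-blank fields and a lone "." as NA, then convert to numeric.
to_numeric <- function(x) {
  x <- trimws(x)
  x[x == "" | x == "."] <- NA_character_
  as.numeric(x)
}

table(classify_missing(raw))
to_numeric(raw)
```

Running classify_missing() over a handful of real columns would show how often each marker occurs before committing to a conversion rule.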
Most of the input suggests to my 15-year-old memories of SAS that the data is numeric, but there are 17 variables where the input spec is "$nn":

    > varLines[grep("[[:punct:]]", varLines)]
     [1] " HAX11AG $6" " HAX11AH $6" " HAX11AI $6"
     [4] " HAX11AJ $6" " HAX11AK $6" " HAX11AL $6"
     [7] " HAX11AM $6" " HAX11AN $6" " HAX11AO $6"
    [10] " HAX11AP $6" " HAX11AQ $6" " HAX11AR $6"
    [13] " HAX11AS $6" " HAX11AT $6" " HAX11AU $6"
    [16] " HAX11AV $6" " HAZA1CC $30"

-- 
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
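The "$" marker could also drive the colClasses vector directly. Here is a small sketch, assuming varLines holds one "NAME spec" string per variable as in the output above. The four sample lines are a shortened stand-in for the real vector, and col_classes is a name introduced only for this illustration.

```r
# Shortened stand-in for varLines; the real vector has one entry per variable.
varLines <- c(" SEQN 7", " HSSEX 3", " HAX11AG $6", " HAZA1CC $30")

# A literal "$" marks a character variable. fixed = TRUE avoids treating "$"
# as a regex anchor, and is narrower than matching any punctuation.
is_char <- grepl("$", varLines, fixed = TRUE)

# The variable name is the first run of non-blank characters on each line.
var_names <- sub("^\\s*(\\S+).*$", "\\1", varLines, perl = TRUE)

# Named vector that could be handed to read.fwf(..., colClasses = col_classes).
col_classes <- setNames(ifelse(is_char, "character", "numeric"), var_names)
col_classes
```

Reading the "$" variables as character should keep read.fwf from trying to coerce them to numbers.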