Thanks for following up on this, Frank. I regret that I didn't
respond to you and to the list sooner.

A bit of background: Saikat DebRoy and I developed R and C code for
the lookup.xport and read.xport functions for exactly the type of
application that you describe. We were working on a project where we
were to receive data in the SAS XPORT file format. As preparation for
that project we looked up the description of a SAS XPORT data set and
worked out code to read that format. All testing was done with data
sets generated under SAS version 6 because that is all we had access
to.

When we actually got the data, it wasn't in XPORT format, so we never
used those functions. That was several years ago and since then
neither Saikat nor I have used the lookup.xport or read.xport
functions at all. We do have access to SAS but Saikat never uses it
and I almost never use it. (Once or twice a year I run an elementary
SAS program to show a class how SAS is used, but that is about it.)

The bottom line is that lookup.xport and read.xport were written by
people who don't use SAS, working from the (sometimes incorrect)
documentation on the SAS web site, and using an old version of SAS to
generate test cases.

Your request for someone to step forward to maintain and enhance these
functions is exactly what Saikat and I would like to have happen. In
the phrase used by Debian GNU/Linux maintainers, we would like to
"orphan" these functions. That is, we would like to put them up for
adoption by someone else.

If a volunteer comes forward, we will be happy to help that
person understand our code and what we were trying to accomplish. It
is not easy to decode that format because the format itself is, shall
we say, "interesting". You point out that numeric variables of length
3 bytes cause problems. I'm not surprised - according to the
documentation that we had, such variables cannot exist. There is only
one numeric form allowed and that is the IBM System/360 double
precision format.
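
For anyone picking this up, here is a minimal Python sketch (not the C code in the foreign package) of decoding an IBM System/360 hexadecimal double. It rests on one assumption that the documentation described above does not confirm: a numeric stored in fewer than 8 bytes is the 8-byte value with its low-order bytes dropped, so the field is padded on the right with zero bytes. The name ibm_to_float and the sample bytes are made up for illustration (hand-built encodings of 2 and 30, not taken from test.xpt), and SAS missing-value codes are not handled.

```python
def ibm_to_float(raw):
    """Decode a big-endian IBM System/360 hexadecimal float of 1 to 8 bytes.

    Full 8-byte layout: 1 sign bit, a 7-bit base-16 exponent biased by 64,
    and a 56-bit fraction.
    """
    if not 1 <= len(raw) <= 8:
        raise ValueError("expected between 1 and 8 bytes")
    # ASSUMPTION: a shorter field is the 8-byte value with its low-order
    # fraction bytes dropped, so restore them as zeros.
    padded = bytes(raw).ljust(8, b"\x00")
    sign = -1.0 if padded[0] & 0x80 else 1.0
    exponent = (padded[0] & 0x7F) - 64
    fraction = int.from_bytes(padded[1:], "big")
    # value = sign * (fraction / 2**56) * 16**exponent
    return sign * (fraction / 2.0**56) * 16.0**exponent


# Hypothetical record prefix: a 3-byte field holding 2, then a 4-byte
# field holding 30 (encodings built by hand for this example).
record = bytes.fromhex("412000" "421e0000")
print(ibm_to_float(record[:3]))            # 2.0
print(ibm_to_float(record[3:7]))           # 30.0
print(round(ibm_to_float(record[:8]), 6))  # 2.000063 (reads past the field)
```

The last line decodes from the start of the 3-byte field without stopping at its end, so bytes of the neighbouring field leak into the fraction. The result, 2.000063, matches the RACE value in the read.xport output quoted below, which is consistent with (but does not prove) that kind of over-read being the cause.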

The really fun part of the format is that there is no information in
the headers about the number of rows in the data set. If you realize
that SAS was designed to read and write data on reels of magnetic
tape, this makes sense, but in today's environment it is bizarre. When
one data set ends, its total length is padded with blanks to a
multiple of 80 bytes (so it can be conveniently transferred to a deck
of punched cards, naturally), and then a magic header sequence is
written onto the next card image. This means you have to
read all data in 80-byte chunks and look ahead to the next chunk to
decide how to interpret the current chunk. It also means that the
data format is ambiguous. When records are, say, 40 bytes in length,
it is impossible to distinguish between n records where n is odd and
n+1 records where the last record happens to be all blanks. I spent
quite a bit of time thinking that I must have missed something in the
specification because this was such a glaring flaw. I looked around
the SAS web site and finally found some discussion of this. SAS
acknowledges the ambiguity and has a simple fix - "don't do that".
They specifically say that you should not put a record that is all
blanks at the end of a data set.
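
The card-image reading and the resulting ambiguity can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic in read.xport: the function names are hypothetical, and HEADER_PREFIX is assumed to be the text that starts a header card. The sketch stops at the first card that looks like a header, cuts the preceding bytes into fixed-length records, and follows the "don't do that" rule by discarding trailing records that are entirely blank.

```python
CARD = 80
# ASSUMPTION: header cards start with this text; check the format
# description before relying on it.
HEADER_PREFIX = b"HEADER RECORD*******"


def pad_to_cards(payload):
    """Pad with blanks to a multiple of 80 bytes, as a writer would."""
    remainder = len(payload) % CARD
    if remainder:
        payload += b" " * (CARD - remainder)
    return payload


def data_cards(image):
    """Collect 80-byte cards until a card looks like a header."""
    data = []
    for start in range(0, len(image), CARD):
        card = image[start:start + CARD]
        if card.startswith(HEADER_PREFIX):
            break
        data.append(card)
    return b"".join(data)


def split_records(data, record_len):
    """Cut data into records and drop trailing all-blank records."""
    usable = len(data) - len(data) % record_len
    records = [data[i:i + record_len] for i in range(0, usable, record_len)]
    while records and not records[-1].strip(b" "):
        records.pop()
    return records


three = [b"A" * 40, b"B" * 40, b"C" * 40]
four = three + [b" " * 40]          # same data plus one all-blank record
next_header = HEADER_PREFIX.ljust(CARD, b" ")
image3 = pad_to_cards(b"".join(three)) + next_header
image4 = pad_to_cards(b"".join(four)) + next_header
print(image3 == image4)                            # True
print(len(split_records(data_cards(image3), 40)))  # 3
```

The two images are byte-for-byte identical, so no reader can tell three records from four. The same arithmetic fits the lookup.xport output quoted below: the five widths sum to 31 bytes, and 2 records of 31 bytes plus a tailpad of 18 make exactly one 80-byte card.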

In any case, I appreciate your following up on this and echo your
request for a programmer to step forward and adopt these functions.

Frank E Harrell Jr <fharrell at virginia.edu> writes:
> Even though the FDA has no policies at all that limit our choices of
> statistical software, there is one de facto standard in place:
> reliance on the SAS transport file format for data submission (even
> though this format is deficient for this purpose, e.g., it does not
> even document value labels or units of measurement in a
> self-contained way). Because of the widespread use of SAS transport
> files in the pharmaceutical industry, clinical trial data analyses
> done by statistical centers like ours who receive data from
> companies often begin with SAS transport files. I have not had SAS
> on my machines in about 12 years so it would be nice to be able to
> read binary transport files instead of having to run the slower
> sas.get function in the Hmisc library. sas.get has to launch SAS to
> do its work.
>
> The foreign package implements a quick way to read such files in its
> read.xport function. This function has some significant problems
> that I reported to the developers some time ago, but neither fixes
> nor acknowledgements of the bug report seem to be forthcoming. The
> developers have done great work in writing the foreign
> package (and many other awesome contributions to the community) so I
> don't fault them at all for being creative, busy people. I am
> writing this note to see if any C language-savvy R users have done
> their own fixes or would be willing to help the developers with
> these particular fixes. The specific problems I have found are (1)
> a worrisome one in which reasonable but invalid data result from
> importing SAS numeric variables of length 3 bytes; and (2) getting
> corrupted files when the SAS transport file contains multiple SAS
> datasets. In addition, it would be great to have lookup.xport
> retrieve all SAS variable attributes including PROC FORMAT VALUE
> names, so that factor variables could be created as is done
> automatically with read.spss in foreign. Note there is also a
> problem with lookup.xport when there are multiple files. The
> documentation states that a list with a major element for each
> dataset will be created. read.xport is supposed to create a list of
> data frames for this case.
>
> Here is SAS code I used to create test files, followed by R output.
>
> libname x SASV5XPT "test.xpt";
> libname y SASV5XPT "test2.xpt";
>
> PROC FORMAT; VALUE race 1=green 2=blue 3=purple; RUN;
> PROC FORMAT CNTLOUT=format;RUN;
> data test;
> LENGTH race 3 age 4;
> age=30; label age="Age at Beginning of Study";
> race=2;
> d1='3mar2002'd ;
> dt1='3mar2002 9:31:02'dt;
> t1='11:13:45't;
> output;
>
> age=31;
> race=4;
> d1='3jun2002'd ;
> dt1='3jun2002 9:42:07'dt;
> t1='11:14:13't;
> output;
> format d1 mmddyy10. dt1 datetime. t1 time. race race.;
> run;
> /* PROC CPORT LIB=work FILE='test.xpt';run; * no; */
> PROC COPY IN=work OUT=x;SELECT test;RUN;
> PROC COPY IN=work OUT=y;SELECT test format;RUN;
>
>
> > lookup.xport('test.xpt')
> $TEST
> $TEST$headpad
> [1] 1200
>
> $TEST$type
> [1] "numeric" "numeric" "numeric" "numeric" "numeric"
>
> $TEST$width
> [1] 3 4 8 8 8
>
> $TEST$index
> [1] 1 2 3 4 5
>
> $TEST$position
> [1] 0 3 7 15 23
>
> $TEST$name
> [1] "RACE" "AGE" "D1" "DT1" "T1"
>
> $TEST$sexptype
> [1] 14 14 14 14 14
>
> $TEST$tailpad
> [1] 18
>
> $TEST$length
> [1] 2
>
> > lookup.xport('test2.xpt')
>
> Same output except tailpad=76, length=124, second dataset ignored.
>
> > read.xport('test.xpt')
>       RACE      AGE    D1        DT1    T1
> 1 2.000063 30.00000 15402 1330767062 40425
> 2 4.000063 31.00000 15494 1338716527 40453
>
> > read.xport('test2.xpt')
>             RACE          AGE           D1          DT1            T1
> 1   2.000063e+00 3.000000e+01 1.540200e+04 1.330767e+09  4.042500e+04
> 2   4.000063e+00 3.100000e+01 1.549400e+04 1.338717e+09  4.045300e+04
> . . . .
> 122 3.687825e-40 3.687825e-40 3.687825e-40 5.868918e-40  3.687825e-40
> 123 5.904941e-40 2.942346e+63 9.068390e+43           NA -5.524256e-48
> 124 3.835229e-93 6.434447e-86           NA 3.687825e-40  3.687825e-40
>
>
> test.xpt and test2.xpt may be retrieved from
> http://hesweb1.med.virginia.edu/biostat/tmp
>
> They were created on an IBM AIX machine running SAS 8.
>
> Thanks very much for any assistance. -Frank
>
> --
> Frank E Harrell Jr Prof. of Biostatistics & Statistics
> Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
> U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> http://www.stat.math.ethz.ch/mailman/listinfo/r-help
--
Douglas Bates bates at stat.wisc.edu
Statistics Department 608/262-2598
University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/