Saikat DebRoy and I have been working on an R package that, among
other things, will read SAS data libraries in the XPORT format. Even
though a SAS data set is of a fairly simple structure, it is a
challenge to write code to read the libraries "properly". In fact, I
am beginning to think that the file format is not well-defined.
Here is the situation:
- a SAS data set is a table. The columns can be numeric (in the
XPORT format the only numeric format allowed is the IBM mainframe
double precision format) or character strings. In a character
column, all the strings are blank-padded to the same length and
that length is included in the header information. There is no
terminator character for strings. There is no record terminator
character.
- a SAS library file can contain more than one table.
- the header information for each table includes the number of
columns and the format of each column but does _not_ include the
number of rows. (Why not? Remember that SAS was developed at a
time when any large amount of data was stored on punched cards or
magnetic tape. SAS functions as a data filter, for the most
part. You don't know how many rows you have until you get to the
end and then you can't go back and change the beginning because of
the way magnetic tape drives work. Of course, the relevance of
these considerations to computing resources in the year 2000 is
questionable.)
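
To make the record layout concrete, here is a minimal sketch in Python (not the code of our R package). The function name split_record, the example widths, and the sample bytes are hypothetical; the only facts taken from the description above are that fields are fixed-width, that character fields are blank-padded, and that there are no terminators.

```python
def split_record(record, widths):
    """Cut one fixed-width record into its fields.

    There are no field or record terminators, so the column widths
    from the table header are the only guide to where fields end.
    """
    if len(record) != sum(widths):
        raise ValueError("record length does not match the column widths")
    fields = []
    offset = 0
    for width in widths:
        fields.append(record[offset:offset + width])
        offset += width
    return fields


# Hypothetical table: one 8-byte numeric column, one 5-byte character column.
sample = b"\x41\x10\x00\x00\x00\x00\x00\x00" + b"ab   "
numeric_raw, text_raw = split_record(sample, [8, 5])

# Character values are blank-padded, so trailing blanks are stripped.
name = text_raw.decode("ascii").rstrip(" ")
```

Because the number of rows is not stored, a reader built this way can only keep cutting records until it decides, by some other means, that the table has ended.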
So how do you know when you have reached the end of one data set and
started another? SAS always works in blocks of 80 bytes. If the last
record in a table does not completely fill an 80 byte block, the
remainder of the block is blank-padded. The next table will begin
with 80 bytes that must be exactly
"HEADER RECORD*******MEMBER  HEADER RECORD!!!!!!!000000000000000001600000000140  "
except on certain VAX/VMS computers where the 140 at the end is 136.
I ran into a situation where the data were 484 rows of 8 numeric
columns. The total length of each record is 64 bytes so the data
proper occupies 30976 bytes, not counting the headers. This is a
total of 387 complete 80-byte blocks with 16 bytes left over. The
final, partial block is therefore padded with 64 blanks.
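
The arithmetic can be checked with a few lines of Python. This is only a sketch; the function name block_layout is hypothetical, and the 80-byte block size is the one described above.

```python
BLOCK = 80  # block size in bytes, as described above


def block_layout(n_rows, record_length, block=BLOCK):
    """Return (data bytes, complete blocks, leftover bytes, padding bytes)."""
    data_bytes = n_rows * record_length
    full_blocks, leftover = divmod(data_bytes, block)
    # No padding is needed when the data ends exactly on a block boundary.
    padding = (block - leftover) % block
    return data_bytes, full_blocks, leftover, padding


print(block_layout(484, 64))  # (30976, 387, 16, 64)
```

Note that the padding here happens to be exactly one record long, which is what makes this particular file awkward.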
So how do we know that these 64 blanks are not another data record?
In the case of numeric data, the particular number corresponding to 8
blanks (using the IBM mainframe floating point format, not the IEEE
format) is
> Pheno[745,]
           INDIV         TIME         DOSE       WEIGHT         CONC
745 3.687825e-40 3.687825e-40 3.687825e-40 3.687825e-40 3.687825e-40
          NEWSUB     APGARLOW         TLAG
745 3.687825e-40 3.687825e-40 3.687825e-40
We can probably employ some heuristics and say that this is not a
common value so we guess that the 64 blanks are padding and not
another data record.
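
For anyone who wants to reproduce that value, here is a minimal Python sketch of the conversion. It assumes the usual IBM mainframe double layout (1 sign bit, a 7-bit base-16 exponent with a bias of 64, and a 56-bit fraction, stored big-endian); the function name ibm_double_to_float is hypothetical and this is not the code from our package.

```python
def ibm_double_to_float(raw):
    """Convert 8 bytes in IBM mainframe double format to a Python float."""
    if len(raw) != 8:
        raise ValueError("an IBM double is exactly 8 bytes")
    value = int.from_bytes(raw, "big")
    sign = -1.0 if value >> 63 else 1.0
    exponent = (value >> 56) & 0x7F       # power of 16, biased by 64
    fraction = value & 0x00FFFFFFFFFFFFFF  # 56-bit fraction, read as 0.f
    return sign * (fraction / 2 ** 56) * 16.0 ** (exponent - 64)


# Eight ASCII blanks (0x20 each), read as a numeric field.
blank_value = ibm_double_to_float(b" " * 8)
print("%.6e" % blank_value)  # 3.687825e-40
```

A heuristic along these lines would compare each field of a trailing 64-byte candidate record with this value, but it remains a guess: nothing in the bytes themselves says the value was not intended.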
But what if they had been 8 character fields, each of width 8 bytes?
Is it possible to distinguish between a blank record and blank
padding? I can't see that it is possible. Is there some restriction
in SAS that says you can't have a record in which all the fields are
blank?
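
The ambiguity is easy to demonstrate. The Python sketch below builds two hypothetical tables of 8 character fields of width 8 (64 bytes per record): one with 484 rows, and one with the same 484 rows followed by an all-blank row. The helper pad_to_block and the made-up row contents are assumptions for illustration only.

```python
BLOCK = 80  # block size in bytes


def pad_to_block(data, block=BLOCK):
    """Blank-pad data up to the next multiple of the block size."""
    return data + b" " * (-len(data) % block)


RECORD_LENGTH = 64  # 8 character fields, each 8 bytes wide

# 484 made-up, non-blank records, each blank-padded to 64 bytes.
rows = [str(i).ljust(RECORD_LENGTH).encode("ascii") for i in range(484)]
blank_row = b" " * RECORD_LENGTH

table_484 = pad_to_block(b"".join(rows))              # 16 bytes + 64 blanks of padding
table_485 = pad_to_block(b"".join(rows) + blank_row)  # ends exactly on a block boundary

print(len(table_484), len(table_485), table_484 == table_485)  # 31040 31040 True
```

The two byte streams are identical, so no reader can tell them apart from the data section alone.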
One could pass this off as of little interest to the R community
except that this format is now a "standard" that has been adopted by
the United States Food and Drug Administration (FDA). See
http://www.sas.com/software/industry/pht/fda/index.html
--
Douglas Bates bates@stat.wisc.edu
Statistics Department 608/262-2598
University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/