thr3ads.net - R help - [R] A file with extension .sdb in a codebook section of a large database from a survey? [Mar 2010]

Douglas Bates

2010-Mar-25 20:54 UTC

[R] A file with extension .sdb in a codebook section of a large database from a survey?

The TIMSS2007 database http://timss.bc.edu/TIMSS2007/idb_ug.html seems
to provide "both kinds" of universal data formats - either SPSS saved
data sets or SAS saved data sets.  (Yes, I am being sarcastic.)
These, of course, are accompanied by massive codebooks explaining the
nature of each of the fields in the data sets.  The T07_Codebooks.zip
file available at that site contains .pdf files and .sdb files, which
seem to contain the information from the codebooks in some kind of
binary format.  Does anyone know where that format is defined?  I
imagine I could reverse-engineer it but would prefer not to do so.

I would like to use part of this dataset as an example of a very large
hierarchically structured data set for analysis in lme4.

Marc Schwartz

2010-Mar-25 21:38 UTC

On Mar 25, 2010, at 3:54 PM, Douglas Bates wrote:
> The TIMSS2007 database http://timss.bc.edu/TIMSS2007/idb_ug.html seems
> to provide "both kinds" of universal data formats - either SPSS saved
> data sets or SAS saved data sets.  (Yes, I am being sarcastic.)
> These, of course, are accompanied by massive codebooks explaining the
> nature of each of the fields in the data sets.  The T07_Codebooks.zip
> file available at that site contains .pdf files and .sdb files, which
> seem to contain the information from the codebooks in some kind of
> binary format.  Does anyone know where that format is defined?  I
> imagine I could reverse-engineer it but would prefer not to do so.
> 
> I would like to use part of this dataset as an example of a very large
> hierarchically structured data set for analysis in lme4.

Doug,

According to the User Guide, bottom of page 110, they are "standard
Dbase" files.  I tried reading one of them directly with read.dbf() in
'foreign', but that did not work.  However, if you rename the extension
from .sdb to .dbf, the files can be read with read.dbf():

# rename ACGTMSM4.sdb to ACGTMSM4.dbf
> str(read.dbf("ACGTMSM4.dbf"))
'data.frame':	116 obs. of  28 variables:
 $ FIELD_NAME: Factor w/ 116 levels "AC4GAPAD","AC4GAPCH",..: 104 108 73 26 23 51 50 44 25 41 ...
 $ FIELD_TYPE: Factor w/ 2 levels "C","N": 2 2 2 2 1 1 1 1 2 2 ...
 $ FIELD_LEN : int  5 4 5 5 1 1 1 1 3 2 ...
 $ FIELD_DEC : int  0 0 0 0 0 0 0 0 0 0 ...
 $ FIELD_LABL: Factor w/ 116 levels "COUNTRY ID","EXPLICIT STRATUM CODE",..: 1 98 74 75 76 71 70 48 47 72 ...
 $ QUEST_LOC : Factor w/ 106 levels "COUNTRY","DATE",..: 1 8 9 10 11 12 13 14 15 16 ...
 $ MISSING   : Factor w/ 6 levels "9","99","999",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ NOTAPPL   : Factor w/ 6 levels "8","98","998",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ DEFAULT   : Factor w/ 4 levels "7","97","997",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ FIELD_VALI: Factor w/ 105 levels ".T.","(AC4GAPAD>=0.AND.AC4GAPAD<=97).OR.AC4GAPAD=999.OR.AC4GAPAD=998",..: 1 14 13 10 30 54 53 47 9 11 ...
 $ FIELD_CODE: Factor w/ 41 levels "0 TO 10 PERCENT:1;11 TO 25 PERCENT:2;26 TO 50 PERCENT:3;MORE THAN 50 PERCENT:4;omitted:9;not admin.:8;",..: 21 22 14 11 28 1 1 29 10 12 ...
 $ FIELD_EDIT: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ FIELD_CARR: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ORDER_SCRN: int  1 2 3 4 5 6 7 8 9 10 ...
 $ ORDER_FILE: int  1 2 3 4 5 6 7 8 9 10 ...
 $ COMMENT1  : Factor w/ 4 levels "Released in TIMSS 2003 as acdgpsc",..: NA NA NA NA NA NA NA NA NA NA ...
 $ MEAS_CLASS: Factor w/ 6 levels "B","BD","D","DERI",..: 6 6 1 1 1 1 1 1 1 1 ...
 $ IDBOOK    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ FMT       : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ DUMMY     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ VALID_VAL : Factor w/ 7 levels ".T.","1;2;","1;2;3;",..: 1 NA NA NA 6 4 4 4 NA NA ...
 $ MIN_MAX   : Factor w/ 8 levels "0;11000","0;1200",..: NA 6 1 2 NA NA NA NA 7 8 ...
 $ FILTER_VAR: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ FILTER_CND: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ CONFIRMED : logi  NA NA NA NA NA NA ...
 $ SASPG1    : Factor w/ 6 levels "B","BD","DPC",..: 5 5 1 1 1 1 1 1 1 1 ...
 $ SASPG2    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ SASPG3    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "data_types")= chr  "C" "C" "N" "N" ...
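
The rename step can also be done from within R, without touching the
original files.  What follows is only a minimal sketch: read_sdb() is a
hypothetical helper name (it is not part of 'foreign'), and it assumes
that the .sdb files really are plain dBase files that read.dbf() accepts
once the file name ends in .dbf, as observed above.

```r
library(foreign)  # provides read.dbf()

# Hypothetical helper (not part of 'foreign'): copy a .sdb codebook file
# to a temporary file whose name ends in .dbf, then read that copy.
# Assumption: the .sdb file is an ordinary dBase file under another name.
read_sdb <- function(path, ...) {
  stopifnot(file.exists(path))
  tmp <- tempfile(fileext = ".dbf")
  on.exit(unlink(tmp))            # remove the temporary copy afterwards
  if (!file.copy(path, tmp)) {
    stop("could not copy ", path)
  }
  read.dbf(tmp, ...)              # extra arguments go to read.dbf()
}

# Usage, assuming ACGTMSM4.sdb is in the working directory:
# cb <- read_sdb("ACGTMSM4.sdb", as.is = TRUE)
# str(cb)
```

Passing as.is = TRUE keeps the character fields as character vectors
rather than factors, which may be more convenient for codebook text.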


However, there were warnings:

> warnings()
Warning messages:
1: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
2: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
3: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
...


The content of the above data frame does seem to correspond to the PDF file
content.
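
To turn the FIELD_CODE entries into something usable as value labels,
one possible approach is sketched below.  It assumes that every entry
follows the "label:value;" pattern visible in the single level shown in
the str() output above; that has not been checked against the other
levels or against the User Guide.  parse_field_code() is a hypothetical
name used only for illustration.

```r
# Hypothetical helper: split one FIELD_CODE string into a table of
# values and labels.
# Assumption: the string is a sequence of "label:value;" pairs.
parse_field_code <- function(x) {
  x <- gsub("[\r\n]+", " ", as.character(x))   # undo any line wrapping
  parts <- strsplit(x, ";", fixed = TRUE)[[1]]
  parts <- trimws(parts)
  parts <- parts[nzchar(parts)]                # drop the empty trailing piece
  # Split each pair at its last colon, in case a label contains a colon.
  label <- sub(":[^:]*$", "", parts)
  value <- sub("^.*:", "", parts)
  data.frame(value = value, label = label, stringsAsFactors = FALSE)
}

codes <- paste0("0 TO 10 PERCENT:1;11 TO 25 PERCENT:2;",
                "26 TO 50 PERCENT:3;MORE THAN 50 PERCENT:4;",
                "omitted:9;not admin.:8;")
parse_field_code(codes)
```

The values are returned as character strings on purpose, since the
codebook fields can be either character ("C") or numeric ("N").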

HTH,

Marc Schwartz
