Douglas Bates
2010-Mar-25 20:54 UTC
[R] A file with extension .sdb in a codebook section of a large database from a survey?
The TIMSS2007 database http://timss.bc.edu/TIMSS2007/idb_ug.html seems to provide "both kinds" of universal data formats - either SPSS saved data sets or SAS saved data sets. (Yes, I am being sarcastic.) These, of course, are accompanied by massive codebooks explaining the nature of each of the fields in the data sets. The T07_Codebooks.zip file available at that site contains .pdf files and .sdb files, which seem to contain the information from the codebooks in some kind of binary format. Does anyone know where that format is defined? I imagine I could reverse-engineer it but would prefer not to do so.

I would like to use part of this dataset as an example of a very large hierarchically structured data set for analysis in lme4.
Marc Schwartz
2010-Mar-25 21:38 UTC
[R] A file with extension .sdb in a codebook section of a large database from a survey?
On Mar 25, 2010, at 3:54 PM, Douglas Bates wrote:

> The TIMSS2007 database http://timss.bc.edu/TIMSS2007/idb_ug.html seems
> to provide "both kinds" of universal data formats - either SPSS saved
> data sets or SAS saved data sets. (Yes, I am being sarcastic.)
> These, of course, are accompanied by massive codebooks explaining the
> nature of each of the fields in the data sets. The T07_Codebooks.zip
> file available at that site contains .pdf files and .sdb files, which
> seem to contain the information from the codebooks in some kind of
> binary format. Does anyone know where that format is defined? I
> imagine I could reverse-engineer it but would prefer not to do so.
>
> I would like to use part of this dataset as an example of a very large
> hierarchically structured data set for analysis in lme4.

Doug,

According to the User Guide, bottom of page 110, they are "standard Dbase" files.

I first tried reading one of them as-is with read.dbf() in the 'foreign' package, but that did not work. However, if you rename the extension from .sdb to .dbf, the files can be read with read.dbf():

# rename ACGTMSM4.sdb to ACGTMSM4.dbf

> str(read.dbf("ACGTMSM4.dbf"))
'data.frame':   116 obs. of  28 variables:
 $ FIELD_NAME: Factor w/ 116 levels "AC4GAPAD","AC4GAPCH",..: 104 108 73 26 23 51 50 44 25 41 ...
 $ FIELD_TYPE: Factor w/ 2 levels "C","N": 2 2 2 2 1 1 1 1 2 2 ...
 $ FIELD_LEN : int  5 4 5 5 1 1 1 1 3 2 ...
 $ FIELD_DEC : int  0 0 0 0 0 0 0 0 0 0 ...
 $ FIELD_LABL: Factor w/ 116 levels "COUNTRY ID","EXPLICIT STRATUM CODE",..: 1 98 74 75 76 71 70 48 47 72 ...
 $ QUEST_LOC : Factor w/ 106 levels "COUNTRY","DATE",..: 1 8 9 10 11 12 13 14 15 16 ...
 $ MISSING   : Factor w/ 6 levels "9","99","999",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ NOTAPPL   : Factor w/ 6 levels "8","98","998",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ DEFAULT   : Factor w/ 4 levels "7","97","997",..: NA NA 4 4 1 1 1 1 3 2 ...
 $ FIELD_VALI: Factor w/ 105 levels ".T.","(AC4GAPAD>=0.AND.AC4GAPAD<=97).OR.AC4GAPAD=999.OR.AC4GAPAD=998",..: 1 14 13 10 30 54 53 47 9 11 ...
 $ FIELD_CODE: Factor w/ 41 levels "0 TO 10 PERCENT:1;11 TO 25 PERCENT:2;26 TO 50 PERCENT:3;MORE THAN 50 PERCENT:4;omitted:9;not admin.:8;",..: 21 22 14 11 28 1 1 29 10 12 ...
 $ FIELD_EDIT: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ FIELD_CARR: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ORDER_SCRN: int  1 2 3 4 5 6 7 8 9 10 ...
 $ ORDER_FILE: int  1 2 3 4 5 6 7 8 9 10 ...
 $ COMMENT1  : Factor w/ 4 levels "Released in TIMSS 2003 as acdgpsc",..: NA NA NA NA NA NA NA NA NA NA ...
 $ MEAS_CLASS: Factor w/ 6 levels "B","BD","D","DERI",..: 6 6 1 1 1 1 1 1 1 1 ...
 $ IDBOOK    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ FMT       : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ DUMMY     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ VALID_VAL : Factor w/ 7 levels ".T.","1;2;","1;2;3;",..: 1 NA NA NA 6 4 4 4 NA NA ...
 $ MIN_MAX   : Factor w/ 8 levels "0;11000","0;1200",..: NA 6 1 2 NA NA NA NA 7 8 ...
 $ FILTER_VAR: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ FILTER_CND: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ CONFIRMED : logi  NA NA NA NA NA NA ...
 $ SASPG1    : Factor w/ 6 levels "B","BD","DPC",..: 5 5 1 1 1 1 1 1 1 1 ...
 $ SASPG2    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 $ SASPG3    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "data_types")= chr  "C" "C" "N" "N" ...

However, there were warnings:

> warnings()
Warning messages:
1: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
2: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
3: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
...

The content of the above data frame does seem to correspond to the PDF file content.

HTH,

Marc Schwartz
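
For anyone who would rather not rename the extracted files by hand, the rename-then-read step described in the reply can be wrapped in a small function. This is only a sketch: read_sdb() is a hypothetical helper name (it is not part of 'foreign'), and it relies on the observation in this thread that read.dbf() accepts the file once it carries a .dbf extension. It copies the .sdb file to a temporary .dbf file and leaves the original untouched.

```r
library(foreign)  # provides read.dbf()

# Hypothetical helper (not part of 'foreign'): copy an .sdb codebook to a
# temporary file with a .dbf extension, then read that copy with read.dbf().
read_sdb <- function(path, ...) {
  stopifnot(file.exists(path))
  tmp <- tempfile(fileext = ".dbf")
  on.exit(unlink(tmp), add = TRUE)   # remove the temporary copy afterwards
  if (!file.copy(path, tmp, overwrite = TRUE)) {
    stop("could not copy ", path)
  }
  read.dbf(tmp, ...)                 # extra arguments (e.g. as.is) pass through
}

# Usage, assuming ACGTMSM4.sdb has been extracted into the working directory:
# codebook <- read_sdb("ACGTMSM4.sdb", as.is = TRUE)
# str(codebook)
```

Passing as.is = TRUE keeps the character fields as character vectors rather than factors, which may be more convenient for long codebook strings such as FIELD_CODE. The warnings about logical fields reported above would presumably still appear.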