Often it is useful to keep a "codebook" to document the contents of a dataset. (By "dataset" I mean a rectangular structure such as a dataframe.)

The codebook has as many rows as the dataset has columns (variables, fields). The columns (fields) of the codebook may include:

  * variable name

  * type (character, factor, integer, etc.)

  * variable label (e.g., a variable called "bmi2" might be labeled "BMI hand-input by clinic personnel, must be checked")

  * permissible values

  * which values indicate missing (and potentially different kinds of missing)

Some statistics software (e.g., SPSS and Stata) provides at least a subset of this kind of information automatically in a convenient form. For instance, in Stata one can define a "label" for a variable and it is thenceforth linked to the variable. In output from certain modeling and graphics functions, Stata by default uses the label rather than the variable name.

Furthermore: In Stata, if "myvariable" is labeled numeric (in R lingo, a factor), and I type

    codebook myvariable

then Stata tells me, among other things, the "levels" of myvariable.

Does a tool of this sort exist in R?

The prompt() function is related to this, but prompt(someDataFrame) creates a text file on disk. The text file is associated with, but not unambiguously linked to, someDataFrame.

The epicalc function codebook() provides a summary of a dataframe similar to that created by summary() but easier to read. But this is not a way to define and keep track of labels that are linked to variables.

To link a dataframe to its codebook, one could do the following "by hand": Create a list, say, "somedata", where somedata$DATA is a dataframe that contains the data, and somedata$VARIABLE is also a dataframe, but serves as the codebook. For instance, the following function creates a template that one could subsequently edit to insert variable labels and then use as somedata$VARIABLE.

    fnJunk <- function(THESEDATA) {
      # From a dataframe, make the start of a codebook.
      if (!is.data.frame(THESEDATA)) stop("THESEDATA must be a data frame")
      data.frame(
        Variable = names(THESEDATA)
        # Take only the first class so that multi-class columns
        # (e.g. ordered factors) still yield one entry per variable.
        , class = sapply(THESEDATA, function(x) class(x)[1])
        , type = sapply(THESEDATA, typeof)
        , label = ""
        , comment = ""
      )
    }

But the following automatic behavior would be nice:

  * We should be able to treat somedata exactly as we treat a dataframe, so that the fact that it possesses a "codebook" is merely an added benefit, not an interference with the usual tasks.

  * If we delete a column of somedata$DATA, the associated row of somedata$VARIABLE should be automatically deleted.

  * If we add a column to somedata$DATA, the associated row should be inserted in somedata$VARIABLE, and some of the fields, such as variable name and type, automatically populated.

It could get fancier. For instance:

  * If we try to add a value to a field in somedata$DATA which is not permitted by the "permissible values" listed for this field in somedata$VARIABLE, we get an error.

Has anyone already thought this through, maybe defined a class and associated methods?

Thanks

Jacob A. Wegelin
Assistant Professor
Department of Biostatistics
Virginia Commonwealth University
730 East Broad Street Room 3006
P. O. Box 980032
Richmond VA 23298-0032
U.S.A.
E-mail: jwegelin at vcu.edu
URL: http://www.people.vcu.edu/~jwegelin
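The automatic behavior asked for in this post (a codebook row appears when a column is added and disappears when a column is deleted) can be sketched in base R with a small S3 class. This is only a minimal sketch under stated assumptions, not an existing package: the class name "cbdf" and the helpers make_codebook(), as_cbdf(), codebook(), set_label() and resync() are all hypothetical names invented for illustration. Here the codebook is stored as an attribute of the dataframe rather than as a second list element, so that the object can still be used as an ordinary dataframe.

```r
# All names below (cbdf, make_codebook, as_cbdf, codebook, set_label, resync)
# are hypothetical; none of them comes from an existing package.

# Build a fresh codebook: one row per column of the data frame.
make_codebook <- function(d) {
  stopifnot(is.data.frame(d))
  data.frame(
    variable = names(d),
    class    = unname(vapply(d, function(x) class(x)[1], character(1))),
    type     = unname(vapply(d, typeof, character(1))),
    label    = rep("", ncol(d)),
    stringsAsFactors = FALSE
  )
}

# Attach a codebook as an attribute and mark the object with an extra S3 class.
as_cbdf <- function(d, cb = make_codebook(d)) {
  attr(d, "codebook") <- cb
  class(d) <- c("cbdf", setdiff(class(d), "cbdf"))
  d
}

# Accessor for the attached codebook.
codebook <- function(x) attr(x, "codebook")

# Store a label for one variable in the attached codebook.
set_label <- function(x, variable, label) {
  cb <- codebook(x)
  stopifnot(variable %in% cb$variable)
  cb$label[cb$variable == variable] <- label
  attr(x, "codebook") <- cb
  x
}

# Rebuild the codebook for the current columns, carrying over existing labels.
resync <- function(d, old) {
  new <- make_codebook(d)
  hit <- match(new$variable, old$variable)
  new$label[!is.na(hit)] <- old$label[hit[!is.na(hit)]]
  as_cbdf(d, new)
}

# Adding (x$v <- value) or deleting (x$v <- NULL) a column resyncs the codebook.
`$<-.cbdf` <- function(x, name, value) {
  old <- codebook(x)
  class(x) <- setdiff(class(x), "cbdf")  # use plain data.frame assignment
  x[[name]] <- value
  resync(x, old)
}

d <- as_cbdf(data.frame(id = 1:3, bmi2 = c(21.5, 30.1, 24.8)))
d <- set_label(d, "bmi2", "BMI hand-input by clinic personnel, must be checked")
d$sex <- factor(c("f", "m", "f"))  # new column: the codebook gains a row
d$id  <- NULL                      # deleted column: the codebook loses a row
print(codebook(d))
```

Only assignment through `$<-` is covered here. Subsetting with `[`, assignment with `[[<-`, and the check against "permissible values" would each need a method of their own, so this shows the idea rather than a complete answer.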
I find that Harrell's describe() (in Hmisc) provides some of that desired functionality. When I am creating a paper codebook, I print the results of describe() for a dataframe to create an overview snapshot, and I post a copy of str(dfname) on the wall. As its help page says:

"describe is especially useful for describing data frames created by *.get, as labels, formats, value labels, and (in the case of sas.get) frequencies of special missing values are printed."

I believe that Frank has developed some functions to replicate SAS's subtyping of NA values, although I have not explored such facilities. I also find that summary(dfname) provides some useful information that describe() does not.

--
David.

On Oct 28, 2009, at 1:27 PM, Jacob Wegelin wrote:

> [snip]

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
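For readers who have not used the functions mentioned in this reply, here is a minimal sketch of attaching a label with Hmisc and then inspecting the dataframe with describe(), str() and summary(). The dataframe and its values are made up for illustration, and the Hmisc part is skipped if the package is not installed.

```r
# Toy data, invented for illustration only.
d <- data.frame(id = 1:3, bmi2 = c(21.5, 30.1, 24.8))

if (requireNamespace("Hmisc", quietly = TRUE)) {
  library(Hmisc)
  # The label is stored as an attribute of the column itself.
  label(d$bmi2) <- "BMI hand-input by clinic personnel, must be checked"
  print(label(d$bmi2))
  # describe() prints the label alongside its per-variable summary.
  print(describe(d))
}

# Base R views that complement describe().
str(d)
print(summary(d))
```

Because the label lives on the column, it is dropped together with the column when the column is deleted, which covers part of what the original post asks for.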
Alzola and Harrell discuss some of these issues in "An Introduction to S and the Hmisc and Design Libraries".

-ista

On Wed, Oct 28, 2009 at 1:27 PM, Jacob Wegelin <jacobwegelin at fastmail.fm> wrote:

> [snip]

--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org
As does Muenchen in RforSASSPSSusers.pdf and in the book that grew out of that effort:

http://rforsasandspssusers.googlepages.com/RforSASSPSSusers.pdf

http://www.amazon.com/SAS-SPSS-Users-Statistics-Computing/dp/0387094172/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1217456813&sr=8-1

http://rforsasandspssusers.com/

Also see Quick-R:

http://www.statmethods.net/input/variablelables.html

On Oct 28, 2009, at 2:14 PM, Ista Zahn wrote:

> Alzola and Harrell discuss some of these issues in "An introduction to
> S and the Hmisc and Design Libraries".
>
> -ista
>
> [snip]

David Winsemius, MD
Heritage Laboratories
West Hartford, CT