Dear R, I import data from spss into a R data.frame. On this rawdata I do some data processing (selection of observations, normalization, recoding of variables etc..). The result is stored in a new data.frame, however, in this new data.frame the value labels are lost. Example of what I do in code: # read raw data from spss rawdata <- read.spss("./data/T50937.SAV", use.value.labels=FALSE,to.data.frame=TRUE) # select the observations that we need diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 | rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 | rawdata$D22==24 | rawdata$D22==33,] The result is that rawdata$D22 has value labels and that diarydata$D22 is numeric without value labels. Question: How can I prevent this from happening? Thanks in advance! Groeten, Arne
Marc Schwartz (via MN)
2006-Jul-12 18:14 UTC
[R] Keep value lables with data frame manipulation
On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:> Dear R, > > I import data from spss into a R data.frame. On this rawdata I do some > data processing (selection of observations, normalization, recoding of > variables etc..). The result is stored in a new data.frame, however, in > this new data.frame the value labels are lost. > > Example of what I do in code: > > # read raw data from spss > rawdata <- read.spss("./data/T50937.SAV", > use.value.labels=FALSE,to.data.frame=TRUE) > > # select the observations that we need > diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 | > rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 | > rawdata$D22==24 | rawdata$D22==33,] > > The result is that rawdata$D22 has value labels and that diarydata$D22 > is numeric without value labels. > > Question: How can I prevent this from happening? > > Thanks in advance! > Groeten, > ArneTwo things: 1. With respect to your subsetting, your lengthy code can be replaced with the following: diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33)) See ?subset and ?"%in%" for more information. 2. With respect to keeping the label related attributes, the 'value.labels' attribute and the 'variable.labels' attribute will not by default survive the use of "[".data.frame in R (see ?Extract and ?"[.data.frame"). On the other hand, based upon my review of ?read.spss, the SPSS value labels should be converted to the factor levels of the respective columns when 'use.value.labels = TRUE' and these would survive a subsetting. If you want to consider a solution to the attribute subsetting issue, you might want to review the following post by Gabor Grothendieck in May, which provides a possible solution: https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html and this post by me, for an explanation of what is happening in Gabor's solution: https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html HTH, Marc Schwartz
At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:>On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote: >> Dear R, >> >> I import data from spss into a R data.frame. On this rawdata I do some >> data processing (selection of observations, normalization, recoding of >> variables etc..). The result is stored in a new data.frame, however, in >> this new data.frame the value labels are lost. >> >> Example of what I do in code: >> >> # read raw data from spss >> rawdata <- read.spss("./data/T50937.SAV", >> use.value.labels=FALSE,to.data.frame=TRUE) >> >> # select the observations that we need >> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 | >> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 | >> rawdata$D22==24 | rawdata$D22==33,] >> >> The result is that rawdata$D22 has value labels and that diarydata$D22 >> is numeric without value labels. >> >> Question: How can I prevent this from happening? >> >> Thanks in advance! >> Groeten, >> Arne > >Two things: > >1. With respect to your subsetting, your lengthy code can be replaced >with the following: > > diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33)) > >See ?subset and ?"%in%" for more information. > > >2. With respect to keeping the label related attributes, the >'value.labels' attribute and the 'variable.labels' attribute will not by >default survive the use of "[".data.frame in R (see ?Extract >and ?"[.data.frame"). > >On the other hand, based upon my review of ?read.spss, the SPSS value >labels should be converted to the factor levels of the respective >columns when 'use.value.labels = TRUE' and these would survive a >subsetting. > >If you want to consider a solution to the attribute subsetting issue, >you might want to review the following post by Gabor Grothendieck in >May, which provides a possible solution: > > https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html > >and this post by me, for an explanation of what is happening in Gabor's >solution: > > https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html > >HTH, > >Marc Schwartz >Hello Mark and Arne, I worked on the suggestions of Gabor and Mark and programmed some functions in this way, but they are very, very preliminary (see below). In my view there is a lack of convenient possibilities in R to document empirical data by variable labels, value labels, etc. I would prefer to have these possibilities in the "standard" configuration. So I sketched a concept, but in my view it would only be useful, if there was some acceptance by the core developers of R. The concept would be to define a class. For now I call it "source.data". To design it more flexible than the Hmisc class "labelled" I would define a related option "source.data.attributes" with default c('value.labels', 'variable.name', 'label')). This option contains all attributes that should persist in subsetting/indexing. I made only some very, very preliminary tests with these functions, mainly because I am not happy with defining a new class. Instead I would prefer, if this functionality could be integrated in the Hmisc class "labelled", since this is in my view the best known starting point for data documentation in R. I would be happy, if there were some discussion about the wishes/needs of other Rusers concerning data documentation. Greetings, Heinz ### intention and concept # There should be a convenient possibility to keep source data numerical # coded and at the same time have labelled categories. # Such labelled categorical numerical data should be easily converted # to factors. # Indexing/subsetting should preserve the concerned attributes of this data. ### description of (intended!!!) functionality # - a class source.data is defined. It is intended only for atomic objects. # - option source.data.attributes defines which attributes will be copied # in indexing/subsetting objects of class source.data # - option source.data.is.ordered sets defining factors as ordered, when # built from objects of class source.data by the function factsd # - function 'value.labels<-' assigns an attribute value.labels and sets # class source.data # - function value.labels reads the attribute value.labels # - the indexing method '[.source.data' defines indexing for source.data # - the print method print.source.data ignores source.data.attributes in # printing # - the as.data.frame method as.data.frame.source.data enables inclusion # of objects of class source.data in data.frames # - function factsd should in general behave as function factor but should # in case of an object of class source.data by default use the value.labels # as levels and the names(value.labels) as the labels of the new built # factor. # If the parameter ordered is NULL it should create ordered factors # according to the option source.data.is.ordered. ### set option for source.data.attributes options(source.data.attributes=c('value.labels', 'variable.name', 'label')) ### set option for converting source.data class in ordered factors options(source.data.is.ordered=TRUE) ### function to assign value.labels 'value.labels<-' <- function (x, value) ## adapted from Hmisc function label 30.6.2006 { if(!is.atomic(x)) stop('value.labels<- is applicabel to atomic objects only') structure(x, value.labels = value, class = c("source.data", attr(x, "class")[attr(x, "class") != "source.data"])) } ### function to read value.labels value.labels <- function (x) { attr(x, 'value.labels') } ### definition of indexing method for class=source.data ## source.data.attributes shall be conserved "[.source.data" <- function(x, ...) { atr <- attributes(x) atr.names <- names(atr) sda <- options()$'source.data.attributes' sda.match <- match(atr.names, sda) sda.match <- sda.match[!is.na(sda.match)] x <- NextMethod("[") ## assign source.data.attributes to result if(length(sda.match)) for (i in sda.match) attr(x, sda[i]) <- atr[[sda[i]]] ## assign class source.data to result class(x) <- c('source.data', attr(x, "class")[attr(x, "class") != "source.data"]) x } ### print method for source.data 'print.source.data' <- function (x, ...) { ## adapted from Hmisc print.labelled 31.5.2006 x.orig <- x ## look if there are source.data.attributes sda <- options()$'source.data.attributes' sda.match <- match(names(attributes(x)), sda) sda.match <- sda.match[!is.na(sda.match)] ## delete source.data.attributes for printing if(length(sda.match)) for (i in sda.match) attr(x, sda[i]) <- NULL ## delete class source.data for printing class(x) <- if (length(class(x)) == 1 && class(x) == "source.data") NULL else class(x)[class(x) != "source.data"] NextMethod("print") invisible(x.orig) } ### Define function as.data.frame.source.data (copy from as.data.frame.vector) # many as.data.frame methods are identical to this ## different functions as.data.frame are besides others: # as.data.frame.list, as.data.frame.default, as.data.frame.data.frame, # as.data.frame.character, as.data.frame.AsIs, as.data.frame.array, as.data.frame.source.data <- function (x, row.names = NULL, optional = FALSE) ## copy from as.data.frame.vector 1.6.2006 { nrows <- length(x) nm <- paste(deparse(substitute(x), width.cutoff = 500), collapse = " ") if (is.null(row.names)) { if (nrows == 0) row.names <- character(0) else if (length(row.names <- names(x)) == nrows && !any(duplicated(row.names))) { } else if (optional) row.names <- character(nrows) else row.names <- as.character(1:nrows) } names(x) <- NULL value <- list(x) if (!optional) names(value) <- nm attr(value, "row.names") <- row.names class(value) <- "data.frame" value } ### function to create factor from source.data class applying variable.labels # and copying all source.data.attributes # remark: factor(factsd(x)) drops unused factor levels and source.data class # factsd(x)[, drop=TRUE] drops unused factor levels but keeps # source.data class and attributes factsd <- function(x = character(), levels = sort(unique.default(x), na.last = TRUE), labels = levels, exclude = NA, ordered = NULL) { ## check if is of class source.data if ('source.data' %in% class(x)) { if(is.null(ordered)) ordered <- options()$source.data.is.ordered fx <- factor(x = x, levels = value.labels(x), labels = names(value.labels(x)), exclude = exclude, ordered = ordered) ## copy source.data.attributes atr <- attributes(x) atr.names <- names(atr) sda <- options()$'source.data.attributes' sda.match <- match(atr.names, sda) sda.match <- sda.match[!is.na(sda.match)] ## assign source.data.attributes to result if(length(sda.match)) for (i in sda.match) attr(fx, sda[i]) <- atr[[sda[i]]] ## add class source.data to result class(fx) <- c('source.data', attr(fx, 'class')) } else { if(is.null(ordered)) ordered <- is.ordered(x) fx <- factor(x = x, levels = levels, labels = labels, exclude = exclude, ordered = ordered) } fx }
Richard M. Heiberger
2006-Jul-13 16:36 UTC
[R] Keep value lables with data frame manipulation
> Further I do not see a simple method to label numerical > variables. I often encounter discrete, but still metric data, as e.g. risk > scores. Usually it would be nice to use them in their original coding, > which may include zero or decimal places and to label them at the same time.## For this specific case, I use a "position" attribute. tmp <- data.frame(y=rnorm(30), x=factor(rep(c(0,1,2,4,8), 6))) attr(tmp$x, "position") <- as.numeric(as.character(tmp$x)) tmp as.numeric(tmp$x) attr(tmp$x, "position") bwplot(y ~ x, data=tmp) panel.bwplot.position <- function(x, y, ..., x.at) { for (x.i in x.at) { y.i <- y[x.i==x] panel.bwplot(rep(x.i, length(y.i)), y.i, ...) } } bwplot.position <- function(formula, data, ..., x.at) { if (missing(x.at)) { x.name <- dimnames(attr(terms(formula),"factors"))[[2]] x.at <- attr(data[[x.name]], "position") } bwplot(formula, data, ..., x.at=x.at, panel=panel.bwplot.position, scales=list(x=list(at=x.at, limits=x.at+c(-1,1)))) } bwplot.position(y ~ x, data=tmp) ## The above is a simplified version of ## panel.bwplot.intermediate.hh ## in the online files for ## Statistical Analysis and Data Display ## Richard M. Heiberger and Burt Holland ## http://springeronline.com/0-387-40270-5 ## ## An example of a boxplot with both placement and color of the boxes ## under user control is in ## ## http://astro.ocis.temple.edu/~rmh/HH/bwplot-color.pdf
At 12:36 13.07.2006 -0400, Richard M. Heiberger wrote:>> Further I do not see a simple method to label numerical >> variables. I often encounter discrete, but still metric data, as e.g. risk >> scores. Usually it would be nice to use them in their original coding, >> which may include zero or decimal places and to label them at the sametime.> >## For this specific case, I use a "position" attribute. > > >tmp <- data.frame(y=rnorm(30), x=factor(rep(c(0,1,2,4,8), 6))) >attr(tmp$x, "position") <- as.numeric(as.character(tmp$x)) > >tmp >as.numeric(tmp$x) >attr(tmp$x, "position") > >bwplot(y ~ x, data=tmp) > >panel.bwplot.position <- function(x, y, ..., x.at) { > for (x.i in x.at) { > y.i <- y[x.i==x] > panel.bwplot(rep(x.i, length(y.i)), y.i, ...) > } > } > >bwplot.position <- function(formula, data, ..., x.at) { > if (missing(x.at)) { > x.name <- dimnames(attr(terms(formula),"factors"))[[2]] > x.at <- attr(data[[x.name]], "position") > } > bwplot(formula, data, ..., > x.at=x.at, > panel=panel.bwplot.position, > scales=list(x=list(at=x.at, limits=x.at+c(-1,1)))) >} > >bwplot.position(y ~ x, data=tmp) > > >## The above is a simplified version of >## panel.bwplot.intermediate.hh >## in the online files for >## Statistical Analysis and Data Display >## Richard M. Heiberger and Burt Holland >## http://springeronline.com/0-387-40270-5 >## >## An example of a boxplot with both placement and color of the boxes >## under user control is in >## >## http://astro.ocis.temple.edu/~rmh/HH/bwplot-color.pdf >Richard, I recognized your solution already last time you mentioned it and I am thinking about a similar one, (ab)using the names attribute. In principle it seems easy to solve this kind of problems with additional attributes, but without defining a new class and corresponding methods additional attributes get easily lost when indexing/subsetting. The names attribute seems to be rather "resistent". As far as I see, it survives indexing/subsetting and even sorting and this seems to be true also for factors. Greetings, Heinz