thr3ads.net - R help - [R] Keep value lables with data frame manipulation [Jul 2006]

Jol, Arne

2006-Jul-12 16:41 UTC

[R] Keep value lables with data frame manipulation

Dear R,

I import data from SPSS into an R data.frame. On this raw data I do some
data processing (selection of observations, normalization, recoding of
variables, etc.). The result is stored in a new data.frame; however, in
this new data.frame the value labels are lost.

Example of what I do in code:

# read raw data from spss
library(foreign)  # provides read.spss()
rawdata <- read.spss("./data/T50937.SAV",
                     use.value.labels = FALSE, to.data.frame = TRUE)

# select the observations that we need
diarydata <- rawdata[rawdata$D22 == 2  | rawdata$D22 == 3  | rawdata$D22 == 17 |
                     rawdata$D22 == 18 | rawdata$D22 == 20 | rawdata$D22 == 22 |
                     rawdata$D22 == 24 | rawdata$D22 == 33, ]

The result is that rawdata$D22 has value labels and that diarydata$D22
is numeric without value labels.

Question: How can I prevent this from happening?

Thanks in advance!
Groeten,
Arne

Marc Schwartz (via MN)

2006-Jul-12 18:14 UTC

Two things:

1. With respect to your subsetting, your lengthy code can be replaced
with the following:

  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))

See ?subset and ?"%in%" for more information.
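
One practical difference between the two forms is how they treat missing values. The following is a small sketch on made-up data (only the column name D22 comes from the thread; the values and the NA are assumptions for illustration):

```r
# Toy stand-in for the SPSS data; values are invented for illustration.
rawdata <- data.frame(id = 1:6, D22 = c(2, 3, NA, 17, 5, 33))
wanted  <- c(2, 3, 17, 18, 20, 22, 24, 33)

# %in% never returns NA: a missing D22 simply does not match.
by_in <- rawdata$D22 %in% wanted

# Chained "==" comparisons joined by "|" return NA where D22 is NA.
by_or <- Reduce(`|`, lapply(wanted, function(v) rawdata$D22 == v))

sub_subset  <- subset(rawdata, D22 %in% wanted)  # rows with NA are dropped
sub_bracket <- rawdata[by_or, ]                  # NA index yields an all-NA row

nrow(sub_subset)
nrow(sub_bracket)
```

With an NA in D22, indexing by the chained comparison keeps an extra row filled with NA, while subset() with %in% does not.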


2. With respect to keeping the label-related attributes, the
'value.labels' attribute and the 'variable.labels' attribute will not, by
default, survive the use of "[.data.frame" in R (see ?Extract and
?"[.data.frame").

On the other hand, based upon my review of ?read.spss, the SPSS value
labels should be converted to the factor levels of the respective
columns when 'use.value.labels = TRUE' and these would survive a
subsetting.
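
The difference can be seen without an SPSS file. This is a minimal sketch on invented data; the codes and label names are assumptions, and the "value.labels" attribute is attached by hand to mimic what read.spss does with 'use.value.labels = FALSE':

```r
codes <- c(2, 3, 17, 3, 2)
labs  <- c(work = 2, sleep = 3, travel = 17)   # made-up value labels

df <- data.frame(id = seq_along(codes))
df$num <- codes
attr(df$num, "value.labels") <- labs           # numeric column plus attribute
df$fac <- factor(codes, levels = unname(labs), labels = names(labs))

sub <- df[df$num %in% c(2, 3), ]

attr(df$num, "value.labels")    # still present on the original column
attr(sub$num, "value.labels")   # NULL: the attribute was dropped by "["
levels(sub$fac)                 # factor levels are carried along
```

Note that the factor keeps all of its levels after subsetting, including ones that no longer occur in the selected rows.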

If you want to consider a solution to the attribute subsetting issue,
you might want to review the following post by Gabor Grothendieck in
May, which provides a possible solution:

  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html

and this post by me, for an explanation of what is happening in Gabor's
solution:

  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html

HTH,

Marc Schwartz

Heinz Tuechler

2006-Jul-13 08:59 UTC

Hello Marc and Arne,

I worked on the suggestions of Gabor and Marc and programmed some functions
along these lines, but they are very, very preliminary (see below).
In my view R lacks convenient ways to document empirical data with
variable labels, value labels, etc. I would prefer to have these
facilities in the "standard" configuration.
So I sketched a concept, but in my view it would only be useful if there
were some acceptance by the core developers of R.

The concept would be to define a class. For now I call it "source.data".
To make it more flexible than the Hmisc class "labelled", I would define a
related option "source.data.attributes" with default
c('value.labels', 'variable.name', 'label'). This option contains all
attributes that should persist in subsetting/indexing.

I made only some very, very preliminary tests with these functions, mainly
because I am not happy with defining a new class. Instead I would prefer
this functionality to be integrated in the Hmisc class "labelled",
since this is in my view the best known starting point for data
documentation in R.

I would be happy if there were some discussion about the wishes/needs of
other R users concerning data documentation.

Greetings,

Heinz


### intention and concept
#   There should be a convenient possibility to keep source data numerical
#   coded and at the same time have labelled categories.
#   Such labelled categorical numerical data should be easily converted
#   to factors.
#   Indexing/subsetting should preserve the concerned attributes of this data.

### description of (intended!!!) functionality
#   - a class source.data is defined. It is intended only for atomic objects.
#   - option source.data.attributes defines which attributes will be copied
#     in indexing/subsetting objects of class source.data
#   - option source.data.is.ordered sets defining factors as ordered, when
#     built from objects of class source.data by the function factsd
#   - function 'value.labels<-' assigns an attribute value.labels and sets
#     class source.data
#   - function value.labels reads the attribute value.labels
#   - the indexing method '[.source.data' defines indexing for source.data
#   - the print method print.source.data ignores source.data.attributes in
#     printing
#   - the as.data.frame method as.data.frame.source.data enables inclusion
#     of objects of class source.data in data.frames
#   - function factsd should in general behave as function factor but should
#     in case of an object of class source.data by default use the
#     value.labels as levels and the names(value.labels) as the labels of
#     the newly built factor.
#     If the parameter ordered is NULL it should create ordered factors
#     according to the option source.data.is.ordered.

### set option for source.data.attributes
options(source.data.attributes = c('value.labels', 'variable.name', 'label'))
### set option for converting source.data class in ordered factors
options(source.data.is.ordered=TRUE)

### function to assign value.labels
'value.labels<-' <- function (x, value)
  ## adapted from Hmisc function label 30.6.2006
{
  if(!is.atomic(x))
    stop('value.labels<- is applicable to atomic objects only')
  structure(x, value.labels = value,
            class = c("source.data",
                      attr(x, "class")[attr(x, "class") != "source.data"]))
}

### function to read value.labels
value.labels <- function (x) { attr(x, 'value.labels') }

### definition of indexing method for class=source.data
##  source.data.attributes shall be conserved
"[.source.data" <- function(x, ...)
{
  atr <- attributes(x)
  atr.names <- names(atr)
  sda <- options()$'source.data.attributes'
  sda.match <- match(atr.names, sda)
  sda.match <- sda.match[!is.na(sda.match)]
  x <- NextMethod("[")
  ## assign source.data.attributes to result
  if(length(sda.match))
    for (i in sda.match) attr(x, sda[i]) <- atr[[sda[i]]]
  ## assign class source.data to result
  class(x) <- c('source.data',
                attr(x, "class")[attr(x, "class") != "source.data"])
  x
}

### print method for source.data
'print.source.data' <- function (x, ...) 
{
  ## adapted from Hmisc print.labelled 31.5.2006
  x.orig <- x
  ## look if there are source.data.attributes
  sda <- options()$'source.data.attributes'
  sda.match <- match(names(attributes(x)), sda)
  sda.match <- sda.match[!is.na(sda.match)]
  ## delete source.data.attributes for printing
  if(length(sda.match))
    for (i in sda.match) attr(x, sda[i]) <- NULL
  ## delete class source.data for printing
  class(x) <- if (length(class(x)) == 1 && class(x) == "source.data")
    NULL
  else class(x)[class(x) != "source.data"]
  NextMethod("print")
  invisible(x.orig)
}

### Define function as.data.frame.source.data (copy from as.data.frame.vector)
#   many as.data.frame methods are identical to this
##  different functions as.data.frame are besides others:
#   as.data.frame.list, as.data.frame.default, as.data.frame.data.frame,
#   as.data.frame.character, as.data.frame.AsIs, as.data.frame.array,

as.data.frame.source.data <- 
  function (x, row.names = NULL, optional = FALSE, ...)
  ## copy from as.data.frame.vector 1.6.2006
  ## '...' added so that further arguments passed on by callers such as
  ## data.frame() are accepted, as the as.data.frame generic allows
{
  nrows <- length(x)
  nm <- paste(deparse(substitute(x), width.cutoff = 500), collapse = " ")
  if (is.null(row.names)) {
    if (nrows == 0) 
      row.names <- character(0)
    else if (length(row.names <- names(x)) == nrows &&
             !any(duplicated(row.names))) {
    }
    else if (optional) 
      row.names <- character(nrows)
    else row.names <- as.character(1:nrows)
  }
  names(x) <- NULL
  value <- list(x)
  if (!optional) 
    names(value) <- nm
  attr(value, "row.names") <- row.names
  class(value) <- "data.frame"
  value
}

### function to create factor from source.data class applying value.labels
#   and copying all source.data.attributes
#   remark: factor(factsd(x)) drops unused factor levels and source.data class
#           factsd(x)[, drop=TRUE] drops unused factor levels but keeps
#           source.data class and attributes

factsd <- function(x = character(),
                   levels = sort(unique.default(x), na.last = TRUE),
                   labels = levels, exclude = NA, ordered = NULL)
{
  ## check if is of class source.data
  if ('source.data' %in% class(x))
    {
      if(is.null(ordered)) ordered <- options()$source.data.is.ordered
      fx <- factor(x = x, levels = value.labels(x),
                   labels = names(value.labels(x)),
                   exclude = exclude,
                   ordered = ordered)
      ## copy source.data.attributes
      atr <- attributes(x)
      atr.names <- names(atr)
      sda <- options()$'source.data.attributes'
      sda.match <- match(atr.names, sda)
      sda.match <- sda.match[!is.na(sda.match)]
      ## assign source.data.attributes to result
      if(length(sda.match))
        for (i in sda.match) attr(fx, sda[i]) <- atr[[sda[i]]]
      ## add class source.data to result
      class(fx) <- c('source.data', attr(fx, 'class'))       
    }
  else {
    if(is.null(ordered)) ordered <- is.ordered(x)
    fx <- factor(x = x, levels = levels, labels = labels,
                 exclude = exclude, ordered = ordered)
  }
  fx
}
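
The core idea of the indexing method above can be condensed into a self-contained sketch. The class name "kept", the variable keep_attrs and the data are hypothetical and chosen only for illustration; this is not the code above, just a minimal instance of the same technique (an S3 "[" method that restores selected attributes after NextMethod):

```r
# Attributes that should survive indexing (hypothetical list).
keep_attrs <- c("value.labels", "label")

"[.kept" <- function(x, ...) {
  atr   <- attributes(x)
  saved <- atr[intersect(names(atr), keep_attrs)]
  cls   <- oldClass(x)
  out   <- NextMethod("[")             # ordinary subsetting of the data
  for (nm in names(saved)) attr(out, nm) <- saved[[nm]]
  oldClass(out) <- cls                 # put the class back on the result
  out
}

x <- structure(c(2, 3, 17, 3),
               value.labels = c(work = 2, sleep = 3, travel = 17),
               class = "kept")

df <- data.frame(id = 1:4)
df$D22 <- x

# Row subsetting of the data frame dispatches to "[.kept" for this column.
sub <- df[c(TRUE, TRUE, FALSE, TRUE), ]

attr(sub$D22, "value.labels")
class(sub$D22)
```

Because "[.data.frame" subsets each column with "[", a column-level method like this is enough for the attribute to survive row selection.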

Richard M. Heiberger

2006-Jul-13 16:36 UTC

> Further I do not see a simple method to label numerical
> variables. I often encounter discrete, but still metric data, as e.g. risk
> scores. Usually it would be nice to use them in their original coding,
> which may include zero or decimal places and to label them at the same
> time.

## For this specific case, I use a "position" attribute.


library(lattice)  # bwplot() and panel.bwplot() come from lattice

tmp <- data.frame(y=rnorm(30), x=factor(rep(c(0,1,2,4,8), 6)))
attr(tmp$x, "position") <- as.numeric(as.character(tmp$x))

tmp
as.numeric(tmp$x)
attr(tmp$x, "position")

bwplot(y ~ x, data=tmp)

panel.bwplot.position <- function(x, y, ..., x.at) {
         for (x.i in x.at) {
           y.i <- y[x.i==x]
           panel.bwplot(rep(x.i, length(y.i)), y.i, ...)
         }
       }

bwplot.position <- function(formula, data, ..., x.at) {
  if (missing(x.at)) {
    x.name <- dimnames(attr(terms(formula),"factors"))[[2]]
    x.at <- attr(data[[x.name]], "position")
  }
  bwplot(formula, data, ...,
         x.at=x.at,
         panel=panel.bwplot.position,
         scales=list(x=list(at=x.at, limits=x.at+c(-1,1))))
}

bwplot.position(y ~ x, data=tmp)


## The above is a simplified version of
##     panel.bwplot.intermediate.hh
## in the online files for
##                 Statistical Analysis and Data Display
##                 Richard M. Heiberger and Burt Holland
##    http://springeronline.com/0-387-40270-5
## 
## An example of a boxplot with both placement and color of the boxes
## under user control is in
## 
## http://astro.ocis.temple.edu/~rmh/HH/bwplot-color.pdf
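
A short sketch of why a separate "position" attribute is useful here: converting a factor directly to numeric returns the internal level codes, not the original values. The vector below is made up for illustration.

```r
x <- factor(c(0, 1, 2, 4, 8, 4))

codes    <- as.integer(x)                 # internal codes, not the values
original <- as.numeric(as.character(x))   # recovers the original values
same     <- as.numeric(levels(x))[x]      # equivalent, via the levels

codes
original
```

Storing the original values in an attribute, as in the code above, avoids repeating this conversion wherever the numeric positions are needed.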

Heinz Tuechler

2006-Jul-13 23:22 UTC

Richard,

I noticed your solution the last time you mentioned it, and I am
thinking about a similar one, (ab)using the names attribute.
In principle it seems easy to solve this kind of problem with additional
attributes, but without defining a new class and corresponding methods,
additional attributes easily get lost when indexing/subsetting.
The names attribute seems to be rather "resistant". As far as I can see, it
survives indexing/subsetting and even sorting, and this seems to be true
also for factors.
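
A minimal sketch of this behaviour, on made-up data (the attribute name "label" and all values are assumptions for illustration):

```r
x <- c(a = 17, b = 2, c = 3)
attr(x, "label") <- "activity code"

y <- x[x > 2]      # indexing keeps names, drops the other attribute
s <- sort(x)       # sorting reorders the names along with the values

names(y)
attr(y, "label")
names(s)

f  <- factor(c(p = "low", q = "high", r = "low"))
fs <- f[c(1, 3)]   # names also survive indexing of a factor
names(fs)
```

The names travel with the elements, whereas the extra "label" attribute is gone after indexing.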

Greetings,

Heinz
