thr3ads.net - R help - [R] maintaining variable types in data frames [Jan 2009]

Mike Miller

2009-Jan-22 17:36 UTC

[R] maintaining variable types in data frames

Suppose X and Y are two data frames with the same structures, variable 
names and dimensions but with different data and different patterns of 
missing.  I want to replace missing values in Y with corresponding values 
from X.  I'll construct a simple two-by-two case:
> X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
> X[,2] <- as.integer(X[,2])
> str(X)
'data.frame':   2 obs. of  2 variables:
 $ V1: chr  "a" "b"
 $ V2: int  1 2

> Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
> Y[,2] <- as.integer(Y[,2])
> str(Y)
'data.frame':   2 obs. of  2 variables:
 $ V1: chr  "c" "d"
 $ V2: int  NA 4

This seems to be what I want to do...
> Y[is.na(Y)] <- X[is.na(Y)]
...and it works except that the structure of Y is changed so that Y$V2 is 
now of type chr instead of type int:
> str(Y)
'data.frame':   2 obs. of  2 variables:
 $ V1: chr  "c" "d"
 $ V2: chr  "1" "4"

This behavior makes sense because the vector X[is.na(Y)] is of the 
character type:
> is.character(X[is.na(Y)])
[1] TRUE
> str(X[is.na(Y)])
 chr "1"
> X[is.na(Y)]
[1] "1"

The last couple of results seem weird at first.  The "1" was originally an 
integer but now it is a character.  This *must* be because the typing is 
done at an earlier stage in the process, back when R decides which 
elements of X have to be checked against the logical matrix is.na(Y).  It 
then decides the type for the vector and only afterward does it find that 
only one of the four elements of X will be selected, but it was prepared 
from that early stage for any of the four, even all four of them, to be 
selected.
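
One way to see the coercion directly: the R documentation for `[.data.frame` notes that when a data frame is indexed with a logical matrix, the data frame is first coerced to a matrix. A matrix holds a single type, so a mix of character and integer columns becomes all character before any element is selected. The following is a minimal sketch of that step; it rebuilds X and Y with data.frame() rather than the matrix route used above, which is an assumption made only to keep the example short.

```r
# Same X and Y as above, built directly (illustrative shortcut)
X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

# Coercing to a matrix forces one common type for every cell
m <- as.matrix(X)
typeof(m)            # "character"

# Matrix-style extraction on the data frame gives the same character result
m[is.na(Y)]
X[is.na(Y)]
```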

Suppose there were no NA elements in Y, what should we expect to see if we 
repeat what we did above?
> Y <- as.data.frame(matrix(c("c","d",3,4),2,2), stringsAsFactors=FALSE)
> Y[,2] <- as.integer(Y[,2])
> str(Y)
'data.frame':   2 obs. of  2 variables:
 $ V1: chr  "c" "d"
 $ V2: int  3 4

Even though there are no elements in X[is.na(Y)], the empty result is still 
a vector of type chr:
> is.vector(X[is.na(Y)])
[1] TRUE
> is.character(X[is.na(Y)])
[1] TRUE
> str(X[is.na(Y)])
 chr(0)
> X[is.na(Y)]
character(0)

So what happens if we do this...
> Y[is.na(Y)] <- X[is.na(Y)]
...will it change the structure of Y so that Y$V2 becomes type chr?
> str(Y)
'data.frame':   2 obs. of  2 variables:
 $ V1: chr  "c" "d"
 $ V2: int  3 4

No.  I think there is an obvious reason for that:  Y was not changed, and 
more specifically, Y$V2 was not changed, so no change was made to the 
variable types.

It all makes sense, but I want an easy way to maintain the structure of a 
data frame when I do this kind of operation. I ought to be able to do 
something like this:

Ytypes <- get_types(Y)

Y[is.na(Y)] <- X[is.na(Y)]

use_types(Y, Ytypes)

That kind of system would ensure that the basic structure of the data 
frame can be maintained.  I don't want to have to check by hand, and 
sometimes it would be impossible to do so.
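
For what it is worth, the wished-for pair can be sketched in a few lines of base R. The names get_types and use_types come from the pseudocode above, but the bodies here are hypothetical: they record typeof() for each column and convert back with as.vector(), which is only safe for plain atomic columns (character, integer, double, logical). Factors and other classed columns would lose their attributes, so treat this as a sketch under that assumption.

```r
# Hypothetical helper: record the storage type of each column
get_types <- function(df) vapply(df, typeof, character(1))

# Hypothetical helper: convert any column whose type changed back again.
# Assumes plain atomic columns; as.vector() drops attributes such as levels.
use_types <- function(df, types) {
  for (j in seq_along(types)) {
    if (typeof(df[[j]]) != types[[j]]) {
      df[[j]] <- as.vector(df[[j]], mode = types[[j]])
    }
  }
  df
}

X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

Ytypes <- get_types(Y)
Y[is.na(Y)] <- X[is.na(Y)]    # V2 becomes character here
Y <- use_types(Y, Ytypes)     # the result must be assigned back
str(Y)
```

Note that the result has to be assigned back to Y, because R functions return a modified copy rather than changing their argument in place.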

So what's the trick?  Is there a trick?

Mike

Mike Miller

2009-Jan-23 03:44 UTC

[R] maintaining variable types in data frames

On Thu, 22 Jan 2009, Mike Miller wrote:
> Suppose X and Y are two data frames with the same structures, variable 
> names and dimensions but with different data and different patterns of 
> missing.  I want to replace missing values in Y with corresponding 
> values from X.  I'll construct a simple two-by-two case:
>
>> X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
>> X[,2] <- as.integer(X[,2])
>> str(X)
> 'data.frame':   2 obs. of  2 variables:
>  $ V1: chr  "a" "b"
>  $ V2: int  1 2
>
>> Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
>> Y[,2] <- as.integer(Y[,2])
>> str(Y)
> 'data.frame':   2 obs. of  2 variables:
>  $ V1: chr  "c" "d"
>  $ V2: int  NA 4
>
> This seems to be what I want to do...
>
>> Y[is.na(Y)] <- X[is.na(Y)]
>
> ...and it works except that the structure of Y is changed so that Y$V2 is
> now of type chr instead of type int:
>
>> str(Y)
> 'data.frame':   2 obs. of  2 variables:
>  $ V1: chr  "c" "d"
>  $ V2: chr  "1" "4"

I figured out a good answer.  We can just decide the list of columns we 
want to work with and then use a for loop.  This avoids problems with 
changing variable types:

cols <- 38:47
keep <- is.na(Y)
for (i in cols) { nas <- which(keep[,i]); if ( length(nas) > 0 ) { Y[nas,i] <- X[nas,i] }}

Something like that makes for a good one-liner on the interactive command 
line, but this looks neater in a script:

cols <- 38:47
keep <- is.na(Y)
for (i in cols) {
   nas <- which(keep[,i])
   if ( length(nas) > 0 ) {
      Y[nas,i] <- X[nas,i]
   }
}

It shouldn't be too hard to write a function that does that kind of thing.
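
Such a function might look like the sketch below. The name fill_na_from and its arguments are made up for illustration; the body is the same column-by-column loop, so each column is only ever assigned values from the matching column of the other data frame and keeps its type. It assumes the two data frames have identical dimensions and column order, and it does not handle factors with differing levels.

```r
# Hypothetical wrapper around the loop above.
# y, x: data frames with the same dimensions and column order
# cols: columns to process (defaults to all of them)
fill_na_from <- function(y, x, cols = seq_along(y)) {
  stopifnot(identical(dim(x), dim(y)))
  for (i in cols) {
    nas <- which(is.na(y[[i]]))
    if (length(nas) > 0) {
      y[[i]][nas] <- x[[i]][nas]   # same column in both, so no type change
    }
  }
  y
}

X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

Y2 <- fill_na_from(Y, X)
str(Y2)
```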

The only problem I know of is with factors: if X and Y contain factors that 
don't have exactly the same levels, the assignment could go wrong.  It 
would probably take a few more lines to deal with this.
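
Those few more lines could be something like the following sketch, which handles a single pair of factor columns. The function name fill_factor_na is hypothetical. The idea is to rebuild both factors on the union of their levels before copying values across, so that a value present only in the source factor is not turned into NA by the assignment.

```r
# Hypothetical helper for one pair of factor columns.
# fy: factor with missing values; fx: factor supplying the replacements
fill_factor_na <- function(fy, fx) {
  all_levels <- union(levels(fy), levels(fx))
  fy <- factor(fy, levels = all_levels)   # re-level both on the same set
  fx <- factor(fx, levels = all_levels)
  nas <- is.na(fy)
  fy[nas] <- fx[nas]
  fy
}

fy <- factor(c("a", NA, "b"))     # levels: a, b
fx <- factor(c("a", "c", "b"))    # levels: a, b, c
res <- fill_factor_na(fy, fx)
levels(res)
```

Inside the loop, a column could be routed through a helper like this whenever is.factor() is true for it, and assigned directly otherwise.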

A couple of people wrote to me with helpful suggestions, but no one had a 
really great, established kind of solution.  I'm a little surprised.  But, 
with an average of 125 messages per day (!) on this list, I shouldn't be 
surprised that a long message like this one won't be read by everyone.

Best,
Mike
