Paul Johnson
2006-Feb-19 20:16 UTC
[R] Converting factors back to numbers. Trouble with SPSS import data
I'm using Fedora Core 4, R-2.2.

The basic question is: can one recover the numerical values used in SPSS after importing data into R with read.spss from the foreign library? Here's why I ask.

My colleague sent an SPSS data set. I must replicate some results she calculated in SPSS and one problem is that the numbers used in SPSS for variable values are not easily recovered in R.

I'm comparing 2 imported datasets, "eldat" (read.spss with No convert-to-factors) and "eldatfac" (read.spss with convert-to-factors).

If I bring in the data without conversion to factors:

library(foreign)
eldat <- read.spss("18CitySCBSsorted.sav", use.value.labels=F, to.data.frame=T)

I can see the variable HAPPY is coded 0, 1, 2, 3. Those are the numbers that SPSS uses as contrast values when it runs a regression with HAPPY.

In contrast, allow R to translate the variables with a few value labels into factors.

library(foreign)
eldatfac <- read.spss("18CitySCBSsorted.sav", max.value.labels=7, to.data.frame=T)

Consider the first 50 observations on the variable HAPPY

> f <- eldatfac$HAPPY[1:50]
> f
 [1] Happy          Happy          Very happy     Happy          Very happy
 [6] Very happy     Happy          Very happy     Happy          Very happy
[11] Happy          Happy          Not very happy Very happy     Very happy
[16] Happy          Happy          Very happy     Happy          Happy
[21] Not very happy Happy          Happy          Very happy     Happy
[26] Happy          Happy          Happy          Happy          Happy
[31] Happy          Happy          Happy          Happy          Happy
[36] Happy          Very happy     Very happy     Happy          Very happy
[41] Very happy     Very happy     Happy          Very happy     Very happy
[46] Happy          Happy          Happy          Very happy     Very happy
6 Levels: Not happy at all Not very happy Happy Very happy ... Refused

> levels(f)
[1] "Not happy at all" "Not very happy"   "Happy"            "Very happy"
[5] "Don't know"       "Refused"

I need the numerical values back in order to have a regression like SPSS. Isn't this what ?factor says one ought to do? Why are these all missing?

> as.numeric(levels(f))[f]
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

> as.numeric(f)
 [1] 3 3 4 3 4 4 3 4 3 4 3 3 2 4 4 3 3 4 3 3 2 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 4 4
[39] 3 4 4 4 3 4 4 3 3 3 4 4

Comparing against the "as.numeric" output from the unconverted factor, I can see the levels are just one digit different.

> g <- eldat$HAPPY[1:50]
> as.numeric(g)
 [1] 2 2 3 2 3 3 2 3 2 3 2 2 1 3 3 2 2 3 2 2 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3
[39] 2 3 3 3 2 3 3 2 2 2 3 3

I'm more worried about the kinds of variables that are coded irregularly 1, 3, 7, 11 in the SPSS scheme.

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
Robert W. Baer, Ph.D.
2006-Feb-19 22:44 UTC
[R] Converting factors back to numbers. Trouble with SPSS import data
Quoted directly from the FAQ (although, granted, I need to look this up over and over myself; would that it had an easily remembered wrapper function):

  7.10 How do I convert factors to numeric?

  It may happen that when reading numeric data into R (usually, when reading in a file), they come in as factors. If f is such a factor object, you can use

      as.numeric(as.character(f))

  to get the numbers back. More efficient, but harder to remember, is

      as.numeric(levels(f))[as.integer(f)]

  In any case, do not call as.numeric() or their likes directly for the task at hand (as as.numeric() or unclass() give the internal codes).
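The FAQ recipe can be tried on made-up data. This is only a sketch: f_num and f_lab are illustrative names, not objects from the thread. It shows that the recipe returns the original numbers when the levels are numbers stored as text, and returns NA when the levels are labels such as "Happy".

```r
# Case FAQ 7.10 covers: levels are numbers stored as character strings.
f_num <- factor(c("10", "3", "7", "3"))

as.numeric(f_num)                             # 1 2 3 2  (internal codes, not the values)
as.numeric(as.character(f_num))               # 10 3 7 3
as.numeric(levels(f_num))[as.integer(f_num)]  # 10 3 7 3

# Case from the original question: levels are text labels.
f_lab <- factor(c("Happy", "Very happy", "Happy"),
                levels = c("Not happy at all", "Not very happy",
                           "Happy", "Very happy"))

# "Happy" cannot be parsed as a number, so every element becomes NA
# (R also issues a coercion warning, suppressed here).
suppressWarnings(as.numeric(as.character(f_lab)))  # NA NA NA

# The internal codes are positions in levels(f_lab), starting at 1.
as.integer(f_lab)                                  # 3 4 3
```

So the FAQ answer applies only when the levels themselves are numeric strings; with labelled levels the SPSS codes are simply not stored in the factor.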
Thomas Lumley
2006-Feb-20 01:16 UTC
[R] Converting factors back to numbers. Trouble with SPSS import data
On Sun, 19 Feb 2006, Paul Johnson wrote:

> I'm using Fedora Core 4, R-2.2.
>
> The basic question is: can one recover the numerical values used in
> SPSS after importing data into R with read.spss from the foreign
> library? Here's why I ask.
>
> My colleague sent an SPSS data set. I must replicate some results she
> calculated in SPSS and one problem is that the numbers used in SPSS
> for variable values are not easily recovered in R.
>
> I'm comparing 2 imported datasets, "eldat" (read.spss with No
> convert-to-factors) and
> "eldatfac" (read.spss with convert-to-factors)
>
> If I bring in the data without conversion to factors:
>
> library(foreign)
> eldat <- read.spss("18CitySCBSsorted.sav", use.value.labels=F,
> to.data.frame=T)
>
> I can see the variable HAPPY is coded 0, 1, 2, 3. Those are the
> numbers that SPSS
> uses as contrast values when it runs a regression with HAPPY.

So, bring in the data without conversion to factors.

Factors in R are not just labels for arbitrary numeric variables. They are a special type of variable for categorical data that happen to be implemented with the numbers 1,2,3,... If that isn't what you want, don't use factors.

read.spss will still return all the labels as attributes of the returned data frame.

> In contrast, allow R to translate the variables with a few value
> labels into factors.
>
> library(foreign)
> eldatfac <- read.spss("18CitySCBSsorted.sav",
> max.value.labels=7,to.data.frame=T)
>
> Consider the first 50 observations on the variable HAPPY
>
>> f<- eldatfac$HAPPY[1:50]
>> f
>  [1] Happy          Happy          Very happy     Happy          Very happy
>  [6] Very happy     Happy          Very happy     Happy          Very happy
> [11] Happy          Happy          Not very happy Very happy     Very happy
> [16] Happy          Happy          Very happy     Happy          Happy
> [21] Not very happy Happy          Happy          Very happy     Happy
> [26] Happy          Happy          Happy          Happy          Happy
> [31] Happy          Happy          Happy          Happy          Happy
> [36] Happy          Very happy     Very happy     Happy          Very happy
> [41] Very happy     Very happy     Happy          Very happy     Very happy
> [46] Happy          Happy          Happy          Very happy     Very happy
> 6 Levels: Not happy at all Not very happy Happy Very happy ... Refused
>
>> levels(f)
> [1] "Not happy at all" "Not very happy"   "Happy"            "Very happy"
> [5] "Don't know"       "Refused"
>
> I need the numerical values back in order to have a regression like
> SPSS. Isn't this what ?factor says one ought to do? Why are these all
> missing?
>
>> as.numeric(levels(f))[f]
>  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
> [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

No, this is not what ?factor says you should do. This is what you do if your levels are numbers (in character form) and you want those numbers. "Happy" is not a number.

>> as.numeric(f)
>  [1] 3 3 4 3 4 4 3 4 3 4 3 3 2 4 4 3 3 4 3 3 2 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 4 4
> [39] 3 4 4 4 3 4 4 3 3 3 4 4
>
> Comparing against the "as.numeric" output from the unconverted factor,
> I can see the levels are just one digit different.

Yes, because SPSS used the codes 0,1,2,3 and R uses 1,2,3,4. You could just subtract 1 if you want the numbers to be smaller by 1.

>> g <- eldat$HAPPY[1:50]
>> as.numeric(g)
>  [1] 2 2 3 2 3 3 2 3 2 3 2 2 1 3 3 2 2 3 2 2 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3
> [39] 2 3 3 3 2 3 3 2 2 2 3 3
>
> I'm more worried about the kinds of variables that are coded
> irregularly 1, 3, 7, 11 in the SPSS scheme.

If you want to keep the numeric values, don't change them to factors. That's why there is an option.

	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
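For the irregular case (codes such as 1, 3, 7, 11), subtracting 1 from the factor codes no longer works. The sketch below shows one way to map labels back to codes when only the factor version is at hand. Everything in it is hypothetical: the labels, the lookup table spss_codes, and the data are typed by hand for illustration and do not come from the SPSS file in the thread.

```r
# Hypothetical label-to-code table for a variable coded 1, 3, 7, 11 in SPSS.
# In practice this information has to come from the SPSS file itself.
spss_codes <- c("Low" = 1, "Medium" = 3, "High" = 7, "Refused" = 11)

# A factor like the one read.spss() returns when value labels are converted.
x_fac <- factor(c("High", "Low", "Refused", "Medium", "Low"),
                levels = names(spss_codes))

# R's internal codes are just positions in levels(x_fac): 3 1 4 2 1
as.integer(x_fac)

# Look each label up in the table to get the SPSS codes back: 7 1 11 3 1
x_num <- unname(spss_codes[as.character(x_fac)])
x_num
```

As the reply above says, the simpler route is to import with use.value.labels=F so the numeric codes are never replaced, and to read the labels from the attributes that read.spss attaches. I believe the per-variable attribute is named "value.labels", but treat that name as an assumption and inspect attributes(eldat$HAPPY) to confirm it for your version of foreign.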