Paul Johnson
2009-Mar-03 03:57 UTC
[R] SPSS data import: problems & work arounds for GSS surveys
I'm using R 2.8.1 on Ubuntu 8.10. I'm writing partly to ask what's wrong, partly to tell other users who search that there is a work around.

The General Social Survey is a long standing series of surveys provided by NORC (National Opinion Research Center). I have downloaded some years of the survey data in SPSS format (here's the site: http://www.norc.org/GSS+Website/Download/SPSS+Format/). When I try to import using foreign, I get an error like so:

> library(foreign)
> dat <- read.spss("gss2006.sav", to.data.frame=T, trim.factor.names=T)
Error in inherits(x, "factor") : object "cp" not found
In addition: Warning messages:
1: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable TVRELIG
2: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable SEI
3: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable FIRSTSEI
4: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable PASEI
5: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable MASEI
6: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable SPSEI
7: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 0.75 for variable YEARSJOB
8: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File-indicated character representation code (1252) looks like a Windows codepage

No dat object is created from this.

I have found a work around.
I installed PSPP version 0.6.0 and used it to open the sav file, and then re-save it in SPSS sav format. That creates an SPSS file that foreign's function can open.

I still see the warnings about redundant value labels, but as far as I can see these are harmless. A working object is obtained like so:

> dat <- read.spss("gss-pspp.sav")
Warning messages:
1: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable TVRELIG
2: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 0.75 for variable YEARSJOB
3: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable SEI
4: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable FIRSTSEI
5: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable PASEI
6: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable MASEI
7: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable SPSEI

There is still some trouble with the importation of this SPSS file, however. It has the symptoms of being a non-rectangular data array, I think.
What do you think about these warnings:

> dat <- read.spss("gss-pspp.sav",to.data.frame=T)
There were 22 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable TVRELIG
2: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 0.75 for variable YEARSJOB
3: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable SEI
4: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable FIRSTSEI
5: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable PASEI
6: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable MASEI
7: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable SPSEI
8: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
9: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
10: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
11: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
12: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
13: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
14: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
15: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
16: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
17: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
18: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
19: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
20: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
21: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
22: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length

While puzzling over this, I have tested the SPSS functions in the package memisc. This has some truly handy features! Read ?importer and you'll see it can generate a list of variables as well as a codebook. It can also handle an SPSS portable file.

Importer works a little bit like SPSS, actually, because the metadata is accessed, but the data is not really loaded until later (as far as I can tell, one must run either subset or as.data.set to force the actual data read). One can generate the description and codebook without accessing the data.

> idat <- spss.system.file("gss2006.sav")
> show(idat)

SPSS system file 'gss2006.sav'
with 5137 variables and 4510 observations

A subset function can access the particular variables from the data.

> idat2 <- subset(idat, select=c(gunlaw))
> idat2

Data set with 4510 observations and 1 variables

   gunlaw
 1 OPPOSE
 2 *NAP
 3 *NAP
 4 FAVOR
 5 FAVOR
 6 *NAP
 7 FAVOR
 8 *NAP
 9 FAVOR
10 FAVOR
11 FAVOR
12 FAVOR
13 FAVOR
14 *NAP
15 *NAP
16 *NAP
17 FAVOR
18 *NAP
19 FAVOR
20 *NAP
21 *NAP
22 OPPOSE
23 *NAP
24 *NAP
25 *NAP
.. ......
(25 of 4510 observations shown)

and the function "as.data.set" will force a full read of all the data columns:

> idat3 <- as.data.set(idat)
> table(idat3$gunlaw, idat2$gunlaw)

       0    1    2    8    9
  0 2507    0    0    0    0
  1    0 1568    0    0    0
  2    0    0  395    0    0
  8    0    0    0   35    0
  9    0    0    0    0    5

So, in conclusion, I've found troubles with read.spss in foreign, but have been able to work around that by accessing data with PSPP or the functions from the memisc package. The only advantage of using the PSPP program (its GUI is psppire) is that you can see the data in a rectangular spreadsheet that is more-or-less searchable. It has that same hard-to-use interface pioneered at SPSS (it hides variable names and displays descriptions in choosers). But the rectangular display in PSPP is nice.

pj

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
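A note on warnings 8 to 22 in the message above. One plausible reading, not confirmed against the foreign source here, is that the third term of the warned expression, xi[xi == z[3L]], is a subset of xi and not a logical vector of the same length as xi, so R has to recycle it against the two full-length comparisons. The sketch below reproduces the same warning text in a self-contained way. The values of xi and z, and the names full_length, subset_term and msg, are made up for illustration and are not taken from the GSS file.

```r
# Illustrative values only; xi and z are NOT taken from the GSS file.
xi <- c(1, 5, 9, 9, 3)   # a made-up data column
z  <- c(8, 0, 9)         # made-up values standing in for z[1L], z[2L], z[3L]

full_length <- xi >= z[1L] | xi <= z[2L]   # logical vector, same length as xi (5)
subset_term <- xi[xi == z[3L]]             # only the matching values (length 2)

# Combining a length-5 vector with a length-2 vector forces recycling.
# 5 is not a multiple of 2, so R raises a warning, which is captured here.
msg <- tryCatch(
  {
    full_length | subset_term
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
print(msg)
```

If this reading is right, the warnings would point to a length mismatch inside a comparison and not necessarily to a non-rectangular data file.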
John Fox
2009-Mar-03 13:43 UTC
[R] SPSS data import: problems & work arounds for GSS surveys
Dear Paul,

I encountered this problem the other day, and it went away when I updated the foreign package from version 0.8-32 to 0.8-33.

I hope this helps,
John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox
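To check whether a given installation predates the version mentioned in the reply, the installed version of foreign can be compared against 0.8-33. This is a minimal sketch using base R functions only (packageDescription and compareVersion). The variable names installed and needs_update are illustrative, and the sketch assumes foreign is installed, which is normally the case because it ships with R.

```r
# Installed version of foreign as a string, for example "0.8-33".
installed <- packageDescription("foreign")$Version

# compareVersion() returns -1 when its first argument is the older version.
needs_update <- compareVersion(installed, "0.8-33") < 0

if (needs_update) {
  message("foreign ", installed, " is older than 0.8-33; ",
          "consider running install.packages(\"foreign\")")
} else {
  message("foreign ", installed, " is at least 0.8-33")
}
```

The sketch only reports the comparison. It does not install anything, so it is safe to run before deciding whether to update.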