Frederik Elwert
2009-Aug-26 09:55 UTC
[R] Issues with factors with duplicate (empty) levels
Hello!

I imported a DJI survey[1] from an SPSS file. When looking at some of the variables, I noticed problems with the `table` function and similar functions. They seem to be caused by duplicate levels that are generated from the value labels. Not all values have labels, so those that don't get an empty string as their level, which leads to duplicates.

I hope the code and output below illustrate the problem. Is it possible to prevent this? I'd still like to use the labels, so using numeric vectors instead of factors is not the best solution.

Regards,
Frederik

> library(foreign)
> Data <- read.spss("js2003_16_29_db.sav", to.data.frame=TRUE, reencode="latin1")
> table(Data$J203_A)

überhaupt nicht wichtig
                     35                    2256                       0

                      0                       0                       0
           sehr wichtig         Mehrfachnennung
                   4660                       0
> table(as.numeric(Data$J203_A))

   1    2    3    4    5    6    7
  35   39   84  227  626 1280 4660
> is.factor(Data$J203_A)
[1] TRUE
> levels(Data$J203_A)
[1] "überhaupt nicht wichtig" " "
[3] " "                       " "
[5] " "                       " "
[7] "sehr wichtig"            "Mehrfachnennung"

[1] http://213.133.108.158/surveys/index.php?m=msw,0&sID=54
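The arithmetic behind the odd `table` output can be checked by hand. The following is only a small sketch using the numbers printed above; the vector names `counts` and `labels` are illustrative and not part of the original session. It shows that the 2256 reported under the blank label is the sum of the counts for codes 2 to 6, which all share the same label.

```r
# Counts per underlying integer code, copied from table(as.numeric(Data$J203_A))
counts <- c(35, 39, 84, 227, 626, 1280, 4660)

# Labels of levels 1 to 7, copied from levels(Data$J203_A)
# (illustrative vector; the eighth level, "Mehrfachnennung", has no observations)
labels <- c("überhaupt nicht wichtig", " ", " ", " ", " ", " ", "sehr wichtig")

# Position of the first repeated label; 0 would mean all labels are unique
first_dup <- anyDuplicated(labels)

# All observations whose label is blank end up in one cell
blank_total <- sum(counts[labels == " "])

print(first_dup)
print(blank_total)
```

Under these assumptions, the blank cell absorbs every unlabeled value, and the remaining blank cells are reported as 0.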
Frederik Elwert
2009-Aug-27 12:44 UTC
[R] Issues with factors with duplicate (empty) levels
Hello again,

Just for your information: I think I found a way to work around the problem described in my previous message. I don't know if it's the most elegant way, but it seems to work. The loop below replaces each blank level with its position among the levels, so that every level is unique again.

for (i in seq_along(Data)) {
    if (is.factor(Data[,i])) {
        lvl <- levels(Data[,i])
        if (" " %in% lvl) {
            # Mark the blank levels
            empty <- lvl == " "
            # Use the level's position as its label
            lvl[empty] <- seq_along(lvl)[empty]
            levels(Data[,i]) <- lvl
        }
    }
}

> table(Data$J203_A)

überhaupt nicht wichtig                       2                       3
                     35                      39                      84
                      4                       5                       6
                    227                     626                    1280
           sehr wichtig         Mehrfachnennung
                   4660                       0
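The same idea can also be written as a small helper that works on a vector of labels. This is a minimal sketch and not the code from the post: the function name `fill_blank_labels` and the toy data frame `toy` are hypothetical, and the helper treats empty strings as well as whitespace-only strings as blank.

```r
# Hypothetical helper (not from the original post): replace blank labels
# with their position, so that every label becomes distinct.
fill_blank_labels <- function(lvl) {
  blank <- !nzchar(trimws(lvl))   # TRUE for "" and whitespace-only labels
  lvl[blank] <- as.character(seq_along(lvl))[blank]
  lvl
}

# Labels copied from levels(Data$J203_A) in the thread
lvl <- c("überhaupt nicht wichtig", " ", " ", " ", " ", " ",
         "sehr wichtig", "Mehrfachnennung")
fixed <- fill_blank_labels(lvl)
print(fixed)

# Toy data frame (an assumption, standing in for the SPSS import)
toy <- data.frame(
  id = 1:4,
  rating = factor(c(1, 2, 3, 3), levels = 1:3, labels = c("low", " ", "high"))
)

# Apply the helper to every factor column, leaving other columns untouched
toy[] <- lapply(toy, function(col) {
  if (is.factor(col)) levels(col) <- fill_blank_labels(levels(col))
  col
})
print(table(toy$rating))
```

One caveat of this sketch: if a variable already has a real label such as "2", the substituted position could collide with it, so it may be worth checking `anyDuplicated()` on the result before assigning the new levels.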