Someone supplied me with a small SPSS data file that caused a buffer
overflow and then a crash when I read it into R. It seems like a pretty
serious issue to me. Unfortunately I can't share the dataset in question,
and I have had a hard time reproducing the crash with a toy example. But
I found at least two issues that might be related.
The first one is that when the SPSS dataset has a 'string' variable
that is longer than 200 characters, read.spss generates a bunch of
warnings and then creates additional variables in the dataset. E.g.:
library(foreign)
x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/longstring.sav")
str(x)
The second problem is that the SPSS data format allows 'duplicate
labels' to be specified, whereas these are not allowed for factors.
read.spss does not deal with this and creates a bad factor:
x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/duplicate_labels.sav",
               use.value.labels=TRUE)
levels(x$opinion)
which causes issues downstream. I am not sure if this is an issue in
read.spss() or as.factor(), but I guess it might be wise to try to
detect duplicate levels and assign them all the same integer code
when converting to a factor.
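
A rough sketch of the merging idea, using made-up data rather than the
file above. The named vector value_labels (names are labels, values are
SPSS codes) and the codes vector are assumptions for illustration, and
this is not what read.spss currently does:

```r
# Hypothetical value labels: codes 1 and 3 share the label "Agree".
value_labels <- c("Agree" = 1, "Neutral" = 2, "Agree" = 3, "Disagree" = 4)
# Hypothetical raw codes as they might appear in the data.
codes <- c(1, 3, 2, 4, 3)

# Translate each code to its label, then build the factor from the
# unique labels only, so both "Agree" codes collapse into one level.
labels_for_codes <- names(value_labels)[match(codes, value_labels)]
opinion <- factor(labels_for_codes, levels = unique(names(value_labels)))

levels(opinion)      # "Agree" "Neutral" "Disagree"
as.integer(opinion)  # 1 1 2 3 1
```

Note that this collapses codes 1 and 3 into a single level, so the
original distinction between them is lost.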
Thank you,
Jeroen

On Wed, Feb 15, 2012 at 7:05 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
> The second problem is that the spss dataformat allows to specify
> 'duplicate labels', whereas this is not allowed for factors. read.spss
> does not deal with this and creates a bad factor
>
> x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/duplicate_labels.sav",
> use.value.labels=T);
> levels(x$opinion);
>
> which causes issues downstream. I am not sure if this is an issue in
> read.spss() or as.factor(), but I guess it might be wise to try to
> detect duplicate levels and assign them all with one and the same
> integer value when converting to a factor.

I think this one would be better dealt with by giving an error. SPSS
value labels are just labels, so they don't map very well onto R
factors, which are enumerated types. Rather than force them and lose
data, I would prefer to make the user decide what to do.

-thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
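
For comparison, a minimal sketch of the error-based approach described
in the reply above. The helper check_value_labels and its arguments are
hypothetical and not part of the foreign package. It assumes the value
labels are available as a named vector whose names are the labels:

```r
# Hypothetical helper: stop if any value label is attached to more
# than one code, instead of silently building a factor.
check_value_labels <- function(value_labels, varname = "variable") {
  dup <- unique(names(value_labels)[duplicated(names(value_labels))])
  if (length(dup) > 0) {
    stop("duplicate value labels in '", varname, "': ",
         paste(dup, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Made-up labels in which "Agree" is used for two different codes.
value_labels <- c("Agree" = 1, "Neutral" = 2, "Agree" = 3)
result <- tryCatch(check_value_labels(value_labels, "opinion"),
                   error = function(e) conditionMessage(e))
print(result)
```

With a check like this the user sees which labels collide and can decide
whether to merge them, relabel them, or keep the raw codes.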