Paul Johnson
2004-Apr-28 06:16 UTC
[R] Possible bug in foreign library import of Stata datasets
Concerning this article:

Christopher Zorn, "Generalized Estimating Equation Models for Correlated Data: A Review with Applications." 2001. American Journal of Political Science 45(April):470-90.

The author very kindly provides data for replication on his web page: http://www.emory.edu/POLS/zorn/Data/GEE.zip.

I've been comparing Professor Zorn's results, obtained with Stata, against results from R. I ran into some trouble with the results in Table 2. I traced the problem back to the R foreign library's data import. Observe the variable "deml" in the Stata output:

table deml

----------------------
Lower of  |
two       |
POLITY    |
democracy |
s         |      Freq.
----------+-----------
   -10.00 |        826
    -9.00 |      3,829
    -8.00 |      2,161
    -7.00 |      6,847
    -6.00 |        541
    -5.00 |        451
    -4.00 |        152
    -3.00 |        306
    -2.00 |        145
    -1.00 |        252
     0.00 |         94
     1.00 |        103
     2.00 |        169
     3.00 |        108
     4.00 |        404
     5.00 |        634
     6.00 |        154
     7.00 |        281
     8.00 |        923
     9.00 |        258
    10.00 |      2,352
----------------------

The negative-valued observations get mixed up in R:

> library(foreign)
> dat2 <- read.dta("table2.dta")
> table(deml)
deml
   0    1    2    3    4    5    6    7    8    9   10  246  247
  94  103  169  108  404  634  154  281  923  258 2352  826 3829
 248  249  250  251  252  253  254  255
2161 6847  541  451  152  306  145  252

read.dta has translated the negative values to (256 + deml); for example, -10 comes through as 246.

Is this the kind of thing that is a bug, or have I missed something in the documentation about the handling of negative numbers? Should a formal bug report be filed?

-- 
Paul E. Johnson                       email: pauljohn at ku.edu
Dept. of Political Science            http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas                  Office: (785) 864-9086
Lawrence, Kansas 66044-3177           FAX: (785) 864-5700
Peter Dalgaard
2004-Apr-28 08:04 UTC
[R] Possible bug in foreign library import of Stata datasets
Paul Johnson <pauljohn at ku.edu> writes:

> [...]
> The negative valued observations get mixed up in R:
>
> > library(foreign)
> > dat2 <- read.dta("table2.dta")
> > table(deml)
> deml
>    0    1    2    3    4    5    6    7    8    9   10  246  247
>   94  103  169  108  404  634  154  281  923  258 2352  826 3829
>  248  249  250  251  252  253  254  255
> 2161 6847  541  451  152  306  145  252
>
> The read.dta has translated the negative values as (256 + deml).
>
> Is this the kind of thing that is a bug, or have I missed something in
> the documentation about the handling of negative numbers? Should a
> formal bug report be filed?

Looks like a classic signed/unsigned confusion: negative numbers stored in two's-complement format in single bytes are being interpreted as unsigned. (In two's complement, -10 in one byte is 256 - 10 = 246, which matches the table.)

A bug report could be a good idea if the resident Stata expert (Thomas, I believe) is unavailable just now.

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
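A minimal sketch of the signed/unsigned mix-up, using base R's readBin on a raw vector. The four byte values are chosen purely for illustration (they are not read from table2.dta), and this is not the code path inside read.dta; it only shows how the same bytes yield different integers depending on the signed argument.

```r
# Bytes as they would be stored for the values -10, -1, 0 and 10
# when held as signed 1-byte integers (two's complement).
# These four values are illustrative assumptions, not data from table2.dta.
bytes <- as.raw(c(246, 255, 0, 10))

# Interpreting each byte as unsigned gives values in 0..255.
as_unsigned <- readBin(bytes, what = "integer", n = length(bytes),
                       size = 1, signed = FALSE)

# Interpreting each byte as signed recovers the negative values.
as_signed <- readBin(bytes, what = "integer", n = length(bytes),
                     size = 1, signed = TRUE)

print(as_unsigned)  # 246 255 0 10
print(as_signed)    # -10 -1 0 10
```

The unsigned reading reproduces the kind of values (246 to 255) that showed up in the R table, while the signed reading gives back the values Stata reports.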
Ted Harding
2004-Apr-28 08:14 UTC
[R] Possible bug in foreign library import of Stata datasets
On 28-Apr-04 Paul Johnson wrote:

> The negative valued observations get mixed up in R:
>
> > library(foreign)
> > dat2 <- read.dta("table2.dta")
> > table(deml)
> deml
>    0    1    2    3    4    5    6    7    8    9   10  246  247
>   94  103  169  108  404  634  154  281  923  258 2352  826 3829
>  248  249  250  251  252  253  254  255
> 2161 6847  541  451  152  306  145  252
>
> The read.dta has translated the negative values as (256 + deml).
>
> Is this the kind of thing that is a bug, or have I missed something in
> the documentation about the handling of negative numbers? Should a
> formal bug report be filed?

This observation suggests a fairly clear diagnostic: the original negative numbers (tabulated as "-10.00" etc.) are stored as what C would call "signed char", where a byte value N means N for N = 0 to 127 and N - 256 for N = 128 to 255, but they are being interpreted as unsigned integers in the range 0 to 255. An unusual though feasible type.

The question is where this is occurring. The Stata tabulation represents them as apparent reals, but the storage in the .dta file may be 1-byte for economy of space.

If so, then whether or not this is a bug in read.dta may depend on whether the .dta file includes a "flag" for such 1-byte data indicating that they really are intended to represent signed values (and possibly on whether there is a further flag for real versus integer types). If not, then signed 1-byte data will not be distinguishable from unsigned 1-byte integers, and read.dta can hardly be blamed for getting the wrong impression.

Since I'm not familiar with Stata data file formats, I can't comment further!

Ted.
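The rule stated in this message (N for 0 to 127, N - 256 for 128 to 255) can be checked against the two tables from the original post. The sketch below is only an illustration: to_signed_byte is a hypothetical helper name, not a function in the foreign package, and the counts are typed in by hand from the posted output.

```r
# Values and counts as printed by table(deml) in R in the original post.
r_values <- c(0:10, 246:255)
r_counts <- c(94, 103, 169, 108, 404, 634, 154, 281, 923, 258, 2352,
              826, 3829, 2161, 6847, 541, 451, 152, 306, 145, 252)

# Hypothetical helper: map an unsigned byte value (0..255) to the
# signed value it represents in two's complement (-128..127).
to_signed_byte <- function(n) ifelse(n >= 128, n - 256, n)

# Recode the values, then sort the counts by the recoded value.
recoded <- to_signed_byte(r_values)
fixed <- r_counts[order(recoded)]
names(fixed) <- sort(recoded)

# Frequencies from the Stata table, for deml = -10, -9, ..., 10.
stata_counts <- c(826, 3829, 2161, 6847, 541, 451, 152, 306, 145, 252,
                  94, 103, 169, 108, 404, 634, 154, 281, 923, 258, 2352)

print(all(fixed == stata_counts))  # TRUE
```

The same helper could be applied to an affected column as a stopgap, for example dat2$deml <- to_signed_byte(dat2$deml), assuming the column really holds misread 1-byte values and no legitimate values of 128 or more.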
Thomas Lumley
2004-Apr-28 15:38 UTC
[R] Possible bug in foreign library import of Stata datasets
On Wed, 28 Apr 2004, Paul Johnson wrote:

> The read.dta has translated the negative values as (256 + deml).
>
> Is this the kind of thing that is a bug, or have I missed something in
> the documentation about the handling of negative numbers? Should a
> formal bug report be filed?

A fixed version of the foreign package has been sent to CRAN.

	-thomas