Greetings Everybody:

I generated a 1.2MB dta file based on the general social survey with
Stata8 for linux. The file can be re-opened with Stata, but when I bring
it into R, it says all the values are missing for most of the variables.

This dataset is called "morgen.dta" and I dropped a copy online in case
you are interested

http://www.ku.edu/~pauljohn/R/morgen.dta

looks like this to R (I tried various options on the read.dta command):

> myDat <- read.dta("morgen.dta")
> summary(myDat)
     CASEID              year            id          hrs1          hrs2
 Min.   :   19721   Min.   :1972   Min.   :   1   NAP :    0   NAP :    0
 1st Qu.: 1983475   1st Qu.:1978   1st Qu.: 445   DK  :    0   DK  :    0
 Median : 1996808   Median :1987   Median : 905   NA  :    0   NA  :    0
 Mean   : 9963040   Mean   :1986   Mean   : 990   NA's:40933   NA's:40933
 3rd Qu.:19872187   3rd Qu.:1994   3rd Qu.:1358
 Max.   :20002817   Max.   :2000   Max.   :3247

      prestige        agewed         age           educ         paeduc
 DK,NA,NAP:    0   NAP :    0   DK  :    0   NAP :    0   NAP :    0
 NA's     :40933   DK  :    0   NA  :    0   DK  :    0   DK  :    0
                   NA  :    0   NA's:40933   NA  :    0   NA  :    0
                   NA's:40933                NA's:40933   NA's:40933

   maeduc        speduc                income
 NAP :    0   NAP :    0   $25000 OR MORE:14525
 DK  :    0   DK  :    0   $10000 - 14999: 5022
 NA  :    0   NA  :    0   $15000 - 19999: 3869
 NA's:40933   NA's:40933   $20000 - 24999: 3664
                           REFUSED       : 1877
                           (Other)       : 8523
                           NA's          : 3453
>

Here's what Stata sees when I load the same thing:

summarize, detail

                 Case identification number
-------------------------------------------------------------
      Percentiles      Smallest
 1%       197432          19721
 5%       199649          19722
10%      1974116          19723       Obs               40933
25%      1983475          19724       Sum of Wgt.       40933

50%      1996808                      Mean            9963040
                        Largest       Std. Dev.       9006352
75%     1.99e+07       2.00e+07
90%     2.00e+07       2.00e+07       Variance       8.11e+13
95%     2.00e+07       2.00e+07       Skewness         .18931
99%     2.00e+07       2.00e+07       Kurtosis       1.045409

                GSS YEAR FOR THIS RESPONDENT
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1972           1972
 5%         1973           1972
10%         1974           1972       Obs               40933
25%         1978           1972       Sum of Wgt.       40933

50%         1987                      Mean           1986.421
                        Largest       Std. Dev.       8.61136
75%         1994           2000
90%         1998           2000       Variance       74.15552
95%         2000           2000       Skewness      -.0789223
99%         2000           2000       Kurtosis       1.799939

                    RESPONDENT ID NUMBER
-------------------------------------------------------------
      Percentiles      Smallest
 1%           18              1
 5%           89              1
10%          178              1       Obs               40933
25%          445              1       Sum of Wgt.       40933

50%          905                      Mean           989.9129
                        Largest       Std. Dev.      689.0596
75%         1358           3244
90%         2027           3245       Variance       474803.2
95%         2437           3246       Skewness       .8359211
99%         2867           3247       Kurtosis       3.311248

              NUMBER OF HOURS WORKED LAST WEEK
-------------------------------------------------------------
      Percentiles      Smallest
 1%            6              0
 5%           15              0
10%           21              0       Obs               23279
25%           37              0       Sum of Wgt.       23279

50%           40                      Mean           41.05206
                        Largest       Std. Dev.      13.95931
75%           48             89
90%           60             89       Variance       194.8624
95%           65             89       Skewness        .195045
99%           82             89       Kurtosis       4.448998

             NUMBER OF HOURS USUALLY WORK A WEEK
-------------------------------------------------------------
      Percentiles      Smallest
 1%            4              0
 5%           15              0
10%           20              1       Obs                 774
25%           38              2       Sum of Wgt.         774

50%           40                      Mean           39.79199
                        Largest       Std. Dev.      13.43383
75%           45             89
90%           55             89       Variance       180.4677
95%           60             89       Skewness      -.0002332
99%           80             89       Kurtosis       5.009869

            RS OCCUPATIONAL PRESTIGE SCORE (1970)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           14             12
 5%           17             12
10%           20             12       Obs               24267
25%           30             12       Sum of Wgt.       24267

50%           39                      Mean           39.35645
                        Largest       Std. Dev.      14.03712
75%           48             82
90%           60             82       Variance       197.0407
95%           62             82       Skewness       .2927414
99%           76             82       Kurtosis       2.775553

                   AGE WHEN FIRST MARRIED
-------------------------------------------------------------
      Percentiles      Smallest
 1%           15             12
 5%           17             12
10%           17             12       Obs               25382
25%           19             12       Sum of Wgt.       25382

50%           21                      Mean           22.09609
                        Largest       Std. Dev.      4.813944
75%           24             63
90%           28             68       Variance       23.17405
95%           31             73       Skewness       2.002265
99%           39             73       Kurtosis       11.28279

                      AGE OF RESPONDENT
-------------------------------------------------------------
      Percentiles      Smallest
 1%           19             18
 5%           21             18
10%           24             18       Obs               40790
25%           30             18       Sum of Wgt.       40790

50%           42                      Mean           45.14798
                        Largest       Std. Dev.      17.53519
75%           58             89
90%           71             89       Variance       307.4828
95%           77             89       Skewness       .4774907
99%           86             89       Kurtosis       2.239618

              HIGHEST YEAR OF SCHOOL COMPLETED
-------------------------------------------------------------
      Percentiles      Smallest
 1%            3              0
 5%            7              0
10%            8              0       Obs               40806
25%           11              0       Sum of Wgt.       40806

50%           12                      Mean           12.48152
                        Largest       Std. Dev.      3.176226
75%           14             20
90%           16             20       Variance       10.08841
95%           18             20       Skewness      -.3389303
99%           20             20       Kurtosis       3.960311

            HIGHEST YEAR SCHOOL COMPLETED, FATHER
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            3              0
10%            4              0       Obs               29347
25%            8              0       Sum of Wgt.       29347

50%           11                      Mean           10.20994
                        Largest       Std. Dev.      4.342143
75%           12             20
90%           16             20       Variance       18.85421
95%           17             20       Skewness      -.1628909
99%           20             20       Kurtosis       2.826482

            HIGHEST YEAR SCHOOL COMPLETED, MOTHER
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            3              0
10%            6              0       Obs               34151
25%            8              0       Sum of Wgt.       34151

50%           12                      Mean           10.41478
                        Largest       Std. Dev.      3.709352
75%           12             20
90%           14             20       Variance       13.75929
95%           16             20       Skewness      -.6324499
99%           18             20       Kurtosis       3.605715

            HIGHEST YEAR SCHOOL COMPLETED, SPOUSE
-------------------------------------------------------------
      Percentiles      Smallest
 1%            4              0
 5%            7              0
10%            8              0       Obs               22780
25%           12              0       Sum of Wgt.       22780

50%           12                      Mean           12.53095
                        Largest       Std. Dev.      3.103418
75%           14             20
90%           16             20       Variance       9.631203
95%           18             20       Skewness       -.287755
99%           20             20       Kurtosis       4.051822

                     TOTAL FAMILY INCOME
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            3              1
10%            5              1       Obs               37480
25%            9              1       Sum of Wgt.       37480

50%           11                      Mean            9.75619
                        Largest       Std. Dev.      2.994967
75%           12             13
90%           12             13       Variance       8.969825
95%           13             13       Skewness       -1.29205
99%           13             13       Kurtosis       3.759778

.

-- 
Paul E. Johnson                       email: pauljohn at ku.edu
Dept. of Political Science            http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas                  Office: (785) 864-9086
Lawrence, Kansas 66044-3177           FAX: (785) 864-5700

On Tue, 21 Sep 2004, Paul Johnson wrote:

> Greetings Everybody:
>
> I generated a 1.2MB dta file based on the general social survey with Stata8
> for linux. The file can be re-opened with Stata, but when I bring it into R,
> it says all the values are missing for most of the variables.

You need read.dta( ,convert.factors=FALSE)

You have variables with labels for some, but not all, of their values.
When these are converted to R factors you lose the unlabelled values. R
does not have a data type that is sometimes labelled and sometimes numeric.

When you use convert.factors=FALSE the label information is still read in
and returned as an attribute of the data frame, so you can set individual
variables to be factors.

	-thomas

> This dataset is called "morgen.dta" and I dropped a copy online in case you
> are interested
>
> http://www.ku.edu/~pauljohn/R/morgen.dta
>
[snip]

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
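
The mechanism described in this reply can be reproduced in isolation with a few lines of base R. This is only a sketch: it does not call read.dta, and the vector hrs, the codes -1 and 98 and their labels are made up for illustration. The one assumption is that a labelled variable gets turned into a factor whose levels are just the labelled codes.

```r
# Made-up stand-in for a variable like hrs1: ordinary numeric values plus
# two special codes. Only the special codes carry value labels.
hrs <- c(40, 37, 98, 48, -1)

# Hypothetical label table: code -> label (not taken from morgen.dta).
codes <- c(-1, 98)
labs  <- c("NAP", "DK")

# Build a factor from the labelled codes alone. Every value that is not
# one of the levels has no level to map to and becomes missing.
hrs_factor <- factor(hrs, levels = codes, labels = labs)

print(hrs_factor)
print(sum(is.na(hrs_factor)))  # number of values lost in the conversion
```

The unlabelled hours (40, 37 and 48 here) are the ones that turn into missing values, which is the same pattern as a variable whose real measurements all show up under NA's in summary().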

I've had a similar problem once. What may have caused the problem then
was a variate for which value labels had been defined only for the
highest and lowest values. What complicates things is that the file had
originally been converted from SPSS to Stata. A workaround was to set
convert.factors=FALSE, and that seems to work here too (using R 1.91 and
the latest update for foreign):

> m2 <- read.dta("morgen.dta", convert.factors=FALSE)
> summary(m2)
     CASEID              year            id            hrs1
 Min.   :   19721   Min.   :1972   Min.   :   1   Min.   :    0.00
 1st Qu.: 1983475   1st Qu.:1978   1st Qu.: 445   1st Qu.:   37.00
 Median : 1996808   Median :1987   Median : 905   Median :   40.00
 Mean   : 9963040   Mean   :1986   Mean   : 990   Mean   :   41.05
 3rd Qu.:19872187   3rd Qu.:1994   3rd Qu.:1358   3rd Qu.:   48.00
 Max.   :20002817   Max.   :2000   Max.   :3247   Max.   :   89.00
                                                  NA's   :17654.00

      hrs2               prestige             agewed               age
 Min.   :    0.00   Min.   :   12.00   Min.   :   12.00   Min.   : 18.00
 1st Qu.:   38.00   1st Qu.:   30.00   1st Qu.:   19.00   1st Qu.: 30.00
 Median :   40.00   Median :   39.00   Median :   21.00   Median : 42.00
 Mean   :   39.79   Mean   :   39.36   Mean   :   22.10   Mean   : 45.15
 3rd Qu.:   45.00   3rd Qu.:   48.00   3rd Qu.:   24.00   3rd Qu.: 58.00
 Max.   :   89.00   Max.   :   82.00   Max.   :   73.00   Max.   : 89.00
 NA's   :40159.00   NA's   :16666.00   NA's   :15551.00   NA's   :143.00

      educ              paeduc              maeduc             speduc
 Min.   :  0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.00
 1st Qu.: 11.00   1st Qu.:    8.00   1st Qu.:   8.00   1st Qu.:   12.00
 Median : 12.00   Median :   11.00   Median :  12.00   Median :   12.00
 Mean   : 12.48   Mean   :   10.21   Mean   :  10.41   Mean   :   12.53
 3rd Qu.: 14.00   3rd Qu.:   12.00   3rd Qu.:  12.00   3rd Qu.:   14.00
 Max.   : 20.00   Max.   :   20.00   Max.   :  20.00   Max.   :   20.00
 NA's   :127.00   NA's   :11586.00   NA's   :6782.00   NA's   :18153.00

     income
 Min.   :   1.000
 1st Qu.:   9.000
 Median :  11.000
 Mean   :   9.756
 3rd Qu.:  12.000
 Max.   :  13.000
 NA's   :3453.000
>

--- Paul Johnson <pauljohn at ku.edu> wrote:

> Greetings Everybody:
>
> I generated a 1.2MB dta file based on the general social survey
> with Stata8 for linux. The file can be re-opened with Stata, but
> when I bring it into R, it says all the values are missing for most
> of the variables.
>
> This dataset is called "morgen.dta" and I dropped a copy online in
> case you are interested
>
> http://www.ku.edu/~pauljohn/R/morgen.dta
>
[snip]
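
Once the data have been read with convert.factors=FALSE, a possible next step is to turn only the fully labelled variables (such as income) back into factors by hand. The sketch below is self-contained, so it builds a small stand-in data frame instead of reading morgen.dta. The values, the income codes, and the assumption that the labels sit in a "label.table" attribute as named vectors (label as name, code as value) are all illustrative and should be checked against attributes(m2) on the real data.

```r
# Stand-in for m2 <- read.dta("morgen.dta", convert.factors = FALSE).
# All values and codes below are invented for illustration.
m2 <- data.frame(hrs1 = c(40, 37, NA, 48), income = c(12, 9, 12, 10))

# Assumed layout of the label information: a list with one named vector per
# label set, where the names are the labels and the values are the codes.
attr(m2, "label.table") <- list(
  income = c("$10000 - 14999" = 9, "$15000 - 19999" = 10, "$25000 OR MORE" = 12)
)

# Convert one variable whose values are all labelled; hrs1 stays numeric.
income_labels <- attr(m2, "label.table")[["income"]]
m2$income_f <- factor(m2$income,
                      levels = income_labels,
                      labels = names(income_labels))

print(summary(m2$hrs1))
print(table(m2$income_f))
```

Variables like hrs1, where only a few special codes carry labels, are better left numeric, since converting them would again drop every unlabelled value.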